Closed lshpaner closed 1 month ago
This seems to work as a fix, but it will cause problems with imputation later under our current implementation. Imputation currently treats NaN values as missing by default; if we replace those values with '', they will no longer be imputed, because an empty string is not treated as NaN in Python.
A suggested fix: if we are both imputing and stratifying by columns, change the missing-value marker in the SimpleImputer:
SimpleImputer(missing_values='')
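To illustrate the concern, here is a minimal sketch, assuming a categorical column and strategy="most_frequent" (the strategy and the toy data are my assumptions, not taken from the class). It shows that the default imputer leaves '' untouched, while missing_values='' imputes it:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy categorical column where the missing entry was already replaced by ''.
X = np.array([["a"], [""], ["a"], ["b"]], dtype=object)

# Default behaviour: only NaN counts as missing, so '' passes through unchanged.
default_imputer = SimpleImputer(strategy="most_frequent")
default_out = default_imputer.fit_transform(X)

# Suggested fix: tell the imputer that '' is the missing-value marker.
empty_aware_imputer = SimpleImputer(missing_values="", strategy="most_frequent")
fixed_out = empty_aware_imputer.fit_transform(X)

print(default_out.ravel().tolist())
print(fixed_out.ravel().tolist())
```

Note that once missing_values is set to '', any remaining NaN values are no longer treated as missing by that imputer, so the two markers should not be mixed in the same columns.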
Another fix would be to remove our current imputation handling and move entirely to custom pipeline steps, so that the user implements this code outside the class.
Description
The current implementation of the stratification process in the code does not robustly handle scenarios where stratify_cols may be None, a DataFrame, or contain missing values. This can lead to issues, especially in feature spaces with incomplete data. The proposed enhancement aims to address these shortcomings by introducing additional checks and processing steps.
Problem
Handling of stratify_cols: The current code assumes that stratify_cols is either a list of column names or a truthy value, without considering that it might be None or a DataFrame. This lack of flexibility can cause errors when stratify_cols is a DataFrame or None.
Missing Data in Stratification Keys: The existing implementation does not account for missing values in the stratification key, which can lead to errors or incorrect stratification results when there are missing data points in the features or labels.
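As a small illustration of the first problem, assume the existing check is a plain truthiness test along the lines of `if stratify_cols:` (the actual code is not shown in this issue, so this is a guess at its shape). pandas refuses to evaluate a DataFrame as a boolean, which is one way such a check can fail:

```python
import pandas as pd

# A DataFrame passed where a list of column names was expected.
stratify_cols = pd.DataFrame({"site": ["a", "b", "a"]})

try:
    # Hypothetical truthiness check; pandas raises because the truth
    # value of a DataFrame is ambiguous.
    if stratify_cols:
        pass
    outcome = "no error"
except ValueError:
    outcome = "ValueError"

# An explicit None check works for None, lists, and DataFrames alike.
should_stratify = stratify_cols is not None

print(outcome, should_stratify)
```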
Proposed Solution
Type Checking and Flexibility:
Check whether stratify_cols is None or a DataFrame.
If stratify_cols is a DataFrame, directly concatenate it with the target variable (y or y_valid_test).
If stratify_cols is not a DataFrame, treat it as a list of column names to be selected from X or X_valid_test.
Handling Missing Data:
After creating the stratification key (stratify_key or strat_key_val_test), check if it is not None.
If it is not None, fill any missing values with empty strings (''). This ensures that the stratification process works even with incomplete data, preventing potential errors.
Code Changes
Introduce the following block after creating the stratification key:
Similarly, for the validation/testing dataset:
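A minimal sketch of the step described above, not the original snippets: the helper name build_stratify_key and the toy data are hypothetical, and it assumes that X and y are pandas objects sharing an index.

```python
import pandas as pd


def build_stratify_key(X, y, stratify_cols=None):
    """Hypothetical helper: build a stratification key, or return None."""
    if stratify_cols is None:
        return None
    if isinstance(stratify_cols, pd.DataFrame):
        # A DataFrame is concatenated with the target directly.
        key = pd.concat([stratify_cols, y], axis=1)
    else:
        # Otherwise treat it as a list of column names to select from X.
        key = pd.concat([X[list(stratify_cols)], y], axis=1)
    # Replace missing values so the key contains no NaN.
    return key.fillna("")


X = pd.DataFrame({"site": ["a", None, "b"], "age": [30, 41, 52]})
y = pd.Series([0, 1, 0], name="target")

stratify_key = build_stratify_key(X, y, ["site"])
key_from_frame = build_stratify_key(X, y, X[["site"]])
no_key = build_stratify_key(X, y, None)
print(stratify_key)
```

The same helper could be called with X_valid_test and y_valid_test to produce strat_key_val_test.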
Modify the logic for creating the stratification keys to include checks for None and DataFrame types, as well as handling missing data.
This enhancement will make the code more robust, flexible, and reliable, especially in scenarios where data completeness cannot be guaranteed. By ensuring proper handling of different types and missing data, the code will be better suited for real-world applications where these challenges are common.