Enhance Stratification Key Handling with Missing Data and Type Flexibility

Description

The current stratification logic does not robustly handle cases where stratify_cols is None, is a DataFrame, or where the resulting stratification key contains missing values. This can cause failures when the features or labels are incomplete. The proposed enhancement adds explicit type checks and a step that fills missing values in the key.

Problem

Handling of stratify_cols: The current code assumes that stratify_cols is either a list of column names or a truthy value. It does not check whether stratify_cols is None or a DataFrame, so either input can cause errors.
Missing Data in Stratification Keys: The existing implementation does not account for missing values in the stratification key. Missing values in the features or labels used for the key can lead to errors or incorrect stratification results.

Proposed Solution

Type Checking and Flexibility:
- Implement explicit checks to verify if stratify_cols is None or a DataFrame.
- If stratify_cols is a DataFrame, directly concatenate it with the target variable (y or y_valid_test).
- If stratify_cols is not a DataFrame, treat it as a list of column names to be selected from X or X_valid_test.
Handling Missing Data:
- After creating the stratification key (stratify_key or strat_key_val_test), check if it is not None.
- Make a copy of the stratification key and fill any missing values with an empty string (''). This ensures that the stratification process works even with incomplete data, preventing potential errors.
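
The steps above can be sketched as a single helper. This is a minimal illustration, not the code in model_tuner: the function name build_stratify_key is hypothetical, returning None when stratify_cols is None is an assumption (the issue does not specify that branch), and y is assumed to be a pandas Series or DataFrame that shares its index with X.

```python
import pandas as pd


def build_stratify_key(X, y, stratify_cols=None):
    """Hypothetical helper: build a stratification key with no missing values."""
    # Assumption: no stratification columns means no key at all.
    if stratify_cols is None:
        return None

    if isinstance(stratify_cols, pd.DataFrame):
        # A DataFrame is used as-is and joined with the target.
        key = pd.concat([stratify_cols, y], axis=1)
    else:
        # Otherwise treat the value as column names to select from X.
        if isinstance(stratify_cols, str):
            stratify_cols = [stratify_cols]
        key = pd.concat([X[list(stratify_cols)], y], axis=1)

    # fillna returns a new object, so the inputs are left untouched.
    return key.fillna("")


X = pd.DataFrame({"sex": ["F", None, "M", "F"], "age": [30, 41, 25, 37]})
y = pd.Series([0, 1, 0, 1], name="label")

key = build_stratify_key(X, y, ["sex"])
print(key)
```

With this fill value, rows whose key was missing end up in their own stratum (the empty string) instead of being dropped. A key column with a categorical dtype may raise on fillna('') unless '' is already one of its categories; the sketch does not handle that case.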

Code Changes

Introduce the following block after creating the stratification key:

if stratify_key is not None:
    stratify_key = stratify_key.copy()
    stratify_key = stratify_key.fillna('')

Similarly, for the validation/testing dataset:

if strat_key_val_test is not None:
    strat_key_val_test = strat_key_val_test.copy()
    strat_key_val_test = strat_key_val_test.fillna('')
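
One note on the two blocks above: fillna without inplace=True returns a new object, so the explicit copy() acts as a defensive step, not a requirement. A small check, using made-up data, illustrates that the original frame keeps its missing value:

```python
import pandas as pd

# Illustrative data only; column names are not taken from model_tuner.
original = pd.DataFrame({"site": ["A", None, "B"], "label": [0, 1, 0]})

# No copy() beforehand: fillna still returns a separate object.
filled = original.fillna("")

missing_before = int(original["site"].isna().sum())
missing_after = int(filled["site"].isna().sum())
print(missing_before, missing_after)
```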

Modify the logic for creating the stratification keys to include checks for None and DataFrame types, as well as handling missing data.

This enhancement makes the stratification step more robust when data completeness cannot be guaranteed: stratify_cols can be passed as None, a list of column names, or a DataFrame, and missing values in the key no longer break the split.
