uclamii / model_tuner

A library to tune the hyperparameters of common ML models. Supports calibration and custom pipelines.
Apache License 2.0

Enhance Stratification Key Handling with Missing Data and Type Flexibility #30

Closed lshpaner closed 1 month ago

lshpaner commented 1 month ago

Description

The current stratification logic does not robustly handle cases where stratify_cols is None, is a DataFrame, or contains missing values. This can cause failures, especially in feature spaces with incomplete data. The proposed enhancement addresses these shortcomings with additional checks and processing steps.

Problem

  1. Handling of stratify_cols: The current code assumes that stratify_cols is either a list of column names or a truthy value without considering if it might be None or a DataFrame. This lack of flexibility can cause errors when stratify_cols is a DataFrame or None.

  2. Missing Data in Stratification Keys: The existing implementation does not account for missing values in the stratification key, which can lead to errors or incorrect stratification results when there are missing data points in the features or labels.

Proposed Solution

  1. Type Checking and Flexibility:

    • Implement explicit checks to verify if stratify_cols is None or a DataFrame.
    • If stratify_cols is a DataFrame, directly concatenate it with the target variable (y or y_valid_test).
    • If stratify_cols is not a DataFrame, treat it as a list of column names to be selected from X or X_valid_test.
  2. Handling Missing Data:

    • After creating the stratification key (stratify_key or strat_key_val_test), check if it is not None.
    • Make a copy of the stratification key and fill any missing values with an empty string (''). This ensures that the stratification process works even with incomplete data, preventing potential errors.
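The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `build_stratify_key` is a hypothetical helper name (not part of model_tuner), and returning None when stratify_cols is None is an assumed behavior that the issue does not specify.

```python
import pandas as pd


def build_stratify_key(X, y, stratify_cols=None):
    """Hypothetical helper, not part of model_tuner's API."""
    if stratify_cols is None:
        # Assumption: without stratification columns there is no key to build.
        return None
    if isinstance(stratify_cols, pd.DataFrame):
        # A DataFrame is used as-is and concatenated with the target.
        key = pd.concat([stratify_cols, y], axis=1)
    else:
        # Anything else is treated as a list of column names in X.
        key = pd.concat([X[list(stratify_cols)], y], axis=1)
    # fillna returns a new object, so the caller's data is left untouched.
    return key.fillna("")


X = pd.DataFrame({"sex": ["F", None, "M", "F"], "age": [30, 41, 25, 52]})
y = pd.Series([0, 1, 0, 1], name="target")

key_from_names = build_stratify_key(X, y, ["sex"])
key_from_frame = build_stratify_key(X, y, X[["sex"]])
```

Both calls should produce the same key, with the missing `sex` value replaced by an empty string, while `X` itself keeps its original missing value.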

Code Changes

Introduce the following block after creating the stratification key:

if stratify_key is not None:
    stratify_key = stratify_key.copy()
    stratify_key = stratify_key.fillna('')

Similarly, for the validation/testing dataset:

if strat_key_val_test is not None:
    strat_key_val_test = strat_key_val_test.copy()
    strat_key_val_test = strat_key_val_test.fillna('')
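To show how a filled key might then be used, here is a small sketch with scikit-learn's `train_test_split`. The toy data, the column names, and the split sizes are made up for illustration; only the `fillna('')` step comes from the proposal above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (an assumption for this sketch): one grouping column with gaps.
X = pd.DataFrame(
    {"group": ["a"] * 4 + [None] * 4 + ["b"] * 4, "value": range(12)}
)
y = pd.Series([0] * 4 + [1] * 4 + [0] * 4, name="target")

# Build the key from the grouping column plus the target, then fill the gaps
# so that rows with a missing group form their own stratum.
stratify_key = pd.concat([X[["group"]], y], axis=1)
stratify_key = stratify_key.fillna("")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=stratify_key, random_state=0
)

# Look up which strata ended up in the test split.
test_groups = sorted(stratify_key.loc[X_test.index, "group"])
```

Because only the key is filled, the features passed to the split still contain their original missing values.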

Modify the logic for creating the stratification keys to include checks for None and DataFrame types, as well as handling missing data.

This enhancement will make the code more robust, flexible, and reliable, especially in scenarios where data completeness cannot be guaranteed. By ensuring proper handling of different types and missing data, the code will be better suited for real-world applications where these challenges are common.

elemets commented 1 month ago

This seems to work as a fix, but it will cause issues with imputation later under our current implementation. Imputation currently defaults to treating NaN values as missing. If we replace those values with '', they will no longer be imputed, because an empty string is not treated as NaN in Python.

A suggested fix: if we are both imputing and stratifying by columns, we should change the missing_values setting in the SimpleImputer.

SimpleImputer(missing_values='')

Another fix would be to remove our current imputation handling and move entirely to custom pipeline steps, so that the user would implement this code outside the class.
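A minimal sketch of the `SimpleImputer(missing_values='')` idea, for illustration only. The toy column is made up, and `strategy="most_frequent"` is an assumption added here: the default `"mean"` strategy only applies to numeric data, so a string column needs `"most_frequent"` or `"constant"`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# An empty string is not a missing value as far as pandas is concerned.
empty_is_missing = pd.isna("")

# Toy object-dtype column (an assumption) where '' marks a missing entry.
col = np.array([["a"], [""], ["a"], ["b"]], dtype=object)

# Tell the imputer to treat '' as the missing marker.
imputer = SimpleImputer(missing_values="", strategy="most_frequent")
imputed = imputer.fit_transform(col)
```

Here the single empty entry should be replaced by the most frequent value in the column.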