koaning / scikit-lego

Extra blocks for scikit-learn pipelines.
https://koaning.github.io/scikit-lego/
MIT License

[FEATURE] Ability to stratify on columns that contain some NaN values, so that people can hyperparameter-tune the best imputation method #681

Open dec1costello opened 2 months ago

dec1costello commented 2 months ago

Hello!

For more context, here's my attempt to stratify on columns that contain some NaNs. I am a beginner, so I'm open to better ideas, or to comments if this feature request is out of scope. Thanks in advance! I appreciate everyone's contributions to this package!

Stratification attempt:

from sklearn.model_selection import train_test_split

# Features and target (result_df and feature_cols are assumed to be defined already)
X = result_df[feature_cols]
y = result_df['strokes_to_hole_out']

# Extract the columns used for stratification
stratify_cols = ['from_location_scorer', 'from_location_laser']
stratify_data = result_df[stratify_cols]

# Split the data, using 'stratify_data' for stratification
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=stratify_data
)

Error I receive during training: Trial failed with exception: Found unknown categories ['blue'] in column 9 during transform
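One possible workaround with plain pandas and scikit-learn, sketched under some assumptions: build a separate stratification key in which NaN is treated as its own category, and pass that key to `train_test_split`, so the features themselves keep their NaNs for later imputation. The toy `result_df`, the `distance` column, the `"missing"` placeholder, and the `strat_key` name are all made up for illustration; only the column names `from_location_scorer`, `from_location_laser`, and `strokes_to_hole_out` come from the snippet above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for result_df (values are invented for illustration only).
n = 200
result_df = pd.DataFrame({
    "from_location_scorer": ["green", "fairway", np.nan, "rough"] * (n // 4),
    "from_location_laser": ["green", np.nan] * (n // 2),
    "distance": np.linspace(1.0, 300.0, n),
    "strokes_to_hole_out": np.arange(n) % 5 + 1,
})

feature_cols = ["from_location_scorer", "from_location_laser", "distance"]
stratify_cols = ["from_location_scorer", "from_location_laser"]

X = result_df[feature_cols]
y = result_df["strokes_to_hole_out"]

# Build a single stratification key per row. NaN is replaced by a
# placeholder label ONLY in this key, so X still contains the NaNs.
strat_key = (
    result_df[stratify_cols]
    .fillna("missing")          # hypothetical placeholder label
    .astype(str)
    .agg("|".join, axis=1)      # e.g. "green|green", "fairway|missing"
)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=strat_key
)

# Each combined category should appear in both splits in similar proportion.
print(strat_key.loc[X_valid.index].value_counts())
```

Note that stratification needs at least two rows per combined category, so very rare combinations may have to be merged first. As for the "Found unknown categories" message, it looks like the kind of error an encoder such as `OneHotEncoder` raises when it meets a category at transform time that it did not see during fit; if that is the cause here, `handle_unknown="ignore"` on the encoder may be worth trying.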

FBruzzesi commented 2 months ago

Hey @dec1costello, thanks for the feature request. I have a few questions: