mbernico / snape

Snape is a convenient artificial dataset generator that wraps sklearn's make_classification and make_regression and then adds in 'realism' features such as complex formatting, varying scales, categorical variables, and missing values.
Apache License 2.0

shuffle categoricals before binning #13

Open mbernico opened 7 years ago

mbernico commented 7 years ago

For categoricals like [jan, feb, mar...dec] we should shuffle the labels before applying the binning. Because the labels are ordinal and are assigned in order to the bins of a Gaussian column, the resulting feature is 'too easy.'

tgsmith61591 commented 7 years ago

Maybe I misunderstand, but when binning a categorical, won't a LabelEncoder implicitly "shuffle" it, in the sense that the classes will be ordered alphabetically internally? Maybe I'm missing the use case.
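To make the alphabetical-ordering point concrete, here is a small sketch. scikit-learn's LabelEncoder keeps its classes in sorted order; the snippet mimics that with plain `sorted()` so it runs without scikit-learn, and the variable names are illustrative only.

```python
# Month labels in calendar order, as in the example label list.
months = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

# Alphabetical ordering, standing in for what a sorted label encoder would use.
alphabetical = sorted(months)
alpha_code = {label: code for code, label in enumerate(alphabetical)}

# Integer code assigned to each month, listed in calendar order.
alpha_codes = [alpha_code[m] for m in months]
print(alpha_codes)  # [4, 3, 7, 0, 8, 6, 5, 1, 11, 10, 9, 2]
```

An encoder that sorts labels would therefore not reproduce the calendar order, but a hand-written mapping such as 1=jan, 2=feb would.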


mbernico commented 7 years ago

I might not be thinking about this completely right either. I ran into a situation where some students numerically encoded a categorical column in a snape dataset (1=jan, 2=feb, ...), and the results were better than with something like one-hot encoding. I think that might be unrealistically easy because of how we do create_categoricals. All snape columns are normal distributions (probably another weakness), so it might be better to shuffle labels that happen to be ordinal, so that this strategy works less well?
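A minimal sketch of why the 1=jan, 2=feb encoding is so effective, assuming a single normally distributed column as a stand-in for a snape feature (the column, sample size, and variable names are made up for illustration):

```python
import numpy as np
import pandas as pd

months = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

# Stand-in for one numeric snape column (an assumption, not snape's own code).
rng = np.random.RandomState(0)
values = pd.Series(rng.normal(size=5000))

# Equal-width bins labelled in calendar order.
binned = pd.cut(values, bins=len(months), labels=months)

# The hand-written encoding: 1=jan, 2=feb, ..., 12=dec.
month_to_int = {m: i + 1 for i, m in enumerate(months)}
encoded = binned.astype(str).map(month_to_int)

# Walking through rows from smallest to largest hidden value, the integer
# code never decreases: the encoding recovers the rank structure of the column.
order = values.sort_values().index
codes_follow_values = encoded.loc[order].is_monotonic_increasing

# Pearson correlation between the hidden values and the integer codes.
corr = values.corr(encoded)
print(codes_follow_values, round(corr, 3))
```

With ordered labels the integer code is just a coarse version of the original number, which one-hot encoding throws away.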

Consider this label_list = [[jan, feb, mar, ...dec]]:

def create_categorical_features(df, label_list, random_state=None):
    # stuff happens that chooses a random numerical column called 'chosen_col' and then runs
    df[chosen_col] = pd.cut(df[chosen_col], bins=len(label_list[0]), labels=label_list[0])

What I'm thinking is that it might be more difficult/realistic to do:

def create_categorical_features(df, label_list, random_state=None):
    # stuff happens that chooses a random numerical column called 'chosen_col' and then runs
    shuffle(label_list[0])
    df[chosen_col] = pd.cut(df[chosen_col], bins=len(label_list[0]), labels=label_list[0])

Please tell me if I'm wrong though, as always!
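For reference, a runnable sketch of the shuffled variant. The helper name `bin_with_shuffled_labels` is hypothetical and not part of snape. It also makes two choices the pseudocode above does not: it shuffles a copy so the caller's label list is left untouched, and it draws the permutation from a seeded `RandomState` so that `random_state` gives reproducible output.

```python
import numpy as np
import pandas as pd


def bin_with_shuffled_labels(series, labels, random_state=None):
    """Bin a numeric series into equal-width bins whose labels are shuffled.

    Hypothetical helper for illustration; not snape's implementation.
    """
    rng = np.random.RandomState(random_state)
    order = rng.permutation(len(labels))
    shuffled = [labels[i] for i in order]  # new list; `labels` is not mutated
    return pd.cut(series, bins=len(shuffled), labels=shuffled)


months = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
values = pd.Series(np.random.RandomState(0).normal(size=5000))

binned = bin_with_shuffled_labels(values, months, random_state=42)

# Apply the same 1=jan, 2=feb encoding as before and check whether the
# codes still rise with the hidden values.
month_to_int = {m: i + 1 for i, m in enumerate(months)}
encoded = binned.astype(str).map(month_to_int)
still_ordered = encoded.loc[values.sort_values().index].is_monotonic_increasing
print(still_ordered)
```

Once the labels are shuffled, the calendar-order encoding should no longer track the underlying values, while one-hot encoding is unaffected because it never relied on label order.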