mbernico commented 7 years ago

For categoricals like [jan, feb, mar...dec] we should shuffle the labels before applying the binning. Because these labels are ordinal and get binned in order onto the gaussian column, the resulting feature is 'too easy.'

tgsmith61591 commented 7 years ago

Maybe I misunderstand, but for a binned categorical, won't a LabelEncoder implicitly "shuffle" the labels, in the sense that it orders them alphabetically internally? Maybe I'm missing the use case.
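To make the alphabetical-ordering point concrete, here is a minimal sketch. It assumes the labels are the twelve month abbreviations (the list is spelled out here only for illustration), and it imitates LabelEncoder's sorted classes with np.unique instead of importing scikit-learn, so treat it as an approximation of that behaviour rather than the encoder itself.

```python
import numpy as np

# Illustrative label list; the thread only writes it as [jan, feb, mar...dec].
months = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

# np.unique returns the distinct values in sorted (alphabetical) order,
# which mirrors how a label encoder that sorts its classes would number them.
classes = np.unique(months)
codes = {str(label): i for i, label in enumerate(classes)}

# Codes listed in calendar order: they are no longer monotone in the month.
calendar_codes = [codes[m] for m in months]
print(list(classes))
print(calendar_codes)
```

Under this encoding the integer codes do not follow calendar order, which is the implicit "shuffle" being described. It only applies when the encoding is done this way; a hand-written 1=jan, 2=feb mapping is not affected.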


mbernico commented 7 years ago

I might not be thinking about this completely right either. I ran into a situation where some students numerically encoded a categorical column in a snape dataset (1=jan, 2=feb), and the results were superior to something like one-hot encoding. I think it might be unrealistically easy because of how we create categoricals: all snape columns are normal distributions (probably another weakness), and pd.cut assigns the labels to the bins in order, so an ordinal encoding largely recovers the underlying numeric column. It might be better to shuffle labels that happen to be ordinal so that strategy works less well?
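A rough sketch of why the ordinal encoding works so well, under stated assumptions: the column is a plain standard normal sample, the month list and seed are made up for illustration, and none of this is snape's actual code.

```python
import numpy as np
import pandas as pd

# Illustrative label list and data; not taken from snape itself.
months = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
rng = np.random.RandomState(42)
values = pd.Series(rng.normal(size=1000))

# pd.cut assigns labels to equal-width bins in ascending order,
# so "jan" is the lowest bin and "dec" the highest.
binned = pd.cut(values, bins=len(months), labels=months)

# The students' encoding: 1=jan, 2=feb, ...
month_to_num = {m: i + 1 for i, m in enumerate(months)}
ordinal = binned.astype(str).map(month_to_num)

# Correlation between the hand-made ordinal code and the hidden numeric column.
corr = np.corrcoef(values, ordinal)[0, 1]
print(round(corr, 3))
```

The correlation comes out close to 1, meaning the ordinal code is nearly a coarse copy of the original numeric feature, which one-hot encoding cannot exploit as directly.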

Consider this label_list=[[jan, feb, mar, ...dec]]

def create_categorical_features(df, label_list, random_state=None):

stuff happens that chooses a random numerical column called 'chosen_col' and then runs

df[chosen_col] = pd.cut(df[chosen_col], bins=len(label_list[0]), labels=label_list[0])

What I'm thinking is that it might be more difficult/realistic to do:

def create_categorical_features(df, label_list, random_state=None):

stuff happens that chooses a random numerical column called 'chosen_col' and then runs

shuffle(label_list[0])

df[chosen_col] = pd.cut(df[chosen_col], bins=len(label_list[0]), labels=label_list[0])
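The idea above could be sketched as a standalone helper. The name bin_with_shuffled_labels, its signature, and the sample data are hypothetical and are not part of snape; the real create_categorical_features also picks chosen_col itself, which is skipped here.

```python
import numpy as np
import pandas as pd


def bin_with_shuffled_labels(df, chosen_col, labels, random_state=None):
    """Hypothetical helper: bin one numeric column into shuffled labels."""
    rng = np.random.RandomState(random_state)
    # Build a shuffled copy so the caller's label list is left untouched.
    shuffled = [labels[i] for i in rng.permutation(len(labels))]
    out = df.copy()
    # Labels are still assigned to bins in ascending order, but the label
    # sequence itself no longer follows calendar order.
    out[chosen_col] = pd.cut(out[chosen_col], bins=len(shuffled), labels=shuffled)
    return out, shuffled


# Illustrative usage with made-up data.
months = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
rng = np.random.RandomState(0)
df = pd.DataFrame({"x": rng.normal(size=500)})
binned, order = bin_with_shuffled_labels(df, "x", months, random_state=1)
print(order)
print(binned["x"].head())
```

Two design notes, both tentative. Copying the list avoids the in-place mutation that shuffle(label_list[0]) would cause, and seeding from random_state keeps the result reproducible. Also, pd.cut returns an ordered categorical whose category order is the shuffled order, so the bin order is still recoverable from the category codes until the column is converted to plain strings or written out to CSV.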

Please tell me if I'm wrong though, as always!

mbernico / snape

shuffle categoricals before binning #13
