Open mbernico opened 7 years ago
Maybe I misunderstand, but when binning categoricals, won't a LabelEncoder implicitly "shuffle" them, in the sense that it orders the labels alphabetically internally? Maybe I'm missing the use case.
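To make the alphabetical-ordering point concrete, here is a minimal sketch. It does not call scikit-learn; it assumes that `LabelEncoder` assigns integer codes by the sorted order of the unique labels and mimics that with plain `sorted()`. The names `MONTHS`, `classes` and `alpha_code` are made up for illustration.

```python
# Month labels in calendar order, as in the issue's example.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

# Assumption: codes follow the sorted order of the unique labels,
# which is how LabelEncoder is understood to behave.
classes = sorted(set(MONTHS))
alpha_code = {label: i for i, label in enumerate(classes)}

print(classes)
# Codes listed in calendar order: they no longer increase month by month.
print([alpha_code[m] for m in MONTHS])
```

Under that assumption, the alphabetical codes do scramble the calendar order, but only for someone who uses such an encoder. A hand-written mapping like 1=jan, 2=feb still recovers the original order.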
On Mar 22, 2017 3:35 PM, "Mike Bernico" notifications@github.com wrote:
For categoricals like [jan, feb, mar...dec] we should shuffle before applying the binning. The ordinal nature of this categorical binned to the gaussian column is 'too easy.'
I might not be thinking about this completely right either. I ran into a situation where some students numerically encoded a categorical column in a snape dataset (1=jan, 2=feb), and the results were better than with something like one-hot encoding. I think it might be unrealistically easy because of how we implement create_categoricals. All snape columns are normal distributions (probably another weakness), and the labels are assigned to bins in list order, so an ordinal encoding recovers the ordering of the underlying numeric column. It might be better to shuffle labels that happen to be ordinal, so that this strategy works less well?
Consider this label_list=[[jan, feb, mar, ...dec]]
def create_categorical_features(df, label_list, random_state=None):
    df[chosen_col] = pd.cut(df[chosen_col], bins=len(label_list[0]), labels=label_list[0])
What I'm thinking is that it might be more difficult/realistic to do:
def create_categorical_features(df, label_list, random_state=None):
    shuffle(label_list[0])
    df[chosen_col] = pd.cut(df[chosen_col], bins=len(label_list[0]), labels=label_list[0])
Please tell me if I'm wrong though, as always!
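A rough way to check the intuition above, as a sketch only. It uses the standard library rather than pandas, so `equal_width_bin` is a hypothetical stand-in that only approximates what `pd.cut` does with an integer `bins` argument (the edge handling differs). The seed, the sample size, and the helper names are assumptions. It compares how well the "1=jan, 2=feb" encoding tracks the underlying Gaussian column with ordered versus shuffled labels.

```python
import random

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

def equal_width_bin(values, labels):
    """Hypothetical stand-in for pd.cut: equal-width bins, one label per bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    # Clamp the index so the maximum value falls in the last bin.
    return [labels[min(int((v - lo) / width), len(labels) - 1)] for v in values]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rng = random.Random(0)  # seeded, in the spirit of random_state
values = [rng.gauss(0, 1) for _ in range(5000)]

# The students' encoding: 1=jan, 2=feb, ...
calendar_code = {m: i + 1 for i, m in enumerate(MONTHS)}

# Labels assigned in list order, as in the first snippet.
ordered = equal_width_bin(values, MONTHS)

# Labels shuffled before binning, as in the proposed change
# (shuffling a copy here so MONTHS itself stays intact).
shuffled_labels = MONTHS[:]
rng.shuffle(shuffled_labels)
shuffled = equal_width_bin(values, shuffled_labels)

r_ordered = pearson([calendar_code[m] for m in ordered], values)
r_shuffled = pearson([calendar_code[m] for m in shuffled], values)
print("ordered labels: ", round(r_ordered, 3))
print("shuffled labels:", round(r_shuffled, 3))
```

With ordered labels the calendar encoding is essentially a coarse copy of the numeric column, which is why it can beat one-hot encoding. After shuffling, the same encoding should track the column much less closely, although the exact value depends on the permutation drawn.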