Data leakage in scalers

microsoft / dstoolkit-hierarchical-multilabel-classification

MIT License


Data leakage in scalers #2

Open FlorianPydde opened 1 year ago

FlorianPydde commented 1 year ago

https://github.com/microsoft/dstoolkit-hierarchical-multilabel-classification/blob/093a988bfb3a0d4c4711d5fe9bec9ce645cbd8e3/src/hmlc.py#L488

In the prep_input function, you apply scalers (standard scaler, power transformer, etc.) to the entire dataset before performing the train-test split. I think this is a small data leak, as you are using "knowledge" from the test set during training: the scaling statistics are computed partly from test rows. I think the fix consists of creating a sklearn Pipeline and a ColumnTransformer object to chain everything together, so that the scalers are fit on the training split only.
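To make the suggestion concrete, here is a minimal sketch of that pattern. It is not the repository's code: the column names (`amount`, `age`), the synthetic data, and the `LogisticRegression` classifier are all assumptions made for illustration. The point is only that the split happens first and the scalers live inside the pipeline, so `fit` sees training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

# Hypothetical toy data; the real columns in prep_input will differ.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "amount": rng.lognormal(size=200),
    "age": rng.normal(40, 10, size=200),
})
y = (df["amount"] > df["amount"].median()).astype(int)

# Split BEFORE any scaler is fit.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, random_state=0
)

# Each scaler is attached to its columns and fit as part of the pipeline.
preprocess = ColumnTransformer([
    ("power", PowerTransformer(), ["amount"]),
    ("standard", StandardScaler(), ["age"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression()),  # placeholder estimator, an assumption
])

# fit() learns scaling statistics from the training split only.
model.fit(X_train, y_train)

# score() reuses those training statistics to transform the test split.
test_accuracy = model.score(X_test, y_test)

# The fitted scaler's mean matches the training rows, not the full dataset.
fitted_scaler = model.named_steps["preprocess"].named_transformers_["standard"]
```

A side benefit of this layout is that cross-validation on `model` refits the scalers inside every fold, so the same leak cannot reappear there.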

Any thoughts?

Senani-Nori commented 1 year ago

You are right, @FlorianPydde. In the case of numerical columns, the scaler is applied before the train-test split, which should not be done. I will fix that.

FlorianPydde commented 1 year ago

I suppose the same applies to categorical columns. What happens if the transformer has not seen some strings before? You need a "not seen" placeholder, and you should check how it affects the results.
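As a rough illustration of what a "not seen" placeholder could look like, the sketch below uses two scikit-learn encoders on made-up colour values. Whether the toolkit uses either encoder is an assumption on my part; the example only shows the two common behaviours for a category that appears at transform time but not at fit time.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical column: "purple" never appears in training.
train = np.array([["red"], ["green"], ["blue"]], dtype=object)
test = np.array([["green"], ["purple"]], dtype=object)

# Option 1: one-hot encoding where an unseen category becomes an all-zero row.
onehot = OneHotEncoder(handle_unknown="ignore")
onehot.fit(train)
onehot_test = onehot.transform(test).toarray()

# Option 2: ordinal encoding with an explicit "not seen" code (-1 here).
ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
ordinal.fit(train)
ordinal_test = ordinal.transform(test)
```

With the default settings both encoders raise an error on an unseen category, so one of these options has to be chosen explicitly once the encoders are fit on the training split only.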

Senani-Nori commented 1 year ago

Yes, OOV (out-of-vocabulary tokens in the vectorization of strings) and 'not_seen' values in categorical columns need to be handled, though they are not data leakage issues. When I work on this, I will take care of all three. Thank you for taking the time to review and point out these issues.
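For the OOV case, a small sketch of the behaviour in question, assuming a bag-of-words vectorizer such as scikit-learn's `CountVectorizer` (the thread does not say which vectorizer the toolkit uses, and the documents below are invented). Tokens that were not in the training vocabulary are silently dropped at transform time, so a document made up entirely of new tokens becomes an all-zero row that is worth detecting.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical text column, split into train and test documents.
train_docs = ["invoice overdue payment", "payment received"]
test_docs = ["payment refund", "chargeback dispute"]

# The vocabulary is learned from the training documents only.
vectorizer = CountVectorizer()
vectorizer.fit(train_docs)

# Unknown tokens ("refund", "chargeback", "dispute") are ignored here.
test_counts = vectorizer.transform(test_docs).toarray()

# Flag test documents in which every token was out of vocabulary.
oov_rows = test_counts.sum(axis=1) == 0
```

Rows flagged in `oov_rows` carry no signal for the model, so they could be routed to a fallback prediction or logged for review.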