microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Data leakage in cross-validation functions in R and Python packages #4319

Open fabsig opened 3 years ago

fabsig commented 3 years ago

Description

The cross-validation functions in the R and Python packages (here and here) currently produce data leakage. First, the entire data set is used to create the feature mapper (which maps features into bins); afterwards, the data is split into training and validation sets, and both the training and the validation data sets use the same feature mapper (see here). Crucially, the test / validation data is also used to create this feature mapper. In other words, part of the model (the feature mapper) has already "seen" part of the validation data on which the model is then evaluated, supposedly in an out-of-sample manner. Note that no label data leaks, but information from the feature data should not leak either.

The code below demonstrates the problem. The two versions ((i) splitting the data into training and validation "by hand" and (ii) using the cv function) should produce identical results, but they do not.

I will make a pull request with a proposed partial fix for the case free_raw_data=False. I also suggest adding a warning message to notify users about this form of data leakage.

Reproducible example

data_leakage_CV_lightgbm.py

```python
import lightgbm as lgb
import numpy as np


def f1d(x):
    """Non-linear function for simulation"""
    return (1.7 * (1 / (1 + np.exp(-(x - 0.5) * 20)) + 0.75 * x))


# Simulate data
n = 500  # number of samples
np.random.seed(1)
X = np.random.rand(n, 2)
f = f1d(X[:, 0])
y = f + np.sqrt(0.01) * np.random.normal(size=n)

# Split into train and test data
train_ind = np.arange(0, int(n / 2))
test_ind = np.arange(int(n / 2), n)
folds = [(train_ind, test_ind)]

params = {
    'objective': 'regression_l2',
    'learning_rate': 0.05,
    'max_depth': 6,
    'min_data_in_leaf': 5,
    'verbose': 0
}

# Using cv function
data = lgb.Dataset(X, y)
cvbst = lgb.cv(params=params, train_set=data, num_boost_round=10,
               early_stopping_rounds=5, folds=folds,
               verbose_eval=True, show_stdv=False, seed=1)
# Results for last 3 iterations:
# [8]   cv_agg's l2: 0.536216
# [9]   cv_agg's l2: 0.48615
# [10]  cv_agg's l2: 0.440826

# Using train function and manually splitting the data does not give the same results
data_train = lgb.Dataset(X[train_ind, :], y[train_ind])
data_eval = lgb.Dataset(X[test_ind, :], y[test_ind], reference=data_train)
evals_result = {}
bst = lgb.train(params=params, train_set=data_train, num_boost_round=10,
                valid_sets=data_eval, early_stopping_rounds=5,
                evals_result=evals_result)
# Results for last 3 iterations:
# [8]   valid_0's l2: 0.534485
# [9]   valid_0's l2: 0.484565
# [10]  valid_0's l2: 0.438873
```
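For intuition, here is a minimal, hedged illustration of why the two runs above differ: the bin boundaries themselves change depending on whether the validation rows are included when they are computed. The snippet uses plain NumPy quantiles, not LightGBM's actual histogram construction, but the principle is the same.

```python
import numpy as np

# Toy illustration only -- LightGBM's bin construction is more sophisticated,
# but the principle holds: bin edges depend on which rows they are fit on.
np.random.seed(1)
x = np.random.rand(500)
train, valid = x[:250], x[250:]

edges_full = np.quantile(x, np.linspace(0, 1, 11))       # computed on all rows (sees validation data)
edges_train = np.quantile(train, np.linspace(0, 1, 11))  # computed on training rows only

print(np.max(np.abs(edges_full - edges_train)))  # non-zero: the two feature mappers differ
```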

Environment info

LightGBM version or commit hash: da3465cbf1d1cf43e0d14908f110814e960905bc

Command(s) you used to install LightGBM: python setup.py install

fabsig commented 3 years ago

Note that with the proposed fix in #4320, the above example produces identical results in both versions when setting free_raw_data=False, i.e., data = lgb.Dataset(X, y, free_raw_data=False)

StrikerRUS commented 3 years ago

Thanks a lot for the detailed description!

> both the training and the validation data sets use the same feature mapper

According to this answer it looks like this is done by design for some reason.

> Set reference will use the reference's (usually trainset) bin mapper to construct the valid set.

https://github.com/microsoft/LightGBM/issues/2553#issuecomment-551897638

Maybe @guolinke can comment?

Also, see this https://github.com/microsoft/LightGBM/issues/3362#issuecomment-702696059.

fabsig commented 3 years ago

@StrikerRUS: Thank you for your feedback. The fact that both the training and validation data use the same feature mapper is not a problem per se, as long as the feature mapper is constructed using information from the training data only. But this feature mapper is part of the model and must not be constructed using the validation data.
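Until lgb.cv() is changed, one possible workaround is to run the folds manually and build a fresh Dataset from each fold's raw training rows, so the bin mapper never sees validation data. This is only a sketch: it reuses X, y and params from the reproducible example above and the pre-4.0 evals_result argument used there.

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import KFold

# Sketch of a leakage-free CV loop: each fold builds its own Dataset from the
# raw training rows, so binning is fit without the fold's validation data.
# Assumes X, y and params are defined as in the reproducible example above.
scores = []
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
    dtrain = lgb.Dataset(X[train_idx], y[train_idx])
    # reference=dtrain reuses the *training* bin mapper for the validation set,
    # which is fine: that mapper was fit on training rows only.
    dvalid = lgb.Dataset(X[valid_idx], y[valid_idx], reference=dtrain)
    evals_result = {}
    bst = lgb.train(params=params, train_set=dtrain, num_boost_round=10,
                    valid_sets=[dvalid], evals_result=evals_result)
    scores.append(evals_result['valid_0']['l2'][-1])

print(np.mean(scores))  # leakage-free CV estimate of the l2 metric
```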

mayer79 commented 3 years ago

IMHO, the leakage from joint binning is negligible (e.g., it does not involve the response variable). My suggestion is to mention it in the help for lgb.cv instead of making the code longer and the run time slower. The note could read: "Note that feature binning is done for the combined data, not per fold."

StrikerRUS commented 3 years ago

I'm +1 for @mayer79's proposal of documenting this problem.

fabsig commented 3 years ago

I would recommend a zero-tolerance policy for all kinds of data leakage in cross-validation. The fact that one needs to add more lines of code does not seem like a sound argument against fixing the problem. Intuitively, the amount of information leakage is often small, and I agree that this intuition holds in many applications. But can you guarantee that there is no dataset for which this leakage is a serious issue?

jameslamb commented 2 years ago

@fabsig I apologize for the long delay in responding to this issue! I'd like to pick it up again and try to move it to resolution before LightGBM 4.0.0.

@shiyu1994 @btrotta @Laurae2 @guolinke could you take a look at this issue and give us your opinion?

Specifically, this question:

Today, lgb.cv() in the R and Python packages constructs a single Dataset from the full raw training data, then performs k-fold cross validation by taking subsets of that Dataset.

Should it be modified to instead subset the raw training data in each cross validation trial, and create new Datasets from each of those subsets?
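To make the two options concrete, here is a rough sketch of the difference on a single fold (assumed toy data; the real lgb.cv() internals are more involved than this):

```python
import lightgbm as lgb
import numpy as np

# Assumed toy data and one train/validation split, purely for illustration.
X, y = np.random.rand(100, 2), np.random.rand(100)
train_idx, valid_idx = list(range(0, 50)), list(range(50, 100))

# Today (sketch): one Dataset is built from all rows, so the bin mapper sees
# every row; the folds are then taken as subsets of that Dataset.
full_data = lgb.Dataset(X, y)
cv_train = full_data.subset(train_idx)   # shares the full-data bin mapper
cv_valid = full_data.subset(valid_idx)   # shares the full-data bin mapper

# Proposed (sketch): subset the raw data first and build a new Dataset per
# fold, so the bin mapper is fit on that fold's training rows only.
fold_train = lgb.Dataset(X[train_idx], y[train_idx])
fold_valid = lgb.Dataset(X[valid_idx], y[valid_idx], reference=fold_train)
```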

If we decide to move forward with this change, I'd be happy to start providing a review on the specific code changes in #4320.

guolinke commented 2 years ago

Sorry for the late response. IMO, I agree with @fabsig: zero tolerance for leakage is the best solution. @shiyu1994, can you help with this?

shiyu1994 commented 2 years ago

Sorry for the slow response. I'm busy with several large PRs these days and missed this.

Yes, I fully support this idea. Actually, @guolinke and I have discussed this issue before. A strict cv function should do everything without looking at any data in the test fold.

I'll provide a review for #4320.

mayer79 commented 2 years ago

@shiyu1994: Agreed, but please monitor the memory footprint for large data. In my view, it would not be acceptable for the footprint to increase by a factor of k, where k is the fold count, compared to the current solution. (This depends on how lgb.cv() currently stores the data, which I am not sure about.)

shiyu1994 commented 2 years ago

@mayer79 The current solution stores only a single copy of the data. I think that to fully avoid data leakage, it is unavoidable to store k discretized copies of the data, each holding (k-1)/k of the original rows (so roughly (k-1) times the current memory in total), because each copy will have different boundaries for feature discretization. Do you think it is worthwhile to provide such an alternative that fully avoids data leakage but increases the memory cost? We can still keep the current approach as a memory-saving option, so that users can make the trade-off.
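For a rough sense of scale, here is a back-of-the-envelope estimate only, assuming about one byte per binned feature value and illustrative data dimensions:

```python
# Back-of-the-envelope memory estimate; assumes ~1 byte per binned value.
n_rows, n_features, k = 10_000_000, 100, 5

one_copy_gb = n_rows * n_features / 1e9        # single binned copy: ~1 GB
per_fold_gb = k * (k - 1) / k * one_copy_gb    # k copies, each (k-1)/k of the rows: ~(k-1)x
print(f"current: ~{one_copy_gb:.1f} GB, per-fold binning: ~{per_fold_gb:.1f} GB")
```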

mayer79 commented 2 years ago

Thanks so much for clarifying. Having both options would be an ideal solution. Still, I think the potential for leakage is negligible compared to all the bad things the average data scientist might do during modeling (e.g., doing stratified instead of grouped splitting when rows are dependent) ;-(

harshsarda29 commented 1 month ago

There is an issue with using the entire data for binning when the data distribution changes over time, which is often the case in real-world applications. While working on my data, I ran into exactly this: when I try to reproduce the results generated by lightgbm.cv without providing the entire dataset as reference, the metric (average precision) differs by a lot; in my case the difference is 0.06. With the current approach, the cross-validation scores will always look quite high compared to retraining the model on the entire data and then evaluating on a test set.