
szilard / benchm-ml

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).

MIT License


Possible data leakage #33

Closed arogozhnikov closed 8 years ago

arogozhnikov commented 8 years ago

Hi, szilard! Thanks for your benchmarks. I think you found an interesting dataset for comparison.

HOWEVER

The departure time present in the data is the exact time when the aircraft takes off. Thus, by analyzing the flights from airport X to airport Y by carrier Z, one can establish at which time an aircraft should take off to be on time (and that's what deep trees do, I believe).

At least, I could easily see such patterns in the data.

It doesn't seem very useful to predict whether an aircraft departs on time when you already know this information.

So my suggestion is either to replace DepTime with PlannedDepTime (if you know how to get this information) or to set DepTime = DepTime // 200. This reduces the possibility of exploiting this information, while the altered feature still gives approximate information about the flight schedule.
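A minimal sketch of the coarsening suggested above, in plain Python. It assumes that DepTime is stored as an hhmm integer (e.g. 1435 for 14:35), in which case integer division by 200 maps each value to a two-hour window. The function name `coarsen_dep_time` and the sample values are hypothetical and are not taken from the benchmark code.

```python
def coarsen_dep_time(dep_time: int, bucket: int = 200) -> int:
    """Map an hhmm departure time to a coarse bucket index.

    Assumption: dep_time is an hhmm integer (0..2359). With bucket=200,
    values 0..159 (00:00-01:59) map to 0, 200..359 map to 1, and so on.
    """
    return dep_time // bucket


# Hypothetical sample values, for illustration only.
dep_times = [5, 759, 800, 1435, 2359]
coarse = [coarsen_dep_time(t) for t in dep_times]
print(coarse)
```

Two flights that leave at 07:59 and 08:00 land in different buckets, but a model can no longer see how many minutes a given flight left after its usual slot, which is the detail that leaks the label.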

szilard commented 8 years ago

Thanks for the feedback. Indeed, that's possible data leakage (at least partial).

The main goal of this project is to compare the scalability+speed+accuracy of various implementations of the same algos, so this should not matter for this goal.

It might be a problem for the comparison of, e.g., GBM and DL, though I don't think either of them could exploit this without some feature engineering.

It might be worth trying to run e.g. RF/GBM/DL with DepTime replaced with PlannedDepTime and comparing the AUCs. I might do that later on, but you are free to do that now if you want.
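For anyone running that comparison by hand, AUC can be computed directly from labels and predicted scores as the fraction of (positive, negative) pairs that the model ranks correctly, with ties counted as half. The sketch below is a small pure-Python version for illustration; the labels and scores are made-up toy values, not results from this benchmark.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as 0.5.
    Quadratic in the number of examples, so only for small inputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy values only: two hypothetical sets of scores for the same labels.
labels = [1, 0, 1, 0, 0, 1]
scores_a = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7]
scores_b = [0.6, 0.5, 0.4, 0.7, 0.2, 0.8]
print(auc(labels, scores_a), auc(labels, scores_b))
```

In practice one would train the same model twice, once on each version of the departure-time feature, and pass each model's held-out predictions to a function like this (or to a library equivalent).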

arogozhnikov commented 8 years ago

> It might be worth trying to run e.g. RF/GBM/DL with DepTime replaced with PlannedDepTime and comparing the AUCs. I might do that later on, but you are free to do that now if you want.

Agreed. At the moment I am testing different LibFM implementations on this data; I'll try to compare RF/GBDT on PlannedDepTime when I'm done.

Also, I don't see DL in the benchmarks on the flight data. Are the results bad, or have you not had time to test?

szilard commented 8 years ago

Re: RF/GBDT on PlannedDepTime: sounds great, thanks.

Re: DL: I started to do something, but I have not added the results to the README yet. For the last few weeks I could not work on this, but I hope to get back to it soon. Anyway, here are some preliminary results with H2O and mxnet, but I'm planning to look at the other tools as well: https://github.com/szilard/benchm-ml/issues/28