szilard / benchm-ml

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).
MIT License

other dataset of such type for benchmarking? #11

Closed: szilard closed this issue 9 years ago

szilard commented 9 years ago

@tqchen I moved your last question to a new issue:

Thanks for the clarification! BTW, do you have any idea if there is any other dataset of such type for benchmarking? For example, a dataset with more columns and rows.

One thing I noticed about this dataset is that the output seems very dependent on one variable (when features are randomly dropped at a rate of 50%, an individual tree can turn out very bad). This might make the result a singular case where the model simply cuts repeatedly on a single feature.

szilard commented 9 years ago

Excellent question! The data I use is ~10M x 10 (8 features, more precisely), a mix of categoricals and numerics. If you expand the categoricals with 1-hot encoding, it's ~10M x 1K.
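A rough sketch of how that column expansion happens (the column names and cardinalities below are illustrative, not the exact ones in the benchmark data):

```python
# Minimal sketch: 1-hot encoding a few higher-cardinality categoricals is what
# turns ~10 raw columns into ~1K columns.
import pandas as pd

df = pd.DataFrame({
    "Month":    pd.Categorical(["1", "7", "12"]),       # 12 levels in the full data
    "Carrier":  pd.Categorical(["AA", "DL", "UA"]),     # a few dozen levels (illustrative)
    "Origin":   pd.Categorical(["SFO", "JFK", "ORD"]),  # a few hundred levels (illustrative)
    "Distance": [300, 2475, 740],                        # numeric, left as-is
})

X = pd.get_dummies(df, columns=["Month", "Carrier", "Origin"], sparse=True)
print(X.shape)  # each categorical expands into one 0/1 column per level
```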

In fact I use years 2005 and 2006 for training, which is 14M rows. I could add a few more years, but with too many years the probability distributions change etc. Also, the original dataset has some more columns, but most of them "leak" info on the target.

I wish there was a similar dataset of 100M x 100 (10x bigger in each dimension). In that case, though, you would reach ~100GB (the current data is ~1GB, and 10x the rows times 10x the columns is ~100x the size), so single-machine algos might run into trouble.

Most public datasets I've seen with a large number of rows are sparse, and therefore not relevant to what I want to do (many people use linear models such as VW on those, especially if they are extremely sparse).

If all I wanted from the data was to keep the CPUs busy, I could use a 10x replica and get ~100M x 10. However, that would not work for learning about how AUC scales with data size.
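For illustration, a replication like that might look as follows (file names are hypothetical); the point is that duplicated rows add compute but no new information, so the AUC-vs-size curve would flatten artificially:

```python
# Hypothetical sketch: blow the training set up 10x by simple replication.
# This keeps CPUs busy, but tells you nothing about how AUC would scale
# with genuinely larger data.
import pandas as pd

train = pd.read_csv("train-10m.csv")                        # path is illustrative
train_100m = pd.concat([train] * 10, ignore_index=True)     # 10 stacked copies
train_100m = train_100m.sample(frac=1.0, random_state=42)   # shuffle the copies
train_100m.to_csv("train-100m-replicated.csv", index=False)
```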

Taking all this into account, plus my interest in ~10M row datasets at work, I decided to go with the current setup (which is also easy for someone else to reproduce). However, if there are other options I'd like to hear them here. :)

tqchen commented 9 years ago

One dataset I know of is the Criteo ads CTR data: http://labs.criteo.com/downloads/download-terabyte-click-logs/

It has 1 billion examples. There are high-dimensional attributes which do not suit the purpose of this benchmark, but it also contains 13 integer columns; testing only on the integer columns could give some reasonable results and would be suitable for trees.
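A sketch of what "testing only on the integer columns" might look like, assuming the published format of the terabyte click logs (tab-separated; label first, then 13 integer features, then 26 hashed categorical features):

```python
# Keep only the label and the 13 integer columns from one per-day file.
# File name and sample size are illustrative.
import pandas as pd

day0 = pd.read_csv(
    "day_0",                   # one of the per-day files from the download
    sep="\t",
    header=None,
    usecols=list(range(14)),   # label + 13 integer features
    nrows=1_000_000,           # sample; the full files are far too big for RAM
)
day0.columns = ["label"] + [f"I{i}" for i in range(1, 14)]
print(day0.shape)
```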

szilard commented 9 years ago

Also, this setup has a peculiar oddity: I treat Month and a few other variables as (non-ordered) factors, but if you make them ordered, you can actually get more accurate models. So I'm artificially restricting my models (all of them, across the board, for consistency) to treat these variables as plain factors.
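A hypothetical illustration of the distinction (not the benchmark's own code): treating Month as a plain factor means 1-hot encoding it, so a tree can only split on "Month == 7 vs. not"; treating it as ordered means leaving it as an integer, so a tree can split on "Month <= 6", which can help accuracy.

```python
import pandas as pd

df = pd.DataFrame({"Month": [1, 6, 7, 12]})

# plain (unordered) factor: one indicator column per level
X_factor = pd.get_dummies(df["Month"].astype("category"), prefix="Month")

# ordered treatment: keep the natural integer order
X_ordered = df[["Month"]]

print(X_factor.columns.tolist())  # ['Month_1', 'Month_6', 'Month_7', 'Month_12']
print(X_ordered.head())
```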

szilard commented 9 years ago

Thanks. I thought the Criteo data was ~40M rows, but I see the one at that link must be bigger. I'll have to take a look at whether that structure is relevant to my stuff, but good to know.

tqchen commented 9 years ago

It is a bit unfortunate that there isn't a public credit/fraud detection dataset available at such scale.