szilard / benchm-ml

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).

More datasets and regression problems #53

Open PhilippPro opened 6 years ago

PhilippPro commented 6 years ago

Did you consider using more datasets?

And how about regression problems?

There is for example this benchmarking suite, accessible via the OpenML packages: https://arxiv.org/abs/1708.03731
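For instance, a minimal sketch of pulling a suite and checking dataset sizes with the `openml` Python package (the suite alias here is an assumption; the paper describes OpenML100, and successor suites such as OpenML-CC18 use the same API):

```python
# Sketch: fetch a benchmarking suite via openml-python and print dataset sizes.
import openml

suite = openml.study.get_suite("OpenML-CC18")   # suite alias is an assumption
for task_id in suite.tasks[:5]:                 # first few tasks, for brevity
    dataset = openml.tasks.get_task(task_id).get_dataset()
    print(dataset.name, int(dataset.qualities["NumberOfInstances"]))
```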

szilard commented 6 years ago

Re more datasets: https://github.com/szilard/GBM-perf/issues/4#issuecomment-362651796

My focus now is on top GBM implementations (including on GPUs). Doing more by doing less. I've dockerized the most important things in a separate repo: https://github.com/szilard/GBM-perf

Also read this summary I wrote recently: https://github.com/szilard/benchm-ml#summary

PhilippPro commented 6 years ago

I just watched your talk, very interesting.

In my opinion, one of the directions that should be developed further (and which you already mentioned) is AutoML: packages for automatic tuning, automatic ensembling, automatic feature engineering etc. in a time-efficient way.

szilard commented 6 years ago

Oh, I forgot to mention in my last comment: RE OpenML, those datasets are ridiculously small: https://gist.github.com/szilard/b82635fa9060227514af3423b3225a29

There is also another set of datasets, but those are also too small: https://gist.github.com/szilard/d8279374646fb5f372317db2a4074f2f

I would want a set of datasets with sizes ranging from 1K to 10M rows and a median size of 100K (so it should cover 1K-10K-100K-1M-10M).
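For example, a minimal sketch of generating such a nested set of subsamples from a larger source file, similar to how this benchmark subsamples the airline data (the file name is a placeholder):

```python
# Sketch: draw nested benchmark subsets of 1K-10M rows from a ~10M-row source.
import pandas as pd

full = pd.read_csv("train-10m.csv")             # placeholder ~10M-row source file
for n in [10**3, 10**4, 10**5, 10**6, 10**7]:
    subset = full.sample(n=min(n, len(full)), random_state=123)
    subset.to_csv(f"train-{n}.csv", index=False)
```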

RE AutoML: Indeed, that's super interesting. However, benchmarking that is way more difficult because of the tricky tradeoff between computation time and accuracy. I've been looking at a few solutions, but nothing formal (just tried them out). Btw most of them have GBMs as building blocks, so benchmarking the components can already give you some idea of performance.
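To illustrate benchmarking a component, a minimal sketch timing a single xgboost fit on synthetic data (dataset and parameters here are placeholders, not this benchmark's actual setup):

```python
# Sketch: time one GBM component (xgboost) on a synthetic binary task.
import time
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100_000, n_features=20, random_state=42)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "binary:logistic", "max_depth": 10, "eta": 0.1}
start = time.time()
xgb.train(params, dtrain, num_boost_round=100)
print(f"train time: {time.time() - start:.1f}s")
```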

Btw when you say my talk, is it the KDD one? That's probably the most up to date, though my experiments with AutoML and a few other things/results happened after the talk.

PhilippPro commented 6 years ago

OK, there are only a few datasets with size above 10K in the OpenML or PMLB benchmarking suites.

The AutoML solutions should have a time-constraint parameter, so that e.g. one can compare the results between these algorithms after 1 hour. Of course, in reality they often lack this feature.
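For illustration, a minimal sketch of what such a time-budgeted run could look like with H2O AutoML, which does expose a time budget via max_runtime_secs (file and column names here are placeholders borrowed from the airline-data setup):

```python
# Sketch: a 1-hour AutoML run with H2O; file/column names are assumptions.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train-100k.csv")                 # placeholder file
train["dep_delayed_15min"] = train["dep_delayed_15min"].asfactor()

x = [c for c in train.columns if c != "dep_delayed_15min"]
aml = H2OAutoML(max_runtime_secs=3600, seed=1)            # 1-hour time budget
aml.train(x=x, y="dep_delayed_15min", training_frame=train)
print(aml.leaderboard)
```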

Yes, the KDD one, quite inspiring.