rambler-digital-solutions / criteo-1tb-benchmark

Benchmark of different ML algorithms on Criteo 1TB dataset

run this benchmark with autosklearn (zeroconf) #2

Open Motorrat opened 7 years ago

Motorrat commented 7 years ago

https://github.com/paypal/autosklearn-zeroconf

d-nosov commented 7 years ago

Hi, @Motorrat,

Thank you for your interest in our benchmark!

Are you going to perform the autosklearn test yourself, or should we try to do it?

Motorrat commented 7 years ago

I'd like to try it myself eventually, but it may take a while. Since you already have the dataset and the environment, it would probably be much easier for you to kick that off.

d-nosov commented 7 years ago

Okay, I'll give it a try. Stay tuned!

d-nosov commented 7 years ago

After waiting almost a whole working day (with no result) for AutoSklearnClassifier to finish training on the smallest possible dataset (10,000 lines) with a very limited number of features (20 hashes after the hashing trick, which is nowhere near the numbers used for the already-tested algorithms; for example, I used 100,000 in Spark.ML LogisticRegression, and even that seems not enough), I tend to think this benchmark is not a very practical case for AutoSklearn: if it takes several hours to train even on such small data, it won't scale well to millions of lines of training data. I admit I may be doing something wrong, so it would be great if you also gave it a try. I'm not closing the issue, but I'm ceasing work on it.
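For context, the setup was roughly along the lines of the sketch below. This is only an illustration, not the exact code I ran: it assumes the standard Criteo TSV layout (label, 13 integer columns, 26 categorical columns) and a hypothetical 10,000-line `sample.tsv` file.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from autosklearn.classification import AutoSklearnClassifier

N_FEATURES = 20  # hashing-trick dimensionality used in this test

rows, labels = [], []
with open("sample.tsv") as f:  # hypothetical 10,000-line sample of one Criteo day
    for line in f:
        parts = line.rstrip("\n").split("\t")
        labels.append(int(parts[0]))
        # feed the 26 categorical columns to the hasher as "colN=value" tokens
        rows.append([f"c{i}={v}" for i, v in enumerate(parts[14:]) if v])

hasher = FeatureHasher(n_features=N_FEATURES, input_type="string")
X = hasher.transform(rows).toarray()
y = np.array(labels)

# give the pipeline search an explicit budget instead of the defaults
clf = AutoSklearnClassifier(time_left_for_this_task=3600,  # 1 hour overall
                            per_run_time_limit=360)        # 6 min per candidate
clf.fit(X, y)
print(clf.sprint_statistics())
```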

Motorrat commented 7 years ago

Thanks for trying it out! Zeroconf is designed to run for at most ~24 hours while searching for the best prediction pipeline. Could you just let it run overnight? Also, could you share the code you wrote to prepare the dataset for your experiment? That would help me quickly set up the experiment in my environment, as you suggest.
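As a rough sketch (this assumes calling the underlying AutoSklearnClassifier directly with a longer budget; zeroconf's own configuration may expose the limit differently), an overnight run could look like this:

```python
from autosklearn.classification import AutoSklearnClassifier

# X, y would be the hashed features and labels from the sketch above
overnight = AutoSklearnClassifier(
    time_left_for_this_task=12 * 3600,  # ~12 hours of pipeline search
    per_run_time_limit=1800,            # cap each candidate model at 30 minutes
)
overnight.fit(X, y)
print(overnight.sprint_statistics())
```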