microsoft / FLAML

A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.
https://microsoft.github.io/FLAML/

AutoML() doesn't seem to use Ray's object store (for large datasets) #365

Open ottobricks opened 2 years ago

ottobricks commented 2 years ago

Hi there,

I have a large dataset (100+ GB) that I have been trying to get FLAML's AutoML to work with, without success so far. Since FLAML uses Ray, shouldn't it take advantage of Ray's object store (which can spill objects to disk)? If not, any suggestions on how we should approach out-of-core computation with FLAML?

ottobricks commented 2 years ago

When I try to pass a Ray ObjectRef to AutoML's fit, I get an error saying that a NumPy array, pandas DataFrame, or SciPy sparse matrix is expected.
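
A minimal sketch of what I tried (the file path and label column are placeholders for our actual data):

```python
import pandas as pd
import ray
from flaml import AutoML

ray.init()
df = pd.read_parquet("features.parquet")     # placeholder for the real 100+ GB table
X_ref = ray.put(df.drop(columns=["label"]))  # ObjectRef into Ray's object store

automl = AutoML()
# This raises a type error: fit expects a NumPy array, pandas DataFrame,
# or SciPy sparse matrix, not an ObjectRef.
automl.fit(X_train=X_ref, y_train=df["label"], task="regression", time_budget=600)
```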

sonichi commented 2 years ago

@ottok92 How do you perform training currently? If you have a working training function already, you can use flaml.tune to perform hyperparameter tuning.
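
For illustration, a rough sketch of the wrapping (train_my_model and compute_rmse stand in for your own training and evaluation code):

```python
from flaml import tune

def evaluate_config(config):
    # Train with the sampled hyperparameters and return the metric to optimize.
    model = train_my_model(**config)      # placeholder: your existing training code
    return {"rmse": compute_rmse(model)}  # placeholder: your evaluation code

analysis = tune.run(
    evaluate_config,
    config={  # search space; tune also supports randint, choice, etc.
        "learning_rate": tune.loguniform(1e-4, 1e-1),
        "num_trees": tune.lograndint(4, 1000),
    },
    metric="rmse",
    mode="min",
    time_budget_s=3600,  # tuning budget in seconds
)
print(analysis.best_config)
```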

ottobricks commented 2 years ago

Thank you for the quick reply, @sonichi. Currently I do everything in Spark with Scala. I'm interested in using FLAML both for the impressive CFO algorithm and to make it easier for my colleagues to collaborate (everybody knows Python). I'll go through the docs you suggested to see whether they enable us to run FLAML on our large datasets. Thanks for the support!

ottobricks commented 2 years ago

Integrating with Ray's object store looks very promising. Thank you for the suggestion. I will run some experiments and post feedback in this thread for future reference.

sonichi commented 2 years ago

@ottok92 That's great. I'm very interested in how it works for your use case. Another question: which learner do you use, e.g., lightgbm? flaml has built-in search spaces for its built-in learners, which might be useful. For example, here is an example of tuning lgbm: https://github.com/microsoft/FLAML/blob/main/test/tune_example.py. To adapt it to your dataset, you can modify the train_lgbm function, metric, mode, and time_budget_s, and set use_ray=True if you would like to do parallel tuning.
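
In case it helps, a condensed sketch along the lines of that example (the dataset and search space here are simplified stand-ins):

```python
import lightgbm as lgb
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from flaml import tune

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

def train_lgbm(config):
    # Train one LightGBM model with the sampled config and report test RMSE.
    model = lgb.LGBMRegressor(**config)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {"rmse": mean_squared_error(y_test, pred) ** 0.5}

analysis = tune.run(
    train_lgbm,
    config={
        "n_estimators": tune.lograndint(4, 1000),
        "num_leaves": tune.lograndint(4, 1000),
        "learning_rate": tune.loguniform(1e-3, 1.0),
    },
    metric="rmse",
    mode="min",
    time_budget_s=300,  # tune for 5 minutes
    use_ray=True,       # distribute trials over a Ray cluster
)
print(analysis.best_config)
```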

ottobricks commented 2 years ago

Perfect! I'm working with XGBoost, which is also built-in. Once I finish playing with this, I will share my train_xgboost function. Maybe we can create a section in the Docs for "handling large datasets with Ray".
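
Roughly, the shape I have in mind (just a sketch; load_large_dataset stands in for our actual Spark export, and the search space is illustrative):

```python
import ray
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from flaml import tune

ray.init()

# Placeholder: however the 100+ GB dataset is actually materialized.
X_train, y_train, X_test, y_test = load_large_dataset()

# Put the dataset into the object store once; trials fetch it by reference,
# and Ray can spill it to disk instead of copying it into every worker.
data_ref = ray.put((X_train, y_train, X_test, y_test))

def train_xgboost(config):
    # Fetch the shared dataset from the object store inside the trial.
    X_tr, y_tr, X_te, y_te = ray.get(data_ref)
    model = xgb.XGBRegressor(**config)
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return {"rmse": mean_squared_error(y_te, pred) ** 0.5}

analysis = tune.run(
    train_xgboost,
    config={
        "n_estimators": tune.lograndint(4, 1000),
        "max_depth": tune.randint(1, 12),
        "learning_rate": tune.loguniform(1e-3, 1.0),
    },
    metric="rmse",
    mode="min",
    time_budget_s=3600,
    use_ray=True,
)
```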

sonichi commented 2 years ago

> Perfect! I'm working with XGBoost, which is also built-in. Once I finish playing with this, I will share my train_xgboost function. Maybe we can create a section in the Docs for "handling large datasets with Ray".

That'll be super cool. Looking forward to it.

sonichi commented 2 years ago

BTW, flaml provides two search spaces for XGBoost: XGBoostSklearnEstimator tunes "max_leaves", while XGBoostLimitDepthEstimator tunes "max_depth".
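
For future readers, a hedged sketch of pulling one of those built-in spaces into flaml.tune (the import path and search_space signature may vary across flaml versions, and the data shape below is a made-up placeholder):

```python
from flaml.model import XGBoostSklearnEstimator  # or XGBoostLimitDepthEstimator

# Each entry maps a hyperparameter name to a dict with a "domain" plus init values.
space = XGBoostSklearnEstimator.search_space((1_000_000, 50))  # placeholder shape
config_search_space = {hp: s["domain"] for hp, s in space.items()}
# XGBoostSklearnEstimator's space includes "max_leaves";
# XGBoostLimitDepthEstimator's includes "max_depth" instead.

low_cost_init = {
    hp: s["low_cost_init_value"]
    for hp, s in space.items()
    if "low_cost_init_value" in s
}
# These can then be passed to tune.run via config=config_search_space and
# low_cost_partial_config=low_cost_init.
```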