microsoft / FLAML

A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.
https://microsoft.github.io/FLAML/
MIT License

Set splitter for cross-validation #238

Status: Closed (closed by wuchihsu 2 years ago)

wuchihsu commented 2 years ago

Is it possible to set a splitter or an iterable when fitting?

sonichi commented 2 years ago

Currently, the splitter is determined by split_type: https://github.com/microsoft/FLAML/blob/a99e939404caeda88f32724cc264841f2f5dcfca/flaml/automl.py#L812-L853. Is your use case outside the options supported there?

wuchihsu commented 2 years ago

Yes, my case goes beyond them. I want to set a customized splitter, just like the 'cv' parameter in sklearn.model_selection.RandomizedSearchCV.
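For context, scikit-learn's 'cv' parameter accepts, among other things, an iterable that yields (train, test) index pairs. The sketch below builds such an iterable in plain Python for an expanding-window split, which is one kind of split that a fixed split_type might not cover. The function name expanding_window_folds and its parameters are illustrative assumptions, not part of FLAML or scikit-learn.

```python
from typing import Iterator, List, Tuple

Fold = Tuple[List[int], List[int]]


def expanding_window_folds(n_samples: int, n_folds: int, test_size: int) -> Iterator[Fold]:
    """Yield (train_indices, test_indices) pairs.

    Hypothetical helper: each test block comes strictly after its
    training rows, and the training window grows from fold to fold.
    """
    for k in range(n_folds):
        # Test blocks are laid out back to back at the end of the data.
        test_start = n_samples - (n_folds - k) * test_size
        if test_start <= 0:
            raise ValueError("not enough samples for the requested folds")
        train = list(range(test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


folds = list(expanding_window_folds(n_samples=10, n_folds=3, test_size=2))
for train_idx, test_idx in folds:
    print(train_idx, test_idx)
```

A list like folds is the kind of object that could be passed as cv=folds to RandomizedSearchCV; the request in this issue is for an equivalent hook in FLAML.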

sonichi commented 2 years ago

Not supported currently. Would you like to work on adding that feature?

wuchihsu commented 2 years ago

OK, I would give it a try.

sonichi commented 2 years ago

> OK, I would give it a try.

Great. Feel free to discuss on gitter if you have questions.

sonichi commented 2 years ago

@wuchihsu what's the status of this feature? #320 requests the same feature.

wuchihsu commented 2 years ago

> @wuchihsu what's the status of this feature? #320 requests the same feature.

Sorry, I have not started yet. I can start this month.

coffepowered commented 2 years ago

Hi, thank you for your answer. I tried looking at the code myself, but it seems that such a modification would require refactoring at multiple levels. If you can provide a high-level overview, I might be able to help as well.

slhuang commented 2 years ago

Thanks! We would love to add support for customized splitters. We envision the API for a customized splitter as a class that implements a function called split(X_train), which would be called here: https://github.com/microsoft/FLAML/blob/dd60dbc5eb7e6d298c5f1506a8e0b7a8bb390052/flaml/ml.py#L323. Would this API meet your needs, or would your split() function need more information than X_train, e.g., y_train or group information? It would be even better if you could share your test code/data with us.

Do you want to work on adding this feature? At a high level, a few places need to change:

  1. [API change]: a customized splitter class can be passed via split_type in __init__() https://github.com/microsoft/FLAML/blob/3111084c0766e8dd77b8ca8d36eea358d939adc2/flaml/automl.py#L414 or fit() https://github.com/microsoft/FLAML/blob/3111084c0766e8dd77b8ca8d36eea358d939adc2/flaml/automl.py#L1628
  2. [pre-processing]: search for "_split_type" and see how a customized splitter would fit. E.g., https://github.com/microsoft/FLAML/blob/3111084c0766e8dd77b8ca8d36eea358d939adc2/flaml/automl.py#L958 https://github.com/microsoft/FLAML/blob/3111084c0766e8dd77b8ca8d36eea358d939adc2/flaml/automl.py#L1186 https://github.com/microsoft/FLAML/blob/3111084c0766e8dd77b8ca8d36eea358d939adc2/flaml/automl.py#L1402
  3. [split() call]: the call to split() on a customized splitter needs to be handled properly in ml.py. E.g., https://github.com/microsoft/FLAML/blob/dd60dbc5eb7e6d298c5f1506a8e0b7a8bb390052/flaml/ml.py#L323 https://github.com/microsoft/FLAML/blob/dd60dbc5eb7e6d298c5f1506a8e0b7a8bb390052/flaml/ml.py#L319

Please let me know your thoughts. Thanks.
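To make the split(X_train) question concrete, here is a minimal sketch of a group-aware splitter that still exposes a split(X_train) call. It assumes the group labels are handed to the constructor, so split() needs nothing beyond X_train at call time. The class name LeaveOneGroupOutSplitter and its constructor argument are hypothetical and not FLAML API. The sketch is only valid if the rows passed to split() arrive in the same order as the stored groups, which this thread does not establish for FLAML's internals.

```python
from typing import Iterator, List, Sequence, Tuple


class LeaveOneGroupOutSplitter:
    """Hypothetical custom splitter: each fold holds out one group."""

    def __init__(self, groups: Sequence):
        # Assumption: groups[i] is the group label of row i of X_train.
        self.groups = list(groups)

    def get_n_splits(self, X=None, y=None, groups=None) -> int:
        return len(set(self.groups))

    def split(self, X_train, y_train=None, groups=None) -> Iterator[Tuple[List[int], List[int]]]:
        if len(X_train) != len(self.groups):
            raise ValueError("X_train and groups must have the same length")
        # Iterate over group labels in sorted order for reproducible folds.
        for held_out in sorted(set(self.groups)):
            train = [i for i, g in enumerate(self.groups) if g != held_out]
            test = [i for i, g in enumerate(self.groups) if g == held_out]
            yield train, test


X_train = [[0.1], [0.2], [0.3], [0.4], [0.5]]
splitter = LeaveOneGroupOutSplitter(groups=["a", "a", "b", "c", "b"])
group_folds = list(splitter.split(X_train))
```

If the data is reordered before split() is called, labels stored this way would no longer line up with the rows, which is one reason a splitter might need y_train or group information passed in explicitly.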

wuchihsu commented 2 years ago

I have made a working version -> wuchihsu/FLAML@a89f8837835c64f9be3cc6c28b6b0d6209881d9e. However, I used a new param called custom_split, because I hadn't seen @slhuang's reply when I began working. custom_split can be an Iterable or a class with split and get_n_splits methods. Any suggestions? I can also change it to use only the split_type param.
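As a rough illustration of accepting both forms, the sketch below turns either an iterable of (train, test) pairs or an object with split and get_n_splits into one list of folds. This is not the code from the linked commit; the helper name materialize_folds and the demo class HalvesSplitter are assumptions made for this example.

```python
from typing import List, Tuple

Fold = Tuple[List[int], List[int]]


def materialize_folds(custom_split, X, y=None) -> List[Fold]:
    """Return a list of (train, test) index lists.

    Hypothetical helper: accepts a splitter object (duck-typed on
    split/get_n_splits) or any iterable of (train, test) pairs.
    """
    if hasattr(custom_split, "split") and hasattr(custom_split, "get_n_splits"):
        pairs = custom_split.split(X, y)
    else:
        pairs = custom_split
    folds = [(list(train), list(test)) for train, test in pairs]
    if not folds:
        raise ValueError("custom_split produced no folds")
    return folds


class HalvesSplitter:
    """Demo splitter: train on one half of the rows, test on the other."""

    def get_n_splits(self, X=None, y=None, groups=None) -> int:
        return 2

    def split(self, X, y=None, groups=None):
        mid = len(X) // 2
        first, second = list(range(mid)), list(range(mid, len(X)))
        yield second, first
        yield first, second


X = list(range(6))
from_object = materialize_folds(HalvesSplitter(), X)
from_iterable = materialize_folds(iter([([0, 1, 2], [3, 4, 5])]), X)
```

Materializing the folds once means a one-shot generator is not silently exhausted if the folds are reused, at the cost of holding all index lists in memory.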

slhuang commented 2 years ago

@wuchihsu Can you use the param split_type and make a PR? We can then review your code. thanks!

wuchihsu commented 2 years ago

> @wuchihsu Can you use the param split_type and make a PR? We can then review your code. thanks!

I opened a PR: #333