microsoft / FLAML

A fast library for AutoML and tuning. https://microsoft.github.io/FLAML/

Cannot reproduce Flaml predictions using SkLearn RF #1287

Open zuoxu3310 opened 3 months ago

zuoxu3310 commented 3 months ago

Discussed in https://github.com/microsoft/FLAML/discussions/1054

Originally posted by **Therrm** May 26, 2023

Hi there! After running FLAML on RF only, I get the following best parameters:

```python
best_hyperparams = {
    "subsample": 1.0,
    "num_leaves": 256,
    "n_estimators": 300,
    "min_split_gain": 0.0,
    "min_child_samples": 30,
    "max_depth": -1,
    "learning_rate": 0.01,
    "colsample_bytree": 1,
}
```

But when I try to reproduce those predictions with the same parameters using the sklearn RF, I get quite different results. For instance, I get only 3 to 4 distinct predictions, while those from FLAML were close to a random distribution. What else does FLAML do that the plain RF doesn't? Is there some additional post-processing done by FLAML? Note: I already pre-process my data by removing rows with empty values and normalizing the dataset (for both FLAML and RF). Thanks!
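One thing worth checking first: several of the reported names (`num_leaves`, `learning_rate`, `colsample_bytree`, `subsample`, `min_split_gain`, `min_child_samples`) look like LightGBM-style parameters rather than `RandomForestClassifier` constructor arguments, which by itself could explain a failed reproduction. A quick sklearn-only sketch to see which names the RF actually accepts (the filtering helper here is illustrative, not part of FLAML):

```python
from sklearn.ensemble import RandomForestClassifier

# The reported best_hyperparams, copied verbatim from the discussion.
best_hyperparams = {
    "subsample": 1.0, "num_leaves": 256, "n_estimators": 300,
    "min_split_gain": 0.0, "min_child_samples": 30, "max_depth": -1,
    "learning_rate": 0.01, "colsample_bytree": 1,
}

# get_params() lists every constructor argument the estimator accepts;
# keep only the overlapping names.
valid = RandomForestClassifier().get_params()
usable = {k: v for k, v in best_hyperparams.items() if k in valid}
print(usable)  # only n_estimators and max_depth survive
```

Note also that `max_depth=-1` is LightGBM's "no limit" convention; sklearn's RF expects `max_depth=None` for the same behavior, so even the surviving value would need translation.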

I have the same issue. I use an sklearn Pipeline with FLAML and then try to reproduce the result with the same sklearn Pipeline. The results are totally different, not only for RF but also for k-neighbors (which has no random-seed effect).

```python
automl_pipeline = Pipeline([
    ("standardizer", standardizer),
    ("automl", automl),
])

automl_settings = {
    "time_budget": 240,
    "estimator_list": ["kneighbor"],  # or "rf"
    "eval_method": "cv",
    "split_type": "stratified",
    "n_splits": 5,
    "metric": "accuracy",
    "task": "classification",
    "log_file_name": "data.log",
    "seed": 42,
    "verbose": 5,
}
```
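A likely contributor for k-neighbors specifically: FLAML applies its own internal data transformation during `fit`, so a model refit outside FLAML may effectively see different inputs even when the pipeline looks the same. k-NN is scale-sensitive, so differently preprocessed inputs alone change its predictions. A minimal sklearn-only sketch with synthetic data (all names and values here are illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
# Feature 0 carries the label signal on a small scale;
# feature 1 is pure noise on a huge scale.
X = np.column_stack([
    np.r_[rng.normal(0, 1, 50), rng.normal(10, 1, 50)],
    rng.normal(0, 1000, 100),
])
y = np.r_[np.zeros(50), np.ones(50)]

# Same learner, two different views of the data.
raw = KNeighborsClassifier(n_neighbors=5).fit(X, y)
scaled = Pipeline([
    ("standardizer", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]).fit(X, y)

pred_raw = raw.predict(X)
pred_scaled = scaled.predict(X)
acc_raw = raw.score(X, y)        # distances dominated by the noise feature
acc_scaled = scaled.score(X, y)  # signal feature becomes visible
```

The two prediction vectors diverge even though the estimator and its hyperparameters are identical, which is the same symptom reported above: if FLAML's internal transform differs from the external pipeline's, the refit model cannot match.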

thinkall commented 3 months ago

Hi @zuoxu3310, have you tried https://github.com/microsoft/FLAML/discussions/1054#discussioncomment-6016340? If that doesn't work, you can set `skip_transform` to `True` in `automl_settings` and try again. It should then be reproducible.