microsoft / FLAML

A fast library for AutoML and tuning.
https://microsoft.github.io/FLAML/

Hyperparameter search space for Catboost? #144

Closed stepthom closed 3 years ago

stepthom commented 3 years ago

The search space for Catboost is rather limited; it only includes early_stopping_rounds and learning_rate:

https://github.com/microsoft/FLAML/blob/072e9e458819324f9f9436c3febeb034e80e6f4f/flaml/model.py#L620-L633

Is there a reason why other hyperparameters are not searched? I was thinking it might be interesting to include some of the others listed here:

https://catboost.ai/docs/concepts/python-reference_parameters-list.html
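
For context, other hyperparameters can in principle be searched without modifying `flaml/model.py` by registering a custom learner. A minimal sketch, assuming FLAML's `add_learner` API and the `search_space` classmethod (the exact signature and `tune` helpers may differ across versions; the added ranges are illustrative, not tuned):

```python
from flaml import AutoML, tune
from flaml.model import CatBoostEstimator

class WideCatBoost(CatBoostEstimator):
    """CatBoostEstimator with extra tuned dimensions (illustrative ranges)."""

    @classmethod
    def search_space(cls, data_size, **params):
        # Keep the built-in space (early_stopping_rounds, learning_rate)
        # and add two more dimensions on top of it.
        space = super().search_space(data_size, **params)
        space["depth"] = {"domain": tune.randint(lower=4, upper=11), "init_value": 6}
        space["l2_leaf_reg"] = {"domain": tune.loguniform(lower=1, upper=10), "init_value": 3}
        return space

automl = AutoML()
automl.add_learner(learner_name="wide_catboost", learner_class=WideCatBoost)
# automl.fit(X, y, task="classification", estimator_list=["wide_catboost"], time_budget=60)
```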

sonichi commented 3 years ago

We have not tried those. Would you like to explore a different search space?

stepthom commented 3 years ago

A different search space might yield better results for my current project, yes. I have noticed that the best loss for catboost is always worse than for xgboost and lgbm. I was wondering whether there was a particular reason catboost's search space is smaller, but it sounds like there is not. I will experiment with a different/larger search space, and if I learn anything interesting, I will report back here.
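
For what it's worth, a minimal sketch of how the per-learner comparison can be inspected after a run, assuming the `AutoML` object exposes `best_loss_per_estimator` (attribute availability may vary by FLAML version):

```python
from flaml import AutoML
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
automl = AutoML()
automl.fit(X, y, task="classification", time_budget=60,
           estimator_list=["catboost", "xgboost", "lgbm"])
print(automl.best_loss_per_estimator)  # best validation loss per learner
print(automl.best_estimator)           # overall winner
```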

AlgoAIBoss commented 3 years ago

If you are interested in exploring CatBoost hyperparameters, here are most of them:

```python
params = {
    "iterations": 100,                 # default 1000; if decreased, increase learning_rate (high iterations pair with low learning_rate)
    "learning_rate": 0.03,
    "depth": 2,                        # up to 16 for all loss functions except ranking, for which up to 8
    "loss_function": "CrossEntropy",   # alternatives: "Logloss" or a custom objective
    "eval_metric": "Accuracy",
    "custom_loss": ["Accuracy"],
    "verbose": False,                  # controls output; alternatives: "silent": True or "logging_level": "Silent"
    "od_type": "Iter",                 # early stopping
    "od_wait": 40,                     # early stopping
    "use_best_model": True,            # default True
    "random_seed": 42,
    "one_hot_max_size": 30,            # default 3; categorical features with more values are encoded statistically, which is expensive
    "early_stopping_rounds": 20,       # overfitting detector
    "bagging_temperature": 1,          # assigns weights to samples, only if bootstrap_type == "Bayesian"
    "bootstrap_type": "Bayesian",      # or "Bernoulli"
    "nan_mode": "Min",                 # default "Min"; also "Max" or "Forbidden" (does not handle missing values)
    "task_type": "GPU",
    "max_ctr_complexity": 5,           # feature combinations; default 3, disable with 1, max = number of categorical features
    "boosting_type": "Ordered",        # default "Ordered"; alternative "Plain"
    "rsm": 0.1,                        # speeds up training without hurting quality; use only with hundreds of features
    "border_count": 32,                # default 128; on GPU, set 32 to speed up training without hurting quality
    "leaf_estimation_method": "Newton",
    "l2_leaf_reg": 3,
    "auto_class_weights": "Balanced",  # for imbalanced data
    "has_time": True,                  # time series: detects datetime order and splits accordingly
    "combinations_ctr": [              # CTR feature engineering; supported on GPU
        "FloatTargetMeanValue", "FeatureFreq", "BinarizedTargetMeanValue",
        "Borders", "Buckets", "Borders:TargetBorderCount=4",
        "Counter:CtrBorderCount=40:Prior=0.5/1",
    ],
    "simple_ctr": [
        "FloatTargetMeanValue", "FeatureFreq", "BinarizedTargetMeanValue",
        "Borders", "Buckets", "Borders:TargetBorderCount=4",
        "Counter:CtrBorderCount=40:Prior=0.5/1",
    ],
}
```
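
A minimal usage sketch with a small subset of these parameters, using the plain `catboost` API directly (values are illustrative, not tuned):

```python
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=3,
    bootstrap_type="Bayesian",
    bagging_temperature=1,
    od_type="Iter",  # early stopping by iteration count
    od_wait=40,
    verbose=False,
)
model.fit(X_train, y_train, eval_set=(X_valid, y_valid), use_best_model=True)
print(model.get_best_score())  # best metric values on the eval set
```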

That said, I would not recommend tuning all of them; in my experience, tuning every parameter does not yield good results. I would instead recommend CatBoost's official tutorials, which give more information about the feature-generation hyperparameters that improve accuracy. This was my first contribution to the open-source community; I hope you find it helpful.