h2oai / driverlessai-recipes

Recipes for Driverless AI
Apache License 2.0

Catboost generates very large artifacts and can have unstable learning #47

Open pseudotensor opened 4 years ago

pseudotensor commented 4 years ago

https://github.com/catboost/catboost/issues/1023
https://github.com/catboost/catboost/issues/1028

pseudotensor commented 3 years ago

Hello! I guess you have lots of categorical features in your dataset (possibly with high cardinality). During training we generate CTR tables for categorical features on the fly, as they are needed, so it is entirely normal that GPU memory usage shows practically no correlation with the resulting model size: after training we calculate all selected CTR tables and save them in the model object.

To reduce model size, in 0.24 we finally implemented model size regularization, which penalizes model splits that use large CTR tables. model_size_reg is now turned on by default on both CPU and GPU and is set to 0.5. You can experiment with this parameter, raising it to get a smaller model. You can also reduce model size by limiting CTR complexity with the max_ctr_complexity parameter; by default we greedily try combinations of up to 4 categorical features. You can read about these parameters in the new blog post on Towards Data Science and in the tutorial covering categorical feature parameters.
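For anyone landing here from a recipe, the two knobs mentioned above could be wired up roughly as follows. This is only a sketch: the helper names `compact_catboost_params` and `train_compact_model` are hypothetical, and the values 2.0 and 2 are illustrative assumptions (anything above the 0.5 default and below the default of 4, respectively), not settings recommended in the reply.

```python
def compact_catboost_params(model_size_reg=2.0, max_ctr_complexity=2, **extra):
    """Build CatBoost constructor kwargs that favor a smaller saved model.

    Hypothetical helper. The defaults here are illustrative assumptions:
    model_size_reg above the library default of 0.5 penalizes splits that
    rely on large CTR tables more strongly, and max_ctr_complexity below
    the default of 4 limits how many categorical features are combined.
    """
    if model_size_reg < 0:
        raise ValueError("model_size_reg must be non-negative")
    if max_ctr_complexity < 1:
        raise ValueError("max_ctr_complexity must be at least 1")
    params = {
        "model_size_reg": model_size_reg,
        "max_ctr_complexity": max_ctr_complexity,
    }
    params.update(extra)  # e.g. iterations, task_type, verbose
    return params


def train_compact_model(X, y, cat_features, **overrides):
    """Fit a CatBoostClassifier with the size-limiting parameters above.

    catboost is imported lazily so the parameter helper stays usable
    even where the package is not installed.
    """
    from catboost import CatBoostClassifier

    model = CatBoostClassifier(**compact_catboost_params(**overrides))
    model.fit(X, y, cat_features=cat_features)
    return model
```

To see whether the settings help on a given dataset, one option is to train once with the defaults and once with these parameters, call `save_model` on each, and compare the file sizes alongside the validation metric, since stronger regularization may cost some accuracy.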