sb-ai-lab / LightAutoML

Fast and customizable framework for automatic ML model creation (AutoML)
https://developers.sber.ru/portal/products/lightautoml
Apache License 2.0
1.08k stars, 47 forks

[TabularNLPAutoML] Add the ability to pass text features directly to CatBoost #141

Open EmotionEngineer opened 8 months ago

EmotionEngineer commented 8 months ago

🐛 Bug

Comparing two notebooks that use text features, one with LAMA and one with plain CatBoost, I get a significantly higher test RMSE with LAMA. I have tried everything I could think of, including leaving only CatBoost in LAMA and adjusting the CatBoost parameters manually. Maybe something is wrong with my LAMA implementation?

To Reproduce

CatBoost Notebook, LAMA Notebook

Expected behavior

Comparable accuracy to CatBoost when using LightAutoML

EmotionEngineer commented 7 months ago

I've traced the issue to CatBoost receiving numeric values (embeddings or TF-IDF encodings) from LightAutoML instead of the raw text features. In my case, passing the text columns directly through CatBoost's 'text_features' parameter yields better results than using the embeddings or TF-IDF features produced by LightAutoML.

I suggest adding a 'direct' option for text features, so that the raw text columns are passed to CatBoost's 'text_features' parameter unchanged and users can leverage CatBoost's built-in text processing for improved performance.
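
For reference, here is a minimal sketch of what "direct" means on the CatBoost side, outside LightAutoML. The helper names (`text_feature_indices`, `fit_catboost_on_raw_text`) and the column names in the usage note are hypothetical and only for illustration; `Pool(..., text_features=...)` and `CatBoostRegressor` are the existing CatBoost calls. It assumes `train_df` is a pandas DataFrame whose text columns hold plain strings with no missing values. This is not how LightAutoML currently wires CatBoost.

```python
from typing import Any, List, Sequence


def text_feature_indices(columns: Sequence[str], text_columns: Sequence[str]) -> List[int]:
    """Return the positional indices of the raw text columns within `columns`."""
    missing = [c for c in text_columns if c not in columns]
    if missing:
        raise ValueError(f"text columns not found: {missing}")
    return [i for i, c in enumerate(columns) if c in text_columns]


def fit_catboost_on_raw_text(train_df, target: str, text_columns: Sequence[str], **params: Any):
    """Fit a CatBoostRegressor that receives the text columns unencoded.

    Hypothetical helper: `train_df` is assumed to be a pandas DataFrame.
    catboost is imported lazily so the index helper works without it.
    """
    from catboost import CatBoostRegressor, Pool

    features = train_df.drop(columns=[target])
    pool = Pool(
        data=features,
        label=train_df[target],
        # Raw strings go straight to CatBoost's own tokenizer and dictionaries,
        # with no embedding or TF-IDF step in front of the model.
        text_features=text_feature_indices(list(features.columns), text_columns),
    )
    model = CatBoostRegressor(loss_function="RMSE", **params)
    model.fit(pool)
    return model
```

A call would look like `fit_catboost_on_raw_text(train_df, "price", ["title", "description"], iterations=500, verbose=False)`. On very small datasets CatBoost's default text dictionaries may end up empty, so the text-processing parameters might need tuning there.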