[TabularNLPAutoML] Add the ability to pass text features directly to CatBoost

sb-ai-lab / LightAutoML

Fast and customizable framework for automatic ML model creation (AutoML)

Apache License 2.0


🐛 Bug

Comparing two notebooks that use text features, one with LAMA and one with plain CatBoost, I get a significantly higher test RMSE with LAMA. I have tried everything I could think of, including leaving only CatBoost in LAMA and adjusting the CB parameters manually. Maybe something is wrong with my LAMA setup?

To Reproduce

Expected behavior

Comparable accuracy to CatBoost when using LightAutoML

I've traced the issue to CatBoost receiving embedding-encoded numeric values from LightAutoML instead of the raw text features. In my case, passing the text columns directly through CatBoost's 'text_features' parameter yields better results than the embeddings or TF-IDF features produced by LightAutoML.
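For reference, here is a minimal sketch of what "passing text directly" means on the plain CatBoost side. It is not LightAutoML code: the helper names `split_columns` and `fit_catboost_on_raw_text` are hypothetical, the iteration count is arbitrary, and it assumes a pandas DataFrame whose text columns hold strings without missing values, plus a CatBoost version that supports text features for regression.

```python
from typing import List, Sequence, Tuple


def split_columns(
    columns: Sequence[str], text_columns: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Split column names into raw-text columns and everything else."""
    unknown = [c for c in text_columns if c not in columns]
    if unknown:
        raise ValueError(f"Unknown text columns: {unknown}")
    text = [c for c in columns if c in text_columns]
    other = [c for c in columns if c not in text_columns]
    return text, other


def fit_catboost_on_raw_text(train_df, target, text_columns, iterations=200):
    """Fit CatBoost with the text columns passed as text_features.

    No embeddings or TF-IDF are computed beforehand; CatBoost applies its
    own built-in text processing to the raw strings.
    """
    # Imported lazily so split_columns can be used without CatBoost installed.
    from catboost import CatBoostRegressor, Pool

    feature_columns = [c for c in train_df.columns if c != target]
    text, _ = split_columns(feature_columns, text_columns)

    pool = Pool(
        data=train_df[feature_columns],
        label=train_df[target],
        text_features=text,  # raw string columns, referenced by name
    )
    model = CatBoostRegressor(
        iterations=iterations,  # arbitrary value for illustration
        loss_function="RMSE",
        verbose=False,
    )
    model.fit(pool)
    return model
```

With this setup the only preprocessing on my side is choosing which columns count as text; everything else is left to CatBoost.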

I suggest adding a 'direct' option for text feature handling, so that LightAutoML passes the raw text columns to CatBoost through its 'text_features' parameter. This would let users leverage CatBoost's built-in text processing, which gave better results in my case.
