Open EmotionEngineer opened 8 months ago
I've traced the issue to CatBoost receiving embedding-encoded numeric values from LightAutoML instead of the raw text features. In my case, passing the text columns directly to CatBoost via 'text_features' yields better results than using the embeddings or TF-IDF features produced by LightAutoML.
I suggest extending the handling of text features in LightAutoML with a 'direct' option, so that users can leverage CatBoost's built-in text processing (its 'text_features' parameter) for improved performance.
🐛 Bug
Comparing notebooks that use text features (LAMA vs. CatBoost), I get a significantly higher test RMSE with LAMA. I have tried everything: leaving only CatBoost in the LAMA pipeline and adjusting the CatBoost parameters manually. Maybe something is wrong with my LAMA implementation?
To Reproduce
- CatBoost Notebook
- LAMA Notebook
Expected behavior
Test RMSE comparable to standalone CatBoost when using LightAutoML