microsoft / FLAML

A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.
https://microsoft.github.io/FLAML/
MIT License

Handling categorical variables on new data #1101

Open krolikowskib opened 1 year ago

krolikowskib commented 1 year ago

Hey, I have two questions about how FLAML handles categorical variables on new data, i.e. data different from the initial training dataset (for example, during inference after model deployment).

  1. Does it handle new categories in categorical features (unseen during training)?
  2. The sklearn and XGBoost estimators use ordinal encodings of categorical features, but it seems the categorical codes are extracted during inference (code). Doesn't that mean the encodings will be different when running on a different dataset, thus mixing up the categories passed to the model? If so, sklearn's OrdinalEncoder would be a better choice here, since it persists the learned category-to-code mapping.
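To illustrate the concern in (2), here is a minimal sketch with hypothetical data showing how pandas category codes depend on which categories happen to be present in each dataset, so the same value can get a different code at inference time:

```python
import pandas as pd

# Hypothetical training and inference frames. The inference frame happens
# to be missing category "a", so the derived codes shift.
train = pd.DataFrame({"color": pd.Categorical(["a", "b", "c"])})
infer = pd.DataFrame({"color": pd.Categorical(["b", "c"])})

print(train["color"].cat.codes.tolist())  # [0, 1, 2]
print(infer["color"].cat.codes.tolist())  # [0, 1]  -- "b" is now 0, not 1
```

The codes are derived per-DataFrame rather than from a persisted mapping, which is exactly the mix-up described above.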
skzhang1 commented 1 year ago

Hi, thanks for your feedback.

  1. Does it handle new categories in categorical features (unseen during training)? No, it doesn't. The XGBoost library assumes that category mappings are managed by the application in both the training and testing phases, and FLAML follows the same logic. When new categories appear at test time, the category-to-code mapping will differ from the one used during training unless extra processing is done.
  2. Doesn't it mean that the encodings will be different when running on a different dataset, thus mixing the categories passed to the model? Yes, when running on a different dataset the encoding can change, and the categories passed to the model will be mixed up. OrdinalEncoder may be a better choice here. Thanks for your suggestion.

Reference: https://stackoverflow.com/questions/75698242/when-using-categorical-data-in-xgboost-how-do-i-maintain-the-implied-encoding
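As a sketch of the suggested alternative (hypothetical column and variable names, assuming scikit-learn >= 0.24): fitting `OrdinalEncoder` once on the training data and reusing the fitted encoder at inference keeps the category-to-code mapping fixed, and unseen categories can be mapped to a sentinel value:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Fit the encoder once on training data; unseen categories at inference
# are mapped to the sentinel -1 instead of shifting the other codes.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

train = pd.DataFrame({"color": ["a", "b", "c"]})
enc.fit(train)

infer = pd.DataFrame({"color": ["b", "c", "z"]})  # "z" unseen in training
print(enc.transform(infer).ravel().tolist())  # [1.0, 2.0, -1.0]
```

Because the mapping lives in the fitted encoder object, persisting it alongside the model (e.g. with joblib) keeps train-time and inference-time encodings consistent.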

krolikowskib commented 1 year ago

Thanks for your answer, @skzhang1.

I think this may be misleading for people who want to reuse the model on a different dataset, as in a production setting. Even if the categories are the same, the current implementation doesn't guarantee they will be encoded in the same way.

Did you consider making this more explicit in the documentation, or providing a way to easily reuse the best selected model without having to worry about categorical variables?

skzhang1 commented 1 year ago

Thanks for your suggestion! We will make it clear in the doc. @krolikowskib