damiandraxler opened this issue 1 year ago
@jameslamb Hi, is the description clear enough or shall I provide more information? I do think this is quite a severe and hidden issue/bug. In fact, as soon as someone trains a model on a loaded LightGBM Dataset with categorical columns, the model is very likely in deep trouble without the user even noticing (no warnings or errors). Of course I could also be missing something; any feedback would be very much appreciated. :-) thx
Thanks for using LightGBM and reporting this. I've reformatted your question a bit to make it easier to understand... please see https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax for some information about formatting text on GitHub.
Someone here will get back to you when we have time. Otherwise, the fastest way to resolve this issue is probably to investigate it further yourself.
Thanks for formatting the text. :-)
I have tried to come up with a solution, but unfortunately the parts where the categorical features are actually encoded seem to happen in the C++ backend, which was beyond my scope.
With some hints I could give it another try, though. Could you perhaps highlight where exactly in the C++ code the label encoding happens, both in the LightGBM Dataset and in the train (or predict) call?
Hey @damiandraxler. The encoding is done in the wrappers. For Python, it uses the codes of the pandas categorical features and stores the categories in the pandas_categorical attribute. However, these encodings are lost when saving to binary (only the encoded features are stored), which is why you see a difference.
For example, in your train set the mapping is A -> 0, B -> 1, C -> 2, D -> 3, but when you load the Dataset back from disk and use new categories the mapping becomes B -> 0, D -> 1, and similarly for the AA, B, D categories.
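To make the re-encoding concrete, here's a small pandas-only illustration (nothing LightGBM-specific; the codes are simply assigned in sorted order over whatever categories are present in the frame):

import pandas as pd

# Category codes are assigned per frame, in sorted order of the categories present:
s1 = pd.Series(["A", "B", "C", "D"], dtype="category")
print(list(s1.cat.categories), s1.cat.codes.tolist())  # ['A', 'B', 'C', 'D'] [0, 1, 2, 3]

s2 = pd.Series(["B", "D"], dtype="category")
print(list(s2.cat.categories), s2.cat.codes.tolist())  # ['B', 'D'] [0, 1]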
You can get your example to yield the same results by storing and restoring the mappings like so:
import lightgbm as lgb

train_lgb = lgb.Dataset(
    train_data.drop(columns=["target"]),
    label=train_data["target"],
    categorical_feature=categorical_features,
)
train_lgb.save_binary('train_lgb.bin')
mappings = train_lgb.pandas_categorical  # store the mapping
print(mappings)  # [['A', 'B', 'C', 'D']]
del train_lgb

train_lgb = lgb.Dataset('train_lgb.bin')
train_lgb.pandas_categorical = mappings  # restore the mapping before training
model = lgb.train(params, train_lgb)
which I agree is far from optimal. I'm not sure whether categorical features were considered when saving the dataset was originally implemented, but we'd need to store these mappings in the file to be able to restore them on load.
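As a sanity check (a sketch, assuming a test_data frame with the same categorical column as in your example), the booster trained this way inherits the restored mapping, so a batch containing only a subset of the categories is encoded consistently:

print(model.pandas_categorical)  # [['A', 'B', 'C', 'D']] -- inherited from the Dataset
# A batch containing only 'B' and 'D' is still encoded as B -> 1, D -> 3:
batch = test_data.drop(columns=["target"]).iloc[:2]
print(model.predict(batch))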
Thanks a lot @jmoralez, that's really reassuring and setting the mappings as you suggested indeed works.
I also think it's not ideal, though: it's then almost equivalent to doing the encoding myself in the first place and providing integer columns instead of category columns to lgb.Dataset (in both cases I have to track/store the mapping).
Any idea how difficult it would be to modify the save_binary function to store both the mappings and the Dataset together?
If that's not easily doable, then we should at least mention this in the documentation of the save_binary function (or perhaps directly in the categorical_feature description of the lgb.Dataset docs).
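In the meantime, here is a minimal workaround sketch (the helper names are hypothetical, not LightGBM API; it assumes the categories are JSON-serializable, e.g. strings) that keeps the mapping next to the binary file:

import json
import lightgbm as lgb

def save_dataset(dataset, path):
    # Save the binary Dataset plus a sidecar file holding the category mapping.
    dataset.save_binary(path)
    with open(path + ".categories.json", "w") as f:
        json.dump(dataset.pandas_categorical, f)

def load_dataset(path):
    # Load the binary Dataset and restore the mapping from the sidecar file.
    dataset = lgb.Dataset(path)
    with open(path + ".categories.json") as f:
        dataset.pandas_categorical = json.load(f)
    return dataset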
Description
When a model is trained using a LightGBM Dataset loaded from binary (saved via save_binary), the label-encoding mapping seems to be lost and is not passed to the booster object. As a result, when scoring data, the booster performs label encoding on the fly, based on the alphabetical order of the distinct categories present in the batch. This can lead to arbitrary predictions, especially when only a few categories are present in the batch: merely changing the categories metadata of a pandas categorical column can cause the same data point to receive different scores. This issue doesn't occur when the model is trained while the LightGBM Dataset is still in memory.
The documentation doesn't mention that LightGBM's internal handling of categorical features is incompatible with the save_binary function. Ideally, the categorical feature functionality would also work on loaded Datasets; currently it breaks completely when a Dataset is loaded from binary.
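For reference, a minimal sketch of the setup described above (column names and parameters are illustrative):

import lightgbm as lgb
import pandas as pd

train_data = pd.DataFrame({
    "cat_col": pd.Series(["A", "B", "C", "D"] * 25, dtype="category"),
    "target": [0, 1, 0, 1] * 25,
})
params = {"objective": "binary", "verbose": -1}

train_lgb = lgb.Dataset(train_data[["cat_col"]], label=train_data["target"],
                        categorical_feature=["cat_col"])
train_lgb.save_binary("train_lgb.bin")

# Training on the reloaded Dataset: the category mapping is gone.
model = lgb.train(params, lgb.Dataset("train_lgb.bin"))
print(model.pandas_categorical)  # None

# So at predict time the batch is encoded on the fly: here 'B' gets code 0
# (it was 1 during training) because only 'B' and 'D' are present.
batch = pd.DataFrame({"cat_col": pd.Series(["B", "D"], dtype="category")})
print(model.predict(batch))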
Environment info
LightGBM version or commit hash: 3.3.5
Command(s) you used to install LightGBM
Additional Comments
The same issue was observed on older LightGBM versions (3.2.1, 3.3.2).