microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Categorical feature functionality not working as expected for models trained on LightGBM DataSets stored to binary. #5902

Open damiandraxler opened 1 year ago

damiandraxler commented 1 year ago

Description

When a model is trained on a LightGBM Dataset that was saved to binary via Dataset.save_binary and then reloaded, the category-to-code mapping seems to be lost and is not passed to the Booster object. As a result, when scoring data, the Booster performs label encoding on the fly based on the alphabetical order of the distinct categories present in the batch of data. This can lead to arbitrary predictions, especially when only a few categories are present in the batch: merely changing the category metadata of a Pandas categorical column can cause the same data point to receive different scores. This issue doesn't occur when the model is trained while the LightGBM Dataset is still in memory.
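To illustrate the mechanism with plain pandas (no LightGBM involved): the integer codes that end up representing a categorical value depend entirely on the category metadata attached to the column, so the same raw value can be encoded differently depending on which categories are present or declared.

import pandas as pd

s = pd.Series(['D', 'B'])
# Categories inferred from the batch, in sorted order -> ['B', 'D']
print(s.astype('category').cat.codes.tolist())                            # [1, 0]
# Same values with the full training category list -> different codes
print(pd.Categorical(s, categories=['A', 'B', 'C', 'D']).codes.tolist())  # [3, 1]
# Same values with an arbitrary extra category -> yet another encoding
print(pd.Categorical(s, categories=['AA', 'B', 'D']).codes.tolist())      # [2, 1]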

The documentation doesn't mention that LightGBM's internal handling of categorical features is incompatible with the save_binary function. It would be beneficial if the categorical feature functionality also worked for Datasets loaded from binary; currently it breaks completely in that case.

Reproducible example
import lightgbm as lgb
import pandas as pd
import numpy as np
import pickle
from scipy.stats import bernoulli

print(lgb.__version__)

# -----------------------------  Step 1 ------------------------------
# Dummy training and validation datasets - one integer and one categorical feature:
np.random.seed(42)
num_samples = 1000

## Training data:
train_categories = ['A', 'B', 'C', 'D']

feature1 = np.random.randint(1, 3, size=num_samples)
feature2 = np.random.choice(train_categories, size=num_samples)
noise = np.random.normal(loc=0, scale=2, size=num_samples)

# Success probability depends on feature1 and on whether feature2 is 'A' or 'B' (plus noise):
signal = np.array([int(x) for x in ((feature2 == 'A') | (feature2 == 'B')) + noise > 1])
pvec = 1 / (1 + np.exp(-(feature1 * 0.5 + signal)))
target_values = np.array([bernoulli.rvs(pvec[k], size=1)[0] for k in range(num_samples)])

train_data = {
    'feature1': feature1,
    'feature2': feature2,
    'target': target_values
}

## Validation data - the validation categories are a subset of those in the training data:
val_categories = ['B', 'D']
feature1 = np.random.randint(1, 3, size=num_samples)
feature2 = np.random.choice(val_categories, size=num_samples)
noise = np.random.normal(loc=0, scale=2, size=num_samples)

# Same generating process as for the training data:
signal = np.array([int(x) for x in ((feature2 == 'A') | (feature2 == 'B')) + noise > 1])
pvec = 1 / (1 + np.exp(-(feature1 * 0.5 + signal)))
target_values = np.array([bernoulli.rvs(pvec[k], size=1)[0] for k in range(num_samples)])

valid_data = {
    'feature1': feature1,
    'feature2': feature2,
    'target': target_values
}

train_data = pd.DataFrame(train_data)
valid_data = pd.DataFrame(valid_data)
print('training dataset:')
print(train_data.head())
print('validation dataset:')
print(valid_data.head())
# -----------------------------  Step 2 ------------------------------
# Create LightGBM Dataset by converting feature2 to type pandas categorical and store it to binary:
categorical_features = ['feature2']
train_data[categorical_features] = train_data[categorical_features].astype('category')

train_lgb = lgb.Dataset(train_data.drop(columns=["target"]), label=train_data["target"], categorical_feature=categorical_features)
train_lgb.save_binary('train_lgb.bin') 
del train_lgb
# -----------------------------  Step 3 ------------------------------
# Train a LightGBM model on the loaded dataset:
train_lgb = lgb.Dataset('train_lgb.bin')
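# NOTE (see the discussion below): a Dataset constructed from the binary file no longer
# carries the pandas category-to-code mapping (pandas_categorical), so the model trained
# on it cannot re-apply the original encoding at predict time.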
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    "deterministic":True,
    'verbose': -1
}
model = lgb.train(params, train_lgb)
# -----------------------------  Step 4 ------------------------------
# Make predictions on training and validation datasets with equal entries:
data_point_query = "(feature1==1) & (feature2=='D')"

valid_data[categorical_features] = valid_data[categorical_features].astype('category')
print('\n')
print('Distinct categories in training data set: ',train_data.feature2.cat.categories)
print('Distinct categories in validation data set: ',valid_data.feature2.cat.categories)
print('\n')
print("Train instance prediction: {}".format(model.predict(train_data.query(data_point_query).drop("target", axis=1))[0]))
print("Valid instance prediction: {}".format(model.predict(valid_data.query(data_point_query).drop("target", axis=1))[0]))
print('\n')
print('The predictions are not the same even though the input data is exactly the same! The only difference is the Pandas category metadata, i.e. the list of distinct categories.')
print('\n')
print('This observation is made even clearer by adding an arbitrary category to the Pandas categorical:')
valid_data.feature2 = pd.Categorical(valid_data.feature2, categories=['AA', 'B', 'D'])
print('\n')
print('Distinct categories in validation data set: ',valid_data.feature2.cat.categories)
print('\n')
print("Valid instance prediction: {}".format(model.predict(valid_data.query(data_point_query).drop("target", axis=1))[0]))
print('Indeed, we get a third different prediction on the same data!')
print('\n')
# -----------------------------  Step 5 ------------------------------
# We now repeat steps 2-4, but instead of saving and reloading the LightGBM Dataset we train the model while the Dataset is still in memory:
del model, train_lgb
# -----------------------------  Step 6 ------------------------------
# Create LightGBM Dataset by converting feature2 to type pandas categorical:
categorical_features = ['feature2']
train_data[categorical_features] = train_data[categorical_features].astype('category')

train_lgb = lgb.Dataset(train_data.drop(columns=["target"]), label=train_data["target"], categorical_feature=categorical_features)
# -----------------------------  Step 7 ------------------------------
# Train a LightGBM model:
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    "deterministic":True,
    'verbose': -1
}
model = lgb.train(params, train_lgb)
# -----------------------------  Step 8 ------------------------------
# Make predictions on training and validation datasets with equal entries:
data_point_query = "(feature1==1) & (feature2=='D')"

valid_data[categorical_features] = valid_data[categorical_features].astype('category')
valid_data.feature2 = pd.Categorical(valid_data.feature2, categories=['B', 'D'])

print('Distinct categories in training data set: ',train_data.feature2.cat.categories)
print('Distinct categories in validation data set: ',valid_data.feature2.cat.categories)
print('\n')
print("Train instance prediction: {}".format(model.predict(train_data.query(data_point_query).drop("target", axis=1))[0]))
print("Valid instance prediction: {}".format(model.predict(valid_data.query(data_point_query).drop("target", axis=1))[0]))
print('\n')
print('The predictions are now exactly the same even though the list of distinct categories is still different.')
print('\n')
print('Let us again add an arbitrary category to the Pandas categorical:')
valid_data.feature2 = pd.Categorical(valid_data.feature2, categories=['AA', 'B', 'D'])
print('\n')
print('Distinct categories in validation data set: ',valid_data.feature2.cat.categories)
print('\n')
print("Valid instance prediction: {}".format(model.predict(valid_data.query(data_point_query).drop("target", axis=1))[0]))
print('We now get the exact same prediction, as it should be! The only difference was indeed whether we trained on a LightGBM Dataset loaded from binary or on one still in memory.')

Environment info

LightGBM version or commit hash: 3.3.5

Command(s) you used to install LightGBM:

pip install lightgbm==3.3.5

Other package versions: pandas 1.3.5, numpy 1.21.2

Additional Comments

The same issue was observed on older LightGBM versions (3.2.1, 3.3.2).

damiandraxler commented 1 year ago

@jameslamb Hi, is the description clear enough or shall I provide more information? I do think this is quite a severe and hidden issue/bug. In fact, as soon as someone trains a model on a loaded LightGBM Dataset with categorical columns, the model is very likely in deep trouble without the user even noticing it (no warnings or errors). Of course I could also be missing something; any feedback would be very much appreciated. :-) thx

jameslamb commented 1 year ago

Thanks for using LightGBM and reporting this. I've reformatted your question a bit to make it easier to understand... please see https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax for some information about formatting text on GitHub.

Someone here will get back to you when we have time. Otherwise, the fastest way to resolve this issue is probably to investigate it further yourself.

damiandraxler commented 1 year ago

Thanks for formatting the text. :-)

I have tried to come up with a solution, but unfortunately the parts where the categorical features are actually encoded seem to live in the C++ backend, which is beyond my expertise.

With some hints I could give it another try though. Could you perhaps point out where exactly in the C++ code the label encoding happens, both when the LightGBM Dataset is built and in the train (or predict) call?

jmoralez commented 1 year ago

Hey @damiandraxler. The encoding is done in the wrappers. For Python, it uses the codes of the pandas categorical features and stores the category lists in the pandas_categorical attribute. However, these encodings are lost when saving to binary (the file stores only the already-encoded features), which is why you see a difference.

For example, in your training data the mapping is A -> 0, B -> 1, C -> 2, D -> 3, but when you load the Dataset back from disk and use new categories the mapping becomes B -> 0, D -> 1, and similarly for the AA, B, D categories.

You can get your example to yield the same results by storing and restoring the mappings like so:

train_lgb = lgb.Dataset(
    train_data.drop(columns=["target"]),
    label=train_data["target"],
    categorical_feature=categorical_features,
)
train_lgb.save_binary('train_lgb.bin')
mappings = train_lgb.pandas_categorical  # store mapping
print(mappings) # [['A', 'B', 'C', 'D']]
del train_lgb
train_lgb = lgb.Dataset('train_lgb.bin')
train_lgb.pandas_categorical = mappings  # restore mapping on the reloaded Dataset
model = lgb.train(params, train_lgb)

which I agree is far from optimal. I'm not sure if categorical features were originally considered when saving the dataset but we'd need to store these mappings in the file to be able to restore them on load.
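In the meantime, a user-level workaround (just a sketch building on the code above; the helper names are made up, and it assumes JSON-serializable, e.g. string, categories) is to persist pandas_categorical in a sidecar file next to the binary Dataset and re-attach it after loading:

import json
import lightgbm as lgb

def save_dataset_with_mapping(dataset, path):
    # Write the binary Dataset plus a JSON sidecar holding the category mappings.
    dataset.save_binary(path)
    with open(path + '.categories.json', 'w') as f:
        json.dump(dataset.pandas_categorical, f)

def load_dataset_with_mapping(path):
    # Reload the binary Dataset and re-attach the category mappings.
    dataset = lgb.Dataset(path)
    with open(path + '.categories.json') as f:
        dataset.pandas_categorical = json.load(f)
    return dataset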

damiandraxler commented 1 year ago

Thanks a lot @jmoralez, that's really reassuring and setting the mappings as you suggested indeed works.

I also think that it's not ideal though, as it's then almost equivalent to doing the encoding myself in the first place and providing only integer columns instead of category columns to lgb.Dataset (in both cases I have to track/store the mapping myself).
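For comparison, that manual alternative would look roughly like this (sketch only, reusing train_data and valid_data from the example above):

# Build the category-to-code mapping once from the training data and apply it everywhere.
cat_to_code = {c: i for i, c in enumerate(train_data['feature2'].astype('category').cat.categories)}

train_encoded = train_data.assign(feature2=train_data['feature2'].astype(str).map(cat_to_code))
valid_encoded = valid_data.assign(feature2=valid_data['feature2'].astype(str).map(cat_to_code))

train_lgb = lgb.Dataset(
    train_encoded.drop(columns=['target']),
    label=train_encoded['target'],
    categorical_feature=['feature2'],
)
# cat_to_code still has to be stored alongside the model / binary Dataset,
# which is exactly the bookkeeping the pandas categorical support is meant to avoid.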

Any idea how difficult it would be to modify the save_binary function to store both the mappings and the Dataset together?

If that's not easily doable, then this should at least be mentioned in the documentation of the save_binary function (or perhaps directly in the categorical feature description of the lgb.Dataset docs).