microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

predict() requires DataFrame to have category dtype, but should be able to infer which fields are categorical #5244

Open johnpaulett opened 2 years ago

johnpaulett commented 2 years ago

Description

A DataFrame containing several categorical columns is used for train(). If a DataFrame with the same column names (but none as type category) is used for predict(), a ValueError occurs:

ValueError: train and valid dataset categorical_feature do not match.

However, this categorical information is persisted with the saved model as pandas_categorical, and it seems like it should be inferable at prediction time.

I understand one "solution" would be to convert the predict DataFrame's columns to categories, but this is not feasible when using a tool like kserve, where the model is loaded from a saved .bst file and the input is dynamically converted from JSON in an HTTP POST.

Thus for kserve, which receives a JSON input, the input is converted into a category-less DataFrame. But it would seem that the set of categorical features could be inferred from the saved model.

I'd be open to writing a PR, but I'm hoping to get any guidance/thoughts. It would seem that _data_from_pandas could use pandas_categorical and the booster's feature_name to determine whether data contains features that should be converted to category dtype.
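For illustration only, a rough sketch of the kind of coercion being proposed (the helper name is made up and the categorical feature names are assumed to be recoverable from the booster; this is not existing LightGBM code):

import pandas as pd

def coerce_categoricals(df, categorical_names, pandas_categorical):
    # categorical_names: assumed list of feature names that were trained as categorical
    # pandas_categorical: the per-column category lists LightGBM saves with the model
    df = df.copy()
    for name, categories in zip(categorical_names, pandas_categorical):
        # Reusing the training-time categories keeps the integer codes consistent.
        df[name] = pd.Categorical(df[name], categories=categories)
    return df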

Reproducible example

import lightgbm as lgb
import pandas as pd

train = pd.DataFrame([
    {"age": 28, "city": "london", "label": 1.0},
    {"age": 29, "city": "london", "label": 2.0},
    {"age": 30, "city": "london", "label": 3.0},
    {"age": 31, "city": "london", "label": 4.0},
    {"age": 32, "city": "london", "label": 5.0},
    {"age": 33, "city": "london", "label": 6.0},
    {"age": 28, "city": "nyc", "label": 11.0},
    {"age": 29, "city": "nyc", "label": 12.0},
    {"age": 30, "city": "nyc", "label": 13.0},
    {"age": 31, "city": "nyc", "label": 14.0},
    {"age": 32, "city": "nyc", "label": 15.0},
    {"age": 33, "city": "nyc", "label": 16.0},
])
train = pd.concat([train for _ in range(0, 100)]).reset_index(drop=True)

train = train.astype({
    "city": "category"
})
print(train.dtypes)
# age         int64
# city     category
# label     float64
# dtype: object

bst = lgb.train({}, lgb.Dataset(train[["age", "city"]], train["label"]))

Success

Successful predict() when forcing data to category

test = pd.DataFrame([
    {"age": 28, "city": "london"},
    {"age": 31, "city": "london"},
    {"age": 31, "city": "nyc"},
]).astype({
    "city": "category"
})
bst.predict(test)

# array([ 1.00019921,  4.00011953, 13.99985391])

Failure

Fails when city is not set as a category (even though Booster.pandas_categorical effectively knows it is a categorical column). This is the situation with kserve.

test = pd.DataFrame([
    {"age": 28, "city": "london"},
    {"age": 31, "city": "london"},
    {"age": 31, "city": "nyc"},
])
bst.predict(test)
File .venv/lib/python3.8/site-packages/lightgbm/basic.py:575, in _data_from_pandas(data, feature_name, categorical_feature, pandas_categorical)
    573 else:
    574     if len(cat_cols) != len(pandas_categorical):
--> 575         raise ValueError('train and valid dataset categorical_feature do not match.')
    576     for col, category in zip(cat_cols, pandas_categorical):
    577         if list(data[col].cat.categories) != list(category):

ValueError: train and valid dataset categorical_feature do not match.

Environment info

LightGBM version: 3.3.2 (python)

Command(s) you used to install LightGBM

poetry add lightgbm

Additional Comments

jmoralez commented 2 years ago

Hi @johnpaulett, thanks for raising this. I'm not sure if LightGBM is the right place to do that conversion. Wouldn't it be possible to make the change on kserve to take more information in the request? It could take the categorical features in the request and do the conversion there; that way you could do the conversion to the categorical codes yourself and send an array to predict.
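For instance, a minimal sketch of doing that conversion on the serving side, assuming the request handler is told which columns are categorical (the function name and payload shape here are hypothetical):

import pandas as pd

def payload_to_frame(rows, categorical_columns):
    # rows: list of dicts decoded from the JSON request body
    df = pd.DataFrame(rows)
    for col in categorical_columns:
        df[col] = df[col].astype("category")
    return df

# e.g. df = payload_to_frame(request_json["instances"], ["city"])
#      preds = bst.predict(df)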

Please feel free to discuss this further here.

johnpaulett commented 2 years ago

@jmoralez Thanks for the feedback!

I do think LightGBM is the right place. The fact that a feature is categorical is an implementation decision that I, as an ML Engineer, have made about my model. Ignoring kserve: as LightGBM exists right now, if I hand off the saved model to a colleague to use in their application, they also need to implement the pipeline that casts specific fields to categorical. If I later decide that the field is not categorical and train a new model, then that pre-processing needs to be adjusted to match in their application.

Effectively, the inference process is poorly encapsulated because my model's implementation details leak out into their code -- yet LightGBM has all the knowledge necessary to encapsulate this categorical decision at prediction time via feature_name and the categorical_feature param.

Furthermore, I'd argue that LightGBM, via the pandas_categorical saved in the model, is already partially attempting this encapsulation: without it, I would need to coordinate with my colleagues to ensure a consistent encoding of categorical values to an int32 representation. With it, they can simply send a string in the feature and LightGBM will translate it consistently. Allowing predict() to automatically apply this encoding to columns (whether they have a category dtype or not) seems like a logical conclusion of the use of pandas_categorical.
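To make the encoding concern concrete, here is a small standalone pandas illustration (not LightGBM internals) of how the same string values map to different integer codes when the category lists differ:

import pandas as pd

train_city = pd.Categorical(["london", "nyc"], categories=["london", "nyc"])
test_city = pd.Categorical(["london", "nyc"], categories=["nyc", "london"])

print(train_city.codes)  # [0 1]
print(test_city.codes)   # [1 0] -- same strings, different integer codes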

To address your suggestions directly:

Take the categorical features in the request and do the conversion there.

Again, the caller would still need to know implementation details about my model.

That said, I did investigate an alternative to the default lgbserver for hosting LightGBM in kserve: Seldon's mlserver running inside kserve. It does allow passing specific data types, but does not appear to support passing fields as a category dtype (only the underlying numpy dtypes).

An alternative workaround: kserve's lgbserver could read LightGBM's saved model file again after loading it via Booster(model_file=...), parse out feature_names=... and [categorical_feature: ...], and call .astype('category') on the relevant fields before passing data into predict(). This feels like kserve would then need internal knowledge of the LightGBM file format (and an assumption that it does not change significantly in the future), as well as re-reading a file that LightGBM has already parsed but from which it did not pull the categorical_feature param. I'm likely going to try this out as a quick workaround, as I hit a wall with https://github.com/microsoft/LightGBM/pull/5246 (the C code does not read the params back out of the model_file), but I have low expectations that kserve would merge that PR.

That way you could do the conversion to the categorical codes yourself and send an array to predict.

If I understand correctly, that would mean losing the benefit of pandas_categorical -- so my colleagues and I would then need to communicate the encoding out of band (e.g. in a separate file or database).

We have found great success using LightGBM from a research perspective, but we are now in the cycle of putting these models to use in a production environment, and we think making LightGBM's predict() more intelligent would be a step towards much easier MLOps.

jameslamb commented 2 years ago

the C code does not read the params back out of the model_file

I haven't read the whole conversation here closely yet, but just want to add that this particular issue you called out is something that maintainers are aware of, tracked in #2613.

One contributor attempted a fix in #4802, but that PR has been stalled since January 2022 waiting on some other maintainers to respond and help move it forward: https://github.com/microsoft/LightGBM/pull/4802#discussion_r786429915.

liangfu commented 1 year ago

Just as a quick note, I'm currently using the following code snippet to fetch categorical_feature from the model:

#### HACK, HACK and HACK
import ast

def get_categorical_feature(model):
    # Dump the model to its text representation and find the line that stores
    # the categorical_feature parameter, e.g. "[categorical_feature: 1,3]".
    model_str_list = model.model_to_string().split("\n")
    categorical_feature_str = list(filter(lambda x: 'categorical_feature' in x, model_str_list))[0]
    # Re-parse the comma-separated indices into a Python list of column indices.
    categorical_feature = ast.literal_eval("[" + categorical_feature_str.split()[1])
    return categorical_feature

Then convert these columns to category type for prediction.

feature_name = model.feature_name()
categorical_feature = get_categorical_feature(model)
# Cast the columns that were trained as categorical back to category dtype
# before predicting.
X = pd.read_csv("xx.csv").astype(dict(
    [(feature_name[c], 'category') for c in categorical_feature]
))
y_pred_ref = model.predict(X)

This is quite unreliable, but it can act as a temporary solution.
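To double-check that the re-mapped categories line up with what the model saw at training time, one can compare against the booster's saved category lists. pandas_categorical is an internal Booster attribute and the ordering alignment below is an assumption, so treat this as a diagnostic sketch only:

feature_name = model.feature_name()
categorical_feature = get_categorical_feature(model)  # helper from above
# model.pandas_categorical holds the per-column category lists saved at training time.
for idx, saved_categories in zip(categorical_feature, model.pandas_categorical):
    col = feature_name[idx]
    print(col, list(X[col].cat.categories) == list(saved_categories))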

hzh8311 commented 1 year ago

@liangfu when I prepare the training data with the lgb.Dataset(categorical_feature=["xx"], ...) API and predict with the code you posted, the results differ depending on whether "xx" is set to category type. When "xx" is converted to category, the result drops from 4.5 to 4.0 compared with not converting.