jpmml / jpmml-lightgbm

Java library and command-line application for converting LightGBM models to PMML
GNU Affero General Public License v3.0

Incomplete `pandas_categorical` section? #62

Closed vruusmann closed 9 months ago

vruusmann commented 9 months ago

See https://github.com/jpmml/jpmml-lightgbm/issues/22#issuecomment-1824377083

In the sample LightGBM model file, the pandas_categorical section contains a two-element list (a list of two lists). This suggests that the model is dealing with two categorical features.

However, during parsing, the LightGBM model file appears to contain an instruction "use the third categorical feature". This instruction is in conflict with the pandas_categorical section in the same file.

What could be happening? Perhaps the LightGBM model file only "materializes" complex categorical features? If a categorical feature is not present in the pandas_categorical section, then perhaps it is a simple/primitive categorical feature (i.e. an integer whose value range is 0, 1, ..., n - 1)?

Should check in the LightGBM-SkLearn library exactly how the pandas_categorical section is formed. Perhaps there were breaking changes between v2/v3 and v4.
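To inspect what a given model file actually declares, the section can be read back directly. The following is a minimal sketch, assuming that the section is stored as a single line starting with `pandas_categorical:` followed by a JSON value near the end of the text file. The helper name `read_pandas_categorical` and the sample text are hypothetical and heavily truncated.

```python
import json

PANDAS_CATEGORICAL_PREFIX = "pandas_categorical:"


def read_pandas_categorical(model_text):
    """Return the parsed pandas_categorical value, or None if it is absent or null."""
    # The section is assumed to sit near the end of the file, so scan backwards.
    for line in reversed(model_text.splitlines()):
        if line.startswith(PANDAS_CATEGORICAL_PREFIX):
            return json.loads(line[len(PANDAS_CATEGORICAL_PREFIX):])
    return None


# Hypothetical, truncated stand-in for the text of a saved model file.
sample_model_text = "\n".join([
    "tree",
    "feature_names=f0 f1 f2",
    "end of trees",
    "",
    'pandas_categorical:[["a", "b"], [70, 71, 72]]',
])

categories = read_pandas_categorical(sample_model_text)
print(len(categories))  # number of columns that have a category list
```

Comparing this count with the number of features that the model treats as categorical shows whether the section covers all of them.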

vruusmann commented 9 months ago

This issue occurs when the JPMML-LightGBM library attempts to reconstruct the model schema based on the LightGBM model file.

This issue will not occur if the schema information is specified explicitly, for example, when the LightGBM model is trained using a Scikit-Learn pipeline approach.

jean-jm commented 9 months ago

> This issue occurs when the JPMML-LightGBM library attempts to reconstruct the model schema based on the LightGBM model file.
>
> This issue will not occur if the schema information is specified explicitly, for example, when the LightGBM model is trained using a Scikit-Learn pipeline approach.

I trained my model using the native LightGBM API, not the sklearn API, so I cannot say how this behaves with sklearn.

jean-jm commented 9 months ago

> See #22 (comment)
>
> In the sample LightGBM model file, the pandas_categorical section contains a two-element list (a list of two lists). This suggests that the model is dealing with two categorical features.
>
> However, during parsing, the LightGBM model file appears to contain an instruction "use the third categorical feature". This instruction is in conflict with the pandas_categorical section in the same file.
>
> What could be happening? Perhaps the LightGBM model file only "materializes" complex categorical features? If a categorical feature is not present in the pandas_categorical section, then perhaps it is a simple/primitive categorical feature (i.e. an integer whose value range is 0, 1, ..., n - 1)?
>
> Should check in the LightGBM-SkLearn library exactly how the pandas_categorical section is formed. Perhaps there were breaking changes between v2/v3 and v4.

Your guess may be right! My model has more than three categorical features, but I converted only two of them to the pandas "category" type, because the others were already of int type. So I guess this failure would not happen if I converted all of the categorical features to the pandas "category" type.

I solved this issue by encoding these pandas "category" columns into the int type (hash).

vruusmann commented 9 months ago

> I solved this issue by encoding these pandas "category" columns into the int type (hash).

That's great feedback!

It would be prudent to always encode all categorical columns using the pandas.CategoricalDtype data type. I believe that this happens semi-automatically in the "Scikit-Learn pipeline" approach, but it needs to be remembered and done manually in the "direct LightGBM" approach.

Anyway, the JPMML-LightGBM converter should contain a handler for such a scenario, and do the following:

  1. Issue a warning - "The pandas_categorical section might be incomplete, please make sure that all categorical columns have been cast into the pandas.CategoricalDtype data type".
  2. Proceed with a conversion, by creating a virtual-synthetic categorical feature (of int data type).

Will close this issue with the appropriate code change(s) when done.
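The two-step handler proposed above can be sketched as follows. This is an illustration in Python only, not the converter's actual (Java) code. The function `resolve_category_values` and its inputs are hypothetical, and it assumes that a pandas category list is indexed by the integer code of each category.

```python
import warnings


def resolve_category_values(int_codes, pandas_values=None):
    """Map the integer codes of one categorical feature to category values.

    int_codes: integer codes observed for the feature (hypothetical input).
    pandas_values: the matching pandas_categorical entry, or None if missing.
    """
    if pandas_values is None:
        # Step 1: issue a warning about the possibly incomplete section.
        warnings.warn(
            "The pandas_categorical section might be incomplete, please make "
            "sure that all categorical columns have been cast into the "
            "pandas.CategoricalDtype data type"
        )
        # Step 2: fall back to a synthetic categorical feature of int data type.
        return list(int_codes)
    # Normal case: each code is a position in the pandas category list.
    return [pandas_values[code] for code in int_codes]


print(resolve_category_values([0, 2], ["a", "b", "c"]))
```

The fallback keeps the raw integer codes as category values, which is only meaningful when the training column really held plain integers.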

jean-jm commented 9 months ago

thank you so much!

jean-jm commented 9 months ago

Hello, I have received your email. Thank you!

vruusmann commented 9 months ago

This IllegalArgumentException indicates a conflict between the header and pandas_categorical sections of a LightGBM model file. The conflict can be characterized as "the pandas_categorical section appears incomplete". The JPMML-LightGBM library typically cannot correct it on the fly.

The conflict is specific to LightGBM models that were trained using the Learning API. It indicates a combined data preparation/model configuration error, where some of the columns that were marked as categorical using the categorical_feature attribute have a data type other than pandas.CategoricalDtype.

In the following example, the categorical_feature attribute declares that the first three columns should be interpreted as categorical. However, two of them carry the int data type (rather than the category data type), so they are not included in the pandas_categorical section, which then contains one entry instead of three.

```python
import lightgbm
import pandas

df = pandas.read_csv("Auto.csv")

# Copy the selection, so that the casts below modify an independent frame
X = df[["origin", "model_year", "cylinders"]].copy()
y = df["mpg"]

# Cast two columns to integer, and only one to Pandas' categorical
X["origin"] = X["origin"].astype(int)
X["model_year"] = X["model_year"].astype("category")
X["cylinders"] = X["cylinders"].astype(int)

# Declare all three columns as categorical
ds = lightgbm.Dataset(X, label = y, categorical_feature = [0, 1, 2])

params = {
    "objective" : "regression",
    "max_depth" : 3
}

booster = lightgbm.train(params, train_set = ds, num_boost_round = 11)

booster.save_model("Auto.txt")
```
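A mismatch like the one in the example above could be caught before training with a small guard. The sketch below is not part of LightGBM or JPMML-LightGBM. The helper `find_non_category_columns` is hypothetical, and it takes a plain mapping of column names to data type names so that it runs without pandas.

```python
def find_non_category_columns(dtypes, categorical_feature):
    """Return the declared categorical columns whose data type is not "category".

    dtypes: ordered mapping of column name -> data type (or data type name).
    categorical_feature: column indices declared as categorical.
    """
    columns = list(dtypes)
    return [
        columns[index]
        for index in categorical_feature
        if str(dtypes[columns[index]]) != "category"
    ]


# Data type names mirroring the three columns of the example above.
dtypes = {"origin": "int64", "model_year": "category", "cylinders": "int64"}

offenders = find_non_category_columns(dtypes, [0, 1, 2])
print(offenders)  # ['origin', 'cylinders']
```

With a real data frame, the mapping could be obtained as `dict(X.dtypes)`, and the reported columns could then be cast using `X[offenders] = X[offenders].astype("category")` before constructing the dataset.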