Closed vruusmann closed 10 months ago
This issue happens when the JPMML-LightGBM library attempts to reconstruct the model schema based on the LightGBM model file.
It does not happen if the schema information is specified explicitly, for example when the LightGBM model is trained using the Scikit-Learn pipeline approach.
I trained my model using the LightGBM API directly and did not use the sklearn API, so I don't know how this behaves with sklearn.
See #22 (comment)
In the sample LightGBM model file, the pandas_categorical section contains a two-element list (a list of two lists). This suggests that the model is dealing with two categorical features. However, during parsing, the LightGBM model file appears to contain an instruction "use the third categorical feature". This instruction is in conflict with the pandas_categorical section in the same file.

What could be happening? Perhaps the LightGBM model file only "materializes" complex categorical features? If a categorical feature is not present in the pandas_categorical section, then perhaps it is a simple/primitive categorical feature (i.e. an integer whose value range is 0, 1, .., n - 1)?

Should check with the LightGBM-SkLearn library how exactly the pandas_categorical section is formed. Perhaps there were breaking changes between v2/v3 and v4.
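To see how many entries the pandas_categorical section of a given model file holds, something like the following sketch may help. It assumes that the section is stored as a single line starting with `pandas_categorical:` followed by a JSON value, which is how the LightGBM Python package writes it. The function name `read_pandas_categorical` and the sample text are made up for illustration; the sample is hand-written and is not output from a real model.

```python
import json

def read_pandas_categorical(model_text):
    """Return the parsed pandas_categorical value, or None if the section is absent."""
    prefix = "pandas_categorical:"
    # The section sits near the end of the file, so scan from the bottom
    for line in reversed(model_text.splitlines()):
        if line.startswith(prefix):
            return json.loads(line[len(prefix):])
    return None

# Hand-written fragment for illustration only (not a real model file)
sample = 'tree\nend of trees\n\npandas_categorical:[[70, 71, 72], ["a", "b"]]\n'

categories = read_pandas_categorical(sample)
# One inner list per column that carried the pandas category data type
print(len(categories))
```

If the number printed here is smaller than the number of features declared as categorical during training, the file is in the conflicting state described above.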
Your guess may be right! My model has more than 3 categorical features, but I converted only two of them to the "pandas category" type, because the others were already of int type. So I guess that if I converted all of the categorical features to "pandas category", this failure wouldn't happen.
I solved this issue by encoding these "pandas category" columns into int type (hash).
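A rough sketch of this kind of workaround is shown below. It is an assumption about what such an encoding could look like, not the commenter's actual code: it uses pandas' built-in integer codes (`.cat.codes`) rather than a hash, and the helper name `categories_to_codes` and the sample data are hypothetical.

```python
import pandas

def categories_to_codes(df):
    """Return a copy of df with every category column replaced by its integer codes."""
    out = df.copy()
    for name in out.columns:
        if isinstance(out[name].dtype, pandas.CategoricalDtype):
            # .cat.codes maps the categories to 0 .. n-1; missing values become -1
            out[name] = out[name].cat.codes.astype("int32")
    return out

# Hand-made data for illustration
df = pandas.DataFrame({
    "color": pandas.Series(["red", "blue", "red"], dtype="category"),
    "doors": [2, 4, 4],
})

encoded = categories_to_codes(df)
print(encoded.dtypes)
```

With this approach no column carries the category data type any more, so every categorical feature is a plain integer. The same category-to-integer mapping has to be applied to the data at prediction time.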
That's great feedback!
It would be prudent to always encode all categorical columns using the pandas.CategoricalDtype data type. I believe that this happens semi-automatically in the "Scikit-Learn pipeline" approach, but needs to be remembered and done manually in the "direct LightGBM" approach.
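For the "direct LightGBM" approach, the manual cast could look roughly like the sketch below. The helper name `cast_to_category` and the sample rows are hypothetical; only `DataFrame.astype("category")` is standard pandas.

```python
import pandas

def cast_to_category(df, columns):
    """Return a copy of df in which every listed column has the category data type."""
    out = df.copy()
    for name in columns:
        out[name] = out[name].astype("category")
    return out

# Hand-made rows for illustration
X = pandas.DataFrame({
    "origin": [1, 2, 3, 1],
    "model_year": [70, 71, 70, 72],
    "cylinders": [4, 6, 8, 4],
})
categorical_columns = ["origin", "model_year", "cylinders"]

X_cat = cast_to_category(X, categorical_columns)
print(X_cat.dtypes)

# X_cat can then be passed on, for example:
# lightgbm.Dataset(X_cat, label = y, categorical_feature = categorical_columns)
```

Casting every declared column up front means that each of them should get its own entry in the pandas_categorical section.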
Anyway, the JPMML-LightGBM converter should contain a handler for such a scenario, and do the following:
- [...] pandas_categorical section might be incomplete, please make sure that all categorical columns have been cast into the pandas.CategoricalDtype data type".
- [...] int data type).

Will close this issue with appropriate code change(s) when done.
thank you so much!
Hello, your email has been received, thank you~
This IllegalArgumentException indicates a conflict between the header and pandas_categorical sections of a LightGBM file. The conflict could be characterized as "the pandas_categorical section appears incomplete". It typically cannot be corrected on-the-fly by the JPMML-LightGBM library.

The conflict is characteristic only of LightGBM models that were trained using the Learning API. It indicates a combined data preparation/model configuration error, where some of the columns that were marked as categorical using the categorical_feature attribute use a data type other than pandas.CategoricalDtype.

In the following example, the categorical_feature attribute suggests that the first three columns should be interpreted as categorical. However, two of them carry the int data type (rather than the category data type), which means that they do not get included in the pandas_categorical section (it contains one entry instead of three entries).
import lightgbm
import pandas

df = pandas.read_csv("Auto.csv")

# Take an explicit copy, so that the column assignments below do not raise SettingWithCopyWarning
X = df[["origin", "model_year", "cylinders"]].copy()
y = df["mpg"]

# Cast two columns to integer, and only one to Pandas' categorical
X["origin"] = X["origin"].astype(int)
X["model_year"] = X["model_year"].astype("category")
X["cylinders"] = X["cylinders"].astype(int)

# Declare all three columns as categorical
ds = lightgbm.Dataset(X, label = y, categorical_feature = [0, 1, 2])

params = {
    "objective" : "regression",
    "max_depth" : 3
}

booster = lightgbm.train(params, train_set = ds, num_boost_round = 11)
booster.save_model("Auto.txt")
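A mismatch of this kind can be caught before training with a small pre-flight check. The sketch below is only an illustration: `find_uncast_categoricals` is a hypothetical helper (it is not part of LightGBM or JPMML-LightGBM), and the data frame is hand-made to mirror the column data types of the example above.

```python
import pandas

def find_uncast_categoricals(X, categorical_feature):
    """Return the names of columns declared categorical whose data type is not category."""
    uncast = []
    for ref in categorical_feature:
        # Accept either positional indices or column names
        name = X.columns[ref] if isinstance(ref, int) else ref
        if not isinstance(X[name].dtype, pandas.CategoricalDtype):
            uncast.append(name)
    return uncast

# Hand-made rows with the same data types as in the example above
X = pandas.DataFrame({
    "origin": [1, 2, 3, 1],
    "model_year": pandas.Series([70, 71, 70, 72], dtype="category"),
    "cylinders": [4, 6, 8, 4],
})

uncast = find_uncast_categoricals(X, [0, 1, 2])
print(uncast)
```

A non-empty result means that the saved model file would end up with fewer pandas_categorical entries than declared categorical features.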
See https://github.com/jpmml/jpmml-lightgbm/issues/22#issuecomment-1824377083