Error on a pipeline with OneHotEncoder and xgboost

Hao-Jiang commented 2 years ago

Hello,

I trained a PMMLPipeline with OneHotEncoder and XGBClassifier using the following code snippet.

from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder
from sklearn2pmml import sklearn2pmml, PMMLPipeline
from xgboost.sklearn import XGBClassifier

mapper = DataFrameMapper(
    [(col, None) for col in numerical_cols] +
    [([col], OneHotEncoder(handle_unknown='ignore')) for col in categorical_cols]
)

pipeline = PMMLPipeline(
    steps=[
        ('mapper', mapper),
        ('classifier', XGBClassifier())
    ]
)

pipeline.fit(X, y)

The pipeline seemed to work and I was able to use it to do predictions. But I got an error when I tried to turn the pipeline into a pmml file using sklearn2pmml(pipeline, "testing.pmml", with_repr=True):

Standard error:
Exception in thread "main" org.jpmml.model.MissingAttributeException: Required attribute Value@value is not defined
    at org.dmg.pmml.Value.requireValue(Value.java:67)
    at org.jpmml.converter.PMMLUtil.getValues(PMMLUtil.java:139)
    at org.jpmml.converter.PMMLUtil.getValues(PMMLUtil.java:124)
    at org.jpmml.converter.CategoricalFeature.<init>(CategoricalFeature.java:35)
    at org.jpmml.converter.WildcardFeature.toCategoricalFeature(WildcardFeature.java:61)
    at sklearn.preprocessing.MultiOneHotEncoder.encodeFeatures(MultiOneHotEncoder.java:118)
    at sklearn.Transformer.encode(Transformer.java:69)
    at sklearn_pandas.DataFrameMapper.encodeFeatures(DataFrameMapper.java:67)
    at sklearn.Transformer.encode(Transformer.java:69)
    at sklearn.Composite.encodeFeatures(Composite.java:119)
    at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:212)
    at com.sklearn2pmml.Main.run(Main.java:84)
    at com.sklearn2pmml.Main.main(Main.java:62)

Can someone give me some advice on what I might have done wrong? Thanks.

vruusmann commented 2 years ago

> I trained a PMMLPipeline with OneHotEncoder and XGBClassifier using the following code snippet.

First of all - what is your XGBoost package version?

If you upgrade to XGBoost 1.5.X or newer, you will be able to use XGBoost's new native One-Hot-Encoding (OHE) support. It's much more memory-efficient than an external OneHotEncoder step, especially with sparse features.

Even better, consider upgrading to XGBoost 1.6.X or newer, which adds native multi-category categorical splits.

So, please upgrade your XGBoost package (and the SkLearn2PMML package as well!) to the latest, and simplify your Scikit-Learn pipeline to the following:

mapper = DataFrameMapper(
    [(col, None) for col in numerical_cols] +
    [([col], None) for col in categorical_cols]
)
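A caveat on the simplified mapper above: XGBoost's native categorical handling is documented to apply to columns of pandas `category` dtype, with the estimator created using `enable_categorical=True`. The sketch below covers only the dtype preparation. The toy DataFrame and column lists are made up for illustration, and whether the dtype survives the DataFrameMapper step depends on how the mapper is configured (by default it returns a NumPy array), which is not verified here.

```python
import pandas as pd

# Toy data standing in for the real X (hypothetical columns)
X = pd.DataFrame({
    "age": [25, 32, 47],
    "city": ["Tallinn", "Tartu", "Tallinn"],
})
numerical_cols = ["age"]
categorical_cols = ["city"]

# Cast only the categorical columns; astype() returns a new DataFrame
X_cat = X.astype({col: "category" for col in categorical_cols})

print(X_cat.dtypes)

# Assumption: the classifier would then be created along the lines of
# XGBClassifier(enable_categorical=True, tree_method="hist")
```

Numeric columns are left untouched, so the same `numerical_cols` and `categorical_cols` lists can still drive the mapper.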

> The pipeline seemed to work and I was able to use it to do predictions.

Just a side note - Scikit-Learn is willing to fit all kinds of pipelines, without checking whether the sequence of computational steps makes any sense. As long as the number of columns matches, you'll be getting predictions.

However, the Scikit-Learn to PMML converter tries to understand the logic of each computational step. Therefore, if something does not make sense to it, it'll complain (e.g. by raising an exception). You should heed those complaints, and try to make your pipeline more information-rich.

> I got an error when I tried to turn the pipeline into a pmml file

> Exception in thread "main" org.jpmml.model.MissingAttributeException: Required attribute Value@value is not defined
>   at org.dmg.pmml.Value.requireValue(Value.java:67)

Looks like the converter was unable to figure out the list of category values for some categorical feature.

Internal note - it's interesting that the converter is complaining about a missing DataField/Value@value attribute, and not about a missing DataField/Value element itself.

Could it be that your dataset contains a column with a None or float("NaN") category level? This seems like one plausible scenario in which there can be a DataField/Value element whose @value attribute has been omitted (filtered out as a placeholder for a missing value).
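One way to test this hypothesis is to count missing category levels per categorical column before fitting. The following is a minimal sketch and not part of SkLearn2PMML: the helper name `count_missing_levels` and the toy DataFrame are made up for illustration, and `categorical_cols` stands in for the list used in the original snippet.

```python
import pandas as pd

def count_missing_levels(df, categorical_cols):
    """Count None/NaN cells per categorical column (hypothetical helper)."""
    # isna() flags both None and float("NaN") as missing
    counts = df[categorical_cols].isna().sum()
    # Report only the columns that actually contain missing levels
    return {col: int(n) for col, n in counts.items() if n > 0}

# Toy data standing in for the real X
X = pd.DataFrame({
    "color": ["red", None, "blue", "red"],
    "size": ["S", "M", float("nan"), None],
    "shape": ["round", "square", "round", "round"],
})
categorical_cols = ["color", "size", "shape"]

missing = count_missing_levels(X, categorical_cols)
print(missing)
```

Any column reported here is a candidate for the offending feature. Imputing or dropping the affected rows before calling pipeline.fit(X, y) would be one way to check whether the exception goes away.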

You can make your pipeline more robust by collecting and storing category values using SkLearn2PMML domain decorator classes:

from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain

mapper = DataFrameMapper(
    [(col, ContinuousDomain()) for col in numerical_cols] +
    [([col], CategoricalDomain()) for col in categorical_cols]
)

At minimum, this should give you a different, more informative error.

vruusmann commented 2 years ago

Leaving this issue open as a reminder to improve error diagnostics in this area.

The current Java exception is devoid of any debugging information, because it is raised for a condition that is never supposed to trigger (a required attribute has not been set in the JPMML-Converter library stack).

jpmml / jpmml-converter

Error on a pipeline with OneHotEncoder and xgboost #22