Open Hao-Jiang opened 2 years ago
I trained a PMMLPipeline with OneHotEncoder and XGBClassifier using the following code snippet.
First of all - what is your XGBoost package version?
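A quick way to answer the version question is to read the installed package metadata. The snippet below is a minimal sketch that uses only the standard library; the helper names `installed_major_minor` and `native_categorical_support` are made up for illustration, and the version thresholds simply restate the 1.5.X and 1.6.X cut-offs mentioned in this reply.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def installed_major_minor(package):
    """Return (major, minor) of an installed package, or None if unavailable."""
    try:
        raw = version(package)
    except PackageNotFoundError:
        return None
    match = re.match(r"(\d+)\.(\d+)", raw)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def native_categorical_support(major_minor):
    """Map an XGBoost (major, minor) pair to the feature level named in this thread."""
    if major_minor is None:
        return "unknown or not installed"
    if major_minor >= (1, 6):
        return "native multi-category splits"
    if major_minor >= (1, 5):
        return "native one-hot encoding"
    return "external OneHotEncoder needed"


# Report what is installed in the current environment.
for name in ("xgboost", "sklearn2pmml"):
    print(name, installed_major_minor(name))
print(native_categorical_support(installed_major_minor("xgboost")))
```

Running this in the same environment that trains the pipeline shows whether an upgrade is needed before trying the simplified pipeline.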
If you upgrade to XGBoost 1.5.X or newer, you should be able to use XGBoost's native One-Hot-Encoding (OHE) support. It is much more memory-efficient than an external OneHotEncoder step, especially when dealing with sparse features.
Even better, consider upgrading to XGBoost 1.6.X or newer, which should let you use XGBoost's native multi-category categorical splits.
So, please upgrade your XGBoost package (and the SkLearn2PMML package as well!) to the latest, and simplify your Scikit-Learn pipeline to the following:
from sklearn_pandas import DataFrameMapper
mapper = DataFrameMapper(
    [(col, None) for col in numerical_cols] +
    [([col], None) for col in categorical_cols]
)
The pipeline seemed to work and I was able to use it to do predictions.
Just a side note - Scikit-Learn is willing to fit all kinds of pipelines without checking whether the sequence of computational steps makes any sense. As long as the number of columns is right, you'll get predictions.
However, the Scikit-Learn to PMML converter tries to understand the logic of each computational step. Therefore, if something does not make sense to it, it will complain (e.g. by raising an exception). You should heed those complaints, and try to make your pipeline more information-rich.
I got an error when I tried to turn the pipeline into a pmml file
Exception in thread "main" org.jpmml.model.MissingAttributeException: Required attribute Value@value is not defined
    at org.dmg.pmml.Value.requireValue(Value.java:67)
Looks like the converter was unable to figure out the list of category values for some categorical feature.
Internal note - it's interesting that the converter is complaining about a missing DataField/Value@value attribute, and not about a missing DataField/Value element itself.
Could it be that your dataset contains a column with a None or float("NaN") category level? This seems like one plausible scenario in which there can be a DataField/Value element whose @value attribute has been omitted (filtered out as a placeholder for a missing value).
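One way to test this hypothesis before re-running the converter is to count the missing entries in each categorical column. The following is a minimal sketch in plain Python; the function names and the sample data are invented for illustration, and the columns are passed in as a plain dictionary rather than as the actual training DataFrame.

```python
import math


def is_missing(value):
    """True for None and for float NaN, the two cases suspected above."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def columns_with_missing_levels(columns):
    """Count missing entries per column.

    `columns` maps a column name to an iterable of raw values.
    Only columns with at least one missing entry are reported.
    """
    report = {}
    for name, values in columns.items():
        count = sum(1 for v in values if is_missing(v))
        if count:
            report[name] = count
    return report


# Hypothetical sample data standing in for the categorical columns.
sample = {
    "color": ["red", None, "blue"],
    "size": ["S", "M", float("nan")],
    "shape": ["round", "square", "round"],
}
print(columns_with_missing_levels(sample))  # {'color': 1, 'size': 1}
```

With a pandas DataFrame, `df[categorical_cols].isna().sum()` should give the same per-column counts. Any column that shows up here is a candidate for the omitted @value attribute.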
You can make your pipeline more robust by collecting and storing category values using SkLearn2PMML domain decorator classes:
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
mapper = DataFrameMapper(
    [(col, ContinuousDomain()) for col in numerical_cols] +
    [([col], CategoricalDomain()) for col in categorical_cols]
)
At minimum, this should give you a different, more informative error.
Leaving this issue open as a reminder to improve error diagnostics in this area.
The current Java exception is devoid of any debugging information, because it is raised for a condition that is supposed to never trigger (a required attribute has not been set somewhere in the JPMML-Converter library stack).
Hello,
I trained a PMMLPipeline with OneHotEncoder and XGBClassifier using the following code snippet.
The pipeline seemed to work and I was able to use it to do predictions. But I got an error when I tried to turn the pipeline into a PMML file:
sklearn2pmml(pipeline, "testing.pmml", with_repr=True)
Can someone give me some advice on what I might have done wrong? Thanks.