jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

XGBClassifier wrapper failing to convert #16

Closed keithgw closed 8 years ago

keithgw commented 8 years ago

When using the xgboost.XGBClassifier wrapper, the estimator fails to convert. I get the following error:

Aug 29, 2016 4:21:50 PM org.jpmml.sklearn.Main run
INFO: Parsing DataFrameMapper PKL..
Aug 29, 2016 4:21:50 PM org.jpmml.sklearn.Main run
INFO: Parsed DataFrameMapper PKL in 31 ms.
Aug 29, 2016 4:21:50 PM org.jpmml.sklearn.Main run
INFO: Converting DataFrameMapper..
Aug 29, 2016 4:21:50 PM org.jpmml.sklearn.Main run
INFO: Converted DataFrameMapper in 27 ms.
Aug 29, 2016 4:21:50 PM org.jpmml.sklearn.Main run
INFO: Parsing Estimator PKL..
Aug 29, 2016 4:21:50 PM org.jpmml.sklearn.Main run
INFO: Parsed Estimator PKL in 5 ms.
Aug 29, 2016 4:21:50 PM org.jpmml.sklearn.Main run
INFO: Converting Estimator..
Aug 29, 2016 4:21:50 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert Estimator
java.lang.ClassCastException: numpy.core.NDArray cannot be cast to java.util.List
        at xgboost.sklearn.XGBClassifier.getClasses(XGBClassifier.java:55)
        at sklearn.Classifier.createSchema(Classifier.java:43)
        at sklearn.EstimatorUtil.encodePMML(EstimatorUtil.java:47)
        at org.jpmml.sklearn.Main.run(Main.java:189)
        at org.jpmml.sklearn.Main.main(Main.java:107)

Exception in thread "main" java.lang.ClassCastException: numpy.core.NDArray cannot be cast to java.util.List
        at xgboost.sklearn.XGBClassifier.getClasses(XGBClassifier.java:55)
        at sklearn.Classifier.createSchema(Classifier.java:43)
        at sklearn.EstimatorUtil.encodePMML(EstimatorUtil.java:47)
        at org.jpmml.sklearn.Main.run(Main.java:189)
        at org.jpmml.sklearn.Main.main(Main.java:107)

My Version Info

numpy 1.11.1
pandas 0.18.1
xgboost 0.6
sklearn 0.17.1
joblib 0.10.0
java 1.8.0_91
python 2.7.10

Example

import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
from sklearn_pandas import DataFrameMapper
from sklearn.datasets import load_iris
from sklearn2pmml.decoration import ContinuousDomain
from sklearn2pmml import sklearn2pmml

iris = load_iris()

iris_df = pd.concat((pd.DataFrame(iris.data[:, :], columns = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]), pd.DataFrame(iris.target, columns = ["Species"])), axis = 1)
# change to binary classification problem
iris_df = iris_df[iris_df['Species'] > 0]

# EDIT, not included in original example
iris_mapper = DataFrameMapper([
    (["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], [ContinuousDomain(), StandardScaler()]),
    ("Species", None)
])

iris = iris_mapper.fit_transform(iris_df)
iris_X = iris[:, 0:4]
iris_y = iris[:, 4]

iris_clf = XGBClassifier()
iris_clf.fit(iris_X, iris_y)

sklearn2pmml(estimator = iris_clf, mapper = iris_mapper, pmml = "code_output/irisXGB.pmml", with_repr = True)

vruusmann commented 8 years ago

The converter assumes that class labels are of string datatype.

As a temporary workaround, can you make the example code work if you convert the target column from boolean datatype to string datatype?

Something like this:

iris_df["Species"] = iris_df["Species"].astype(str)

keithgw commented 8 years ago

Same error:

iris_y = iris[:, 4].astype(str)
iris_y.dtype # dtype('S1')

Error

CalledProcessError                        Traceback (most recent call last)
<ipython-input-203-1868a03b599c> in <module>()
----> 1 sklearn2pmml(estimator = iris_clf, mapper = iris_mapper, pmml = "code_output/irisXGB.pmml", with_repr = True)

/Users/kwilliams/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/__init__.pyc in sklearn2pmml(estimator, mapper, pmml, with_repr, debug)
     63                 if(debug):
     64                         print(" ".join(cmd))
---> 65                 subprocess.check_call(cmd)
     66         finally:
     67                 if(debug):

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.pyc in check_call(*popenargs, **kwargs)
    538         if cmd is None:
    539             cmd = popenargs[0]
--> 540         raise CalledProcessError(retcode, cmd)
    541     return 0
    542 
SEVERE: Failed to convert Estimator
java.lang.ClassCastException: numpy.core.NDArray cannot be cast to java.util.List
        at xgboost.sklearn.XGBClassifier.getClasses(XGBClassifier.java:55)
        at sklearn.Classifier.createSchema(Classifier.java:43)
        at sklearn.EstimatorUtil.encodePMML(EstimatorUtil.java:47)
        at org.jpmml.sklearn.Main.run(Main.java:189)
        at org.jpmml.sklearn.Main.main(Main.java:107)

Exception in thread "main" java.lang.ClassCastException: numpy.core.NDArray cannot be cast to java.util.List
        at xgboost.sklearn.XGBClassifier.getClasses(XGBClassifier.java:55)
        at sklearn.Classifier.createSchema(Classifier.java:43)
        at sklearn.EstimatorUtil.encodePMML(EstimatorUtil.java:47)
        at org.jpmml.sklearn.Main.run(Main.java:189)
        at org.jpmml.sklearn.Main.main(Main.java:107)

Also tried iris_df["Species"] = iris_df["Species"].astype(str)

vruusmann commented 8 years ago

Your example script is missing the definition of iris_mapper. So, I used the following one:

iris_mapper = DataFrameMapper([
    (["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], [ContinuousDomain(), StandardScaler()]),
    ("Species", None)
])

After that, everything works fine on my computer (versions printed using sklearn2pmml(debug = True)):

python 2.7.11
sklearn 0.17.1
sklearn.externals.joblib 0.9.4
sklearn_pandas 1.1.0
sklearn2pmml 0.9.7
xgboost 0.4

Perhaps they've changed XGBoost serialization functionality between 0.4 and 0.6 versions.
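
Since the two environments in this thread differ mainly in package versions, it may help to collect them programmatically before filing a report. The following is a minimal sketch, not part of sklearn2pmml; the helper name report_versions is hypothetical, and it assumes each package exposes a __version__ attribute (missing packages or attributes are reported as None).

```python
import importlib


def report_versions(module_names):
    """Return a dict mapping each module name to its version string, or None."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The package is not installed in this environment.
            versions[name] = None
            continue
        # Most packages expose __version__; fall back to None if it is absent.
        versions[name] = getattr(module, "__version__", None)
    return versions
```

Calling report_versions(["numpy", "pandas", "sklearn", "sklearn_pandas", "sklearn2pmml", "xgboost"]) and pasting the result should make a 0.4 versus 0.6 mismatch easy to spot.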

keithgw commented 8 years ago

Yes, iris_mapper was not included in my example, but the exact one you suggested was in the notebook I used to produce the error. I will try with xgboost 0.4

vruusmann commented 8 years ago

The DataField element for the "Species" column looks like this in the resulting PMML file:

<DataField name="Species" optype="categorical" dataType="double">
    <Value value="1"/>
    <Value value="2"/>
</DataField>

Target category names "1" and "2" are not so intuitive.
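
One possible way to get more readable category names is to map the numeric labels to species names before fitting. This is only a sketch: the mapping assumes scikit-learn's iris encoding (0 = setosa, 1 = versicolor, 2 = virginica), the helper name to_species_names is hypothetical, and whether XGBClassifier and the converter accept string labels in the versions discussed here is not confirmed in this thread.

```python
# Assumed mapping, based on scikit-learn's iris target encoding.
# Class 0 (setosa) was filtered out in the example above.
SPECIES_NAMES = {1: "versicolor", 2: "virginica"}


def to_species_names(labels, names=SPECIES_NAMES):
    """Map numeric class labels (1, 2 or 1.0, 2.0) to species name strings."""
    return [names[int(label)] for label in labels]
```

For example, to_species_names(iris_df["Species"]) would produce a list of names that could replace the numeric column.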

keithgw commented 8 years ago

I just reproduced your result by using xgboost 0.4a30 version instead of 0.6, and was able to successfully build the pmml file.

damienrj commented 8 years ago

Hello, I am having a similar issue and get the following error when I use his example code. My XGBoost version is 0.4a30; is this something that will be fixed if we upgrade XGBoost?

Sep 19, 2016 10:33:09 AM org.jpmml.sklearn.Main run
INFO: Converting Estimator..
Sep 19, 2016 10:33:09 AM org.jpmml.sklearn.Main run
SEVERE: Failed to convert Estimator
java.lang.RuntimeException: java.io.IOException
        at xgboost.sklearn.Booster.loadLearner(Booster.java:53)
        at xgboost.sklearn.Booster.getLearner(Booster.java:41)
        at xgboost.sklearn.BoosterUtil.getNumberOfFeatures(BoosterUtil.java:35)
        at xgboost.sklearn.XGBClassifier.getNumberOfFeatures(XGBClassifier.java:38)
        at sklearn.Classifier.createSchema(Classifier.java:59)
        at sklearn.EstimatorUtil.encodePMML(EstimatorUtil.java:47)
        at org.jpmml.sklearn.Main.run(Main.java:189)
        at org.jpmml.sklearn.Main.main(Main.java:107)
Caused by: java.io.IOException
        at org.jpmml.xgboost.XGBoostDataInput.readReserved(XGBoostDataInput.java:68)
        at org.jpmml.xgboost.GBTree.load(GBTree.java:61)
        at org.jpmml.xgboost.Learner.load(Learner.java:92)
        at org.jpmml.xgboost.XGBoostUtil.loadLearner(XGBoostUtil.java:34)
        at xgboost.sklearn.Booster.loadLearner(Booster.java:51)
        ... 7 more

Exception in thread "main" java.lang.RuntimeException: java.io.IOException
        at xgboost.sklearn.Booster.loadLearner(Booster.java:53)
        at xgboost.sklearn.Booster.getLearner(Booster.java:41)
        at xgboost.sklearn.BoosterUtil.getNumberOfFeatures(BoosterUtil.java:35)
        at xgboost.sklearn.XGBClassifier.getNumberOfFeatures(XGBClassifier.java:38)
        at sklearn.Classifier.createSchema(Classifier.java:59)
        at sklearn.EstimatorUtil.encodePMML(EstimatorUtil.java:47)
        at org.jpmml.sklearn.Main.run(Main.java:189)
        at org.jpmml.sklearn.Main.main(Main.java:107)
Caused by: java.io.IOException
        at org.jpmml.xgboost.XGBoostDataInput.readReserved(XGBoostDataInput.java:68)
        at org.jpmml.xgboost.GBTree.load(GBTree.java:61)
        at org.jpmml.xgboost.Learner.load(Learner.java:92)
        at org.jpmml.xgboost.XGBoostUtil.loadLearner(XGBoostUtil.java:34)
        at xgboost.sklearn.Booster.loadLearner(Booster.java:51)
        ... 7 more

vruusmann commented 8 years ago

I've tested both XGBoost 0.4 and 0.6, and I cannot reproduce this exception (i.e. a java.io.IOException signalling that the Booster binary object contains non-zero bytes in the "reserved" area). Maybe it's an architecture/OS issue (I'm on 64-bit GNU/Linux).

You would need to provide a Booster file that I could study locally.
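
To illustrate the kind of check described above, here is a rough Python sketch of a reader that rejects non-zero values in a reserved region. It is not the jpmml-xgboost implementation: the function name, the 32-bit little-endian layout, and the field count are all assumptions, and the actual offsets of the reserved area in a Booster file are not given in this thread.

```python
import io
import struct


def read_reserved(stream, count):
    """Read `count` 32-bit little-endian ints and require all of them to be zero."""
    for index in range(count):
        raw = stream.read(4)
        if len(raw) != 4:
            raise IOError("unexpected end of data in reserved field %d" % index)
        (value,) = struct.unpack("<i", raw)
        if value != 0:
            # A non-zero value suggests the data does not match the expected layout.
            raise IOError("non-zero reserved field %d: %d" % (index, value))
```

A file written by a different XGBoost version or on a different platform could trip such a check if its layout differs from what the reader expects, which would be consistent with the behaviour reported here.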

damienrj commented 8 years ago

I just got xgboost installed on OS X and it appeared to work. The server I am running the code on is 64-bit CentOS. I am happy to send the Booster binary objects; where are they located?

vruusmann commented 8 years ago

If you're using sklearn2pmml package version 0.9.7 or newer, then simply activate the debug option:

sklearn2pmml(estimator, mapper, debug = True)

The converter will then preserve temporary joblib dump files. Attach them here (or if GitHub won't let you do that for "security reasons", send to my e-mail).

damienrj commented 8 years ago

I believe these are the files you want. Btw, thank you for responding so quickly!

output.zip

damienrj commented 8 years ago

With the update to jpmml-xgboost, it looks like it works with the R script and with a simple Python version I made. It doesn't appear to work yet with XGBClassifier, but I can just write a function to generate the feature map and use jpmml-xgboost directly. I will give it a try with my full-size models. Thanks for the help!
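
A function for generating the feature map might look like the sketch below. It assumes the plain-text XGBoost feature map layout of one tab-separated "index, name, type" line per feature, with "q" marking a quantitative feature; the function name write_feature_map and the output path are hypothetical.

```python
def write_feature_map(feature_names, path, feature_type="q"):
    """Write one tab-separated 'index, name, type' line per feature to `path`."""
    with open(path, "w") as handle:
        for index, name in enumerate(feature_names):
            handle.write("%d\t%s\t%s\n" % (index, name, feature_type))
```

For the iris example, write_feature_map(["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], "iris.fmap") would write four lines, which could then be passed to jpmml-xgboost together with the saved model file.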

damienrj commented 7 years ago

Just a follow-up: once I was able to upgrade XGBoost to the current version (0.6), which required building some new compilers, everything worked without issues.