jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Convert Model with categorical features to PMML #8

Closed CodeKiller48 closed 8 years ago

CodeKiller48 commented 8 years ago

I used LabelBinarizer to convert categorical features to dummy variables, and trained a GBM model. However, when converting the trained model and the DataFrameMapper to PMML, the Java call failed with a CalledProcessError. Would you mind having a look at the issue? Thanks.

DataFrameMapper step

cat = [feature_names[i] for i in categorical_features]
num = [feature_names[i] for i in range(15) if i not in categorical_features]

transform = [(column, None) if column in num else (column, sklearn.preprocessing.LabelBinarizer()) for column in train_dt.columns]
from sklearn_pandas import DataFrameMapper
mapper = DataFrameMapper(transform)
train_array = mapper.fit_transform(train_dt)

model training step

colNum = train_array.shape[1]
from sklearn.ensemble import GradientBoostingClassifier
gbtree = GradientBoostingClassifier(random_state=10)
gbtree.fit(train_array[:,0:colNum-1], train_array[:,colNum-1])

convert to PMML

from sklearn2pmml import sklearn2pmml
sklearn2pmml(gbtree, mapper, "testLabelBinarizer.pmml", with_repr = True)

error: CalledProcessError: Command '['java', '-cp', '/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/guava-19.0.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jcommander-1.48.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jpmml-converter-1.0.3.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jpmml-sklearn-1.0-SNAPSHOT.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pmml-agent-1.2.11.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pmml-model-1.2.11.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pmml-model-metro-1.2.11.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pmml-schema-1.2.11.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pyrolite-4.10.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/serpent-1.12.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/slf4j-api-1.7.18.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.18.jar', 'org.jpmml.sklearn.Main', '--pkl-estimator-input', '/var/folders/_3/rcqhhsv17jlc9wk_83zvrjqw0000gn/T/tmpAuhOT9.pkl', '--repr-estimator', "GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',\n max_depth=3, max_features=None, max_leaf_nodes=None,\n min_samples_leaf=1, min_samples_split=2,\n min_weight_fraction_leaf=0.0, n_estimators=100,\n presort='auto', random_state=10, subsample=1.0, verbose=0,\n warm_start=False)", '--pkl-mapper-input', '/var/folders/_3/rcqhhsv17jlc9wk_83zvrjqw0000gn/T/tmpe7k2oy.pkl', '--repr-mapper', 
"DataFrameMapper(features=[('Age', None), ('Workclass', LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)), ('fnlwgt', None), ('Education', LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)), ('Education-Num', None), ('Marital Status', LabelBinarizer(neg_label=0, pos_label=1, sparse_output=Fa...None), ('Country', LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)), ('Labels', None)],\n sparse=False)", '--pmml-output', 'testLabelBinarizer.pmml']' returned non-zero exit status 1

vruusmann commented 8 years ago

The CalledProcessError exception is raised by Python's subprocess.check_call function call, because the underlying Java application returned a status code that indicates a failure (also communicated by the message "returned non-zero exit status 1").
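The mechanism described above can be reproduced without Java. The following is a minimal sketch, not the code that sklearn2pmml runs: the child process is a stand-in Python one-liner that exits with status 1, and `run_and_capture` is a hypothetical helper name. It shows that `subprocess.check_call` only reports the exit status, while capturing stderr exposes the child's own error message.

```python
import subprocess
import sys

# Stand-in for the Java converter (an assumption for illustration):
# a child process that writes an error to stderr and exits with status 1.
FAILING_CMD = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('simulated converter failure\\n'); sys.exit(1)",
]


def run_and_capture(cmd):
    """Run cmd and return (exit status, captured stderr text)."""
    result = subprocess.run(cmd, stderr=subprocess.PIPE, universal_newlines=True)
    return result.returncode, result.stderr


# check_call raises CalledProcessError on any non-zero exit status;
# the exception carries the status and the command, not the child's stderr.
try:
    subprocess.check_call(FAILING_CMD)
except subprocess.CalledProcessError as err:
    print("check_call failed with exit status", err.returncode)

# Capturing stderr explicitly reveals the underlying error message.
status, stderr_text = run_and_capture(FAILING_CMD)
print(status, stderr_text.strip())
```

When the traceback alone is not informative, re-running the reported `java` command in a terminal has the same effect: the Java stack trace is printed directly.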

This report does not reveal the Java application error. However, when I run the sample Python script that you provided to me privately, then I can see the following Java exception:

Exception in thread "main" java.util.NoSuchElementException
        at java.util.ArrayList$Itr.next(ArrayList.java:854)
        at sklearn_pandas.DataFrameMapper.updateDataDictionary(DataFrameMapper.java:420)
        at sklearn_pandas.DataFrameMapper.updatePMML(DataFrameMapper.java:206)
        at org.jpmml.sklearn.Main.run(Main.java:139)
        at org.jpmml.sklearn.Main.main(Main.java:102)

It means that your DataFrameMapper and GradientBoostingClassifier objects are not in sync - the former contains fewer feature definitions than the latter expects. So, you need to fix the way the DataFrameMapper object is constructed in your Python code.

According to log messages, the GradientBoostingClassifier expects to find 106 feature definitions (INFO: Updating 1 target field and 106 active field(s)), but the DataFrameMapper only provides 65 feature definitions (the exception occurs right after INFO: Mapping active field(s) [x66] to [Hours per week]).

CodeKiller48 commented 8 years ago

Hi Villu,

Thanks for your quick response. We agree that the DataFrameMapper and GradientBoostingClassifier objects are not in sync. Do you have any idea how to fix the DataFrameMapper I describe above (converting categorical features to dummy variables)? I am confused: after transforming the data with the DataFrameMapper, I actually got an array with 107 columns, so why does the exception say that the DataFrameMapper provides only 65 features?

Actually, I thought converting categorical features to dummy variables was a very common practice. Do you have any example that successfully converts a model (any model) with categorical features to PMML?

Thanks a lot and look forward to your response.

vruusmann commented 8 years ago

You should take a look at my Python script main.py which is responsible for generating pickle files for integration testing. Specifically, it contains six examples about constructing new DataFrameMapper objects, and two of them (audit_mapper and auto_mapper) are dealing with converting categorical variables to continuous.

The idea is that when the number of columns is relatively low, then you should specify them manually (instead of attempting Python list magic). Once you get the simple thing working, only then start introducing more complex constructs:

my_mapper = DataFrameMapper([
  ("cat_col_1", OneHotEncoder()),
  ("bin_col_2", LabelBinarizer()),
  ("target", None)
])

CodeKiller48 commented 8 years ago

Hi Villu,

Thanks a lot for your help. Now I see what was going wrong. If a categorical feature has only two levels, I should use LabelEncoder instead of LabelBinarizer.

Wei
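The two-level case can be checked directly in scikit-learn. This is a minimal sketch on made-up data (the level names below are illustrative assumptions, not values taken from the reported dataset): `LabelBinarizer` emits one column per level when there are three or more levels, but only a single column when there are exactly two, and `LabelEncoder` returns a one-dimensional array of integer codes.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

# Illustrative data only; the level names are assumptions for this sketch.
two_levels = np.array(["Male", "Female", "Female", "Male"])
three_levels = np.array(["Private", "State-gov", "Self-emp", "Private"])

# Two levels: a single 0/1 column rather than two dummy columns.
binary_shape = LabelBinarizer().fit_transform(two_levels).shape
# Three levels: one dummy column per level.
multi_shape = LabelBinarizer().fit_transform(three_levels).shape
# LabelEncoder: one integer code per row, as a 1-D array.
encoded_shape = LabelEncoder().fit_transform(two_levels).shape

print(binary_shape, multi_shape, encoded_shape)
```

Comparing these shapes per column against the width of the array returned by `mapper.fit_transform` is one way to see which columns contribute fewer features than expected.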

MatiasSanchezCabrera commented 8 years ago

Thanks CodeKiller48 for pointing out your error. Had the same problem!

prateekpatelsc commented 7 years ago

Hi Villu

As CodeKiller48 pointed out, is this a special case? I ran into the same error:

SEVERE: Failed to convert Estimator
java.lang.IllegalArgumentException
        at org.jpmml.sklearn.FeatureMapper.updateActiveFields(FeatureMapper.java:236)
        at sklearn.Classifier.createSchema(Classifier.java:59)
        at sklearn.EstimatorUtil.encodePMML(EstimatorUtil.java:47)
        at org.jpmml.sklearn.Main.run(Main.java:189)
        at org.jpmml.sklearn.Main.main(Main.java:107)

Exception in thread "main" java.lang.IllegalArgumentException
        at org.jpmml.sklearn.FeatureMapper.updateActiveFields(FeatureMapper.java:236)
        at sklearn.Classifier.createSchema(Classifier.java:59)
        at sklearn.EstimatorUtil.encodePMML(EstimatorUtil.java:47)
        at org.jpmml.sklearn.Main.run(Main.java:189)
        at org.jpmml.sklearn.Main.main(Main.java:107)

On debugging further, I found that this happens only when a feature has two categories and LabelBinarizer outputs a column vector. I tried using [LabelBinarizer(), OneHotEncoder()] to get around it, but I ran into the following error:

SEVERE: Failed to parse DataFrameMapper PKL
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for dill.dill._load_type)
        at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
        at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
        at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
        at org.jpmml.sklearn.PickleUtil$1.dispatch(PickleUtil.java:238)
        at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
        at org.jpmml.sklearn.PickleUtil.unpickle(PickleUtil.java:259)
        at org.jpmml.sklearn.Main.run(Main.java:126)
        at org.jpmml.sklearn.Main.main(Main.java:107)

Exception in thread "main" net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for dill.dill._load_type)
        at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
        at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
        at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
        at org.jpmml.sklearn.PickleUtil$1.dispatch(PickleUtil.java:238)
        at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
        at org.jpmml.sklearn.PickleUtil.unpickle(PickleUtil.java:259)
        at org.jpmml.sklearn.Main.run(Main.java:126)
        at org.jpmml.sklearn.Main.main(Main.java:107)

vruusmann commented 7 years ago

@prateekpatelsc Your DataFrameMapper object contains Dill objects (i.e. something that is reconstructed using the dill.dill._load_type utility method). JPMML-SkLearn doesn't know about Dill objects (they appear to involve CPython classes) and is unable to unpickle them.

As a short-term solution, you should remove all Dill objects from your DataFrameMapper object, and retry the conversion.
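One way to check whether a pickle still references Dill before retrying is to scan its opcode stream with the standard library's `pickletools`. This is a rough sketch under stated assumptions: `find_module_references` is a hypothetical helper, it expects a plain pickle byte stream (files written by other tools, for example with compression, may not parse this way), and it is a crude text match that can report ordinary string data that happens to start with the module name.

```python
import pickle
import pickletools
from fractions import Fraction


def find_module_references(pickle_bytes, module_prefix):
    """Return (opcode name, argument) pairs whose string argument starts with module_prefix.

    Hypothetical helper for illustration. A GLOBAL opcode carries "module name"
    as one space-separated string; newer protocols push the module name as a
    separate string, so both forms are matched on the first token.
    """
    hits = []
    for opcode, arg, _pos in pickletools.genops(pickle_bytes):
        if not isinstance(arg, str):
            continue
        module = arg.split(" ")[0]
        if module == module_prefix or module.startswith(module_prefix + "."):
            hits.append((opcode.name, arg))
    return hits


# Demonstration on a stdlib object: its pickle refers to the "fractions"
# module and contains no reference to "dill".
data = pickle.dumps(Fraction(1, 2), protocol=2)
print(find_module_references(data, "fractions"))
print(find_module_references(data, "dill"))
```

For the case discussed here, the bytes would come from the pickled DataFrameMapper, and any hit for `"dill"` points at an object that needs to be replaced before conversion.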

As a long-term solution, please open a new JPMML-SkLearn issue about Dill support. I'm not familiar with this library, but a quick search shows that this is something other users might need as well. Be sure to share some Python demo code for your use case. For example, if I wanted to reproduce the above exception with the Iris dataset, what should I do?