jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Array attribute 'sklearn2pmml.PMMLPipeline.active_fields' contains an unsupported value #69

Closed mpeychev closed 6 years ago

mpeychev commented 6 years ago

Hi, I am trying to convert a scikit-learn random forest classifier to a PMML file, but I am getting the following exception:

$ java -jar ~/jpmml-sklearn/target/converter-executable-1.4-SNAPSHOT.jar --pkl-input pmml_pipeline.pkl.z --pmml-output pipeline.pmml
Feb 08, 2018 4:46:19 PM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Feb 08, 2018 4:46:21 PM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 2129 ms.
Feb 08, 2018 4:46:21 PM org.jpmml.sklearn.Main run
INFO: Converting..
Feb 08, 2018 4:46:21 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: Array attribute 'sklearn2pmml.PMMLPipeline.active_fields' contains an unsupported value (Java class java.lang.Integer)
    at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
    at com.google.common.collect.Lists$TransformingRandomAccessList$1.transform(Lists.java:638)
    at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
    at sklearn2pmml.PMMLPipeline.initFeatures(PMMLPipeline.java:339)
    at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:151)
    at org.jpmml.sklearn.Main.run(Main.java:145)
    at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast java.lang.Integer to java.lang.String
    at java.lang.Class.cast(Class.java:3186)
    at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:41)
    ... 6 more

Exception in thread "main" java.lang.IllegalArgumentException: Array attribute 'sklearn2pmml.PMMLPipeline.active_fields' contains an unsupported value (Java class java.lang.Integer)
    at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
    at com.google.common.collect.Lists$TransformingRandomAccessList$1.transform(Lists.java:638)
    at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
    at sklearn2pmml.PMMLPipeline.initFeatures(PMMLPipeline.java:339)
    at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:151)
    at org.jpmml.sklearn.Main.run(Main.java:145)
    at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast java.lang.Integer to java.lang.String
    at java.lang.Class.cast(Class.java:3186)
    at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:41)
    ... 6 more

The same issue occurred when I tried your other tool, sklearn2pmml. Do you have any suggestions as to what the problem might be?

Thank you!

mpeychev commented 6 years ago

As a follow-up, my PMMLPipeline is quite simple. I am producing the pickle file using the code below:

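import joblib
from sklearn2pmml.pipeline import PMMLPipeline

# learning_model is the random forest classifier being converted
pmml_pipeline = PMMLPipeline([
    ("classifier", learning_model)
])
pmml_pipeline.fit(X_train, Y_train)
pmml_pipeline.verify(X_train.sample(n=15))
joblib.dump(pmml_pipeline, "pmml_pipeline.pkl.z", compress=9)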

X_train and Y_train are pandas DataFrames. Am I missing any other pipeline stages that should be defined?

mpeychev commented 6 years ago

So I found that this issue happens when some of the features have numbers as names. Is this expected behaviour?

vruusmann commented 6 years ago

java.lang.IllegalArgumentException: Array attribute 'sklearn2pmml.PMMLPipeline.active_fields' contains an unsupported value (Java class java.lang.Integer)

As the exception message above suggests, the value of the PMMLPipeline.active_fields attribute must be a list of strings:

pipeline = PMMLPipeline(...)
pipeline.active_fields = ["1", "2", "3"] # YES!

If the list contains any non-string elements, then the converter fails:

pipeline.active_fields = [1, 2, 3] # NO-NO-NO!

> So I found that this issue happens when some of the features have numbers as names.

I didn't know that pandas.DataFrame supports non-string column names. If this is "official" behaviour, then I'll improve my code. If this is "unofficial" behaviour, then I'll close this issue as invalid.

In the meantime, simply convert your column names to strings.
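
A minimal sketch of that workaround, assuming the training data is a pandas DataFrame with integer column names. The frame and its values below are made up for illustration; only the renaming step matters, and it should happen before the pipeline is fitted:

```python
import pandas as pd

# Hypothetical training frame with integer column names, such as the
# ones pandas assigns when a DataFrame is built without explicit names.
X_train = pd.DataFrame(
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    columns=[1, 2, 3],
)

# List the column names that are not strings; these are the kind of
# values that the converter rejects in active_fields.
non_string = [c for c in X_train.columns if not isinstance(c, str)]

# Cast every column name to str before calling pipeline.fit(...).
X_train.columns = [str(c) for c in X_train.columns]
```

After this step, a pipeline fitted on X_train should see only string feature names.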

sirvp commented 2 years ago

I had a similar issue, but my training data is a vectorized sparse matrix, i.e. the output of a CountVectorizer. How can I change the column names of such a dataset?
