jpmml / sklearn2pmml

Python library for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Error generating PMML file, LabelEncoder() not working when column contains missing values #61

Closed: bbzzzz closed this issue 7 years ago

bbzzzz commented 7 years ago

Hi,

I got the following error when trying to generate a PMML file:

Oct 30, 2017 8:22:55 AM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 97 ms.
Oct 30, 2017 8:22:55 AM org.jpmml.sklearn.Main run
INFO: Converting..
Oct 30, 2017 8:22:55 AM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: Field device_type has valid values [ANDROID, CHROMEOS, IPAD, IPHONE, IPOD, LINUX, MAC, WINDOWS]
    at org.jpmml.converter.PMMLEncoder.toCategorical(PMMLEncoder.java:189)
    at sklearn.preprocessing.LabelEncoder.encodeFeatures(LabelEncoder.java:97)
    at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:75)
    at sklearn.Initializer.encodeFeatures(Initializer.java:53)
    at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:82)
    at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:128)
    at org.jpmml.sklearn.Main.run(Main.java:144)
    at org.jpmml.sklearn.Main.main(Main.java:93)

Exception in thread "main" java.lang.IllegalArgumentException: Field device_type has valid values [ANDROID, CHROMEOS, IPAD, IPHONE, IPOD, LINUX, MAC, WINDOWS]
    at org.jpmml.converter.PMMLEncoder.toCategorical(PMMLEncoder.java:189)
    at sklearn.preprocessing.LabelEncoder.encodeFeatures(LabelEncoder.java:97)
    at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:75)
    at sklearn.Initializer.encodeFeatures(Initializer.java:53)
    at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:82)
    at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:128)
    at org.jpmml.sklearn.Main.run(Main.java:144)
    at org.jpmml.sklearn.Main.main(Main.java:93)

The problem seems to be related to the device_type column.

My data:

X.columns = ['device_type','weekday','distance','grossamount']
var_char_list = ['device_type','weekday']
var_num_list = ['distance','grossamount']

Values in column device_type:

X.device_type.unique()

array([u'WINDOWS', u'ANDROID', u'MAC', u'IPHONE', u'IPAD', u'CHROMEOS',nan, u'LINUX', u'IPOD'], dtype=object)

If I drop device_type, the PMML file is generated successfully. (The other categorical column, weekday, does not contain missing values.)

Here is my code for generating PMML:

sklearn2pmml(pipeline, "test.pmml", with_repr = True)

Mapper:

mapper = DataFrameMapper(
    [(var, [CategoricalDomain(invalid_value_treatment="as_missing", missing_value_replacement="Unknown"), LabelEncoder()]) for var in var_char_list]
    + [(var, ContinuousDomain(invalid_value_treatment="as_missing", missing_value_replacement="-1.0")) for var in var_num_list],
    input_df=True, df_out=True
)

Pipeline:

pipeline = PMMLPipeline([("mapper", mapper),("classifier", RandomForestClassifier())])
pipeline.fit(X,y)

pipeline.fit(X, y) runs without any problem and returns:

PMMLPipeline(steps=[('mapper', DataFrameMapper(default=False, df_out=True,
        features=[('device_type', [CategoricalDomain(), LabelEncoder()]), ('weekday', [CategoricalDomain(), LabelEncoder()]), ('distance', ContinuousDomain()), ('grossamount', ContinuousDomain())],
        input_df=True, sparse=False)),
       ('classifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

Thanks, Bohan

vruusmann commented 7 years ago

SEVERE: Failed to convert
java.lang.IllegalArgumentException: Field device_type has valid values [ANDROID, CHROMEOS, IPAD, IPHONE, IPOD, LINUX, MAC, WINDOWS]
    at org.jpmml.converter.PMMLEncoder.toCategorical(PMMLEncoder.java:189)

This exception means that the LabelEncoder transformation is trying to define the valid value space for the field device_type, but the information that it has is in conflict with the existing information. As a matter of caution, the sklearn2pmml package refuses to continue (in order to avoid generating potentially problematic PMML).

In other words, it means that CategoricalDomain and LabelEncoder transformations are seeing a different set of valid values.

array([u'WINDOWS', u'ANDROID', u'MAC', u'IPHONE', u'IPAD', u'CHROMEOS',nan, u'LINUX', u'IPOD'], dtype=object)

Your column is a mix of string and numeric (float64?) values.
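
One quick way to confirm such a mix is to count the values by Python type. This is only a minimal sketch on a plain list copied from the unique() output quoted above; count_types is a hypothetical helper name and is not part of sklearn2pmml or pandas.

```python
from collections import Counter


def count_types(values):
    """Count how many values of each Python type appear in `values`."""
    return Counter(type(v).__name__ for v in values)


# Values copied from the X.device_type.unique() output; nan is a Python float.
device_types = ["WINDOWS", "ANDROID", "MAC", "IPHONE", "IPAD",
                "CHROMEOS", float("nan"), "LINUX", "IPOD"]

type_counts = count_types(device_types)
print(type_counts)
```

More than one type name in the result indicates that the column is not purely string-valued.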

To fix the problem, convert the nan values to None, so that every non-missing value in the column is a string.
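
The suggested conversion could look roughly like the following. It is a sketch on a plain Python list using only the standard library; nan_to_none is a hypothetical helper name, and with a pandas column the equivalent replacement would have to be applied to X.device_type before calling pipeline.fit.

```python
import math


def nan_to_none(values):
    """Replace float NaN entries with None; leave all other values unchanged."""
    return [None if isinstance(v, float) and math.isnan(v) else v
            for v in values]


# A shortened sample of the device_type values, including one missing entry.
device_types = ["WINDOWS", "ANDROID", float("nan"), "LINUX"]

cleaned = nan_to_none(device_types)
print(cleaned)
```

The isinstance check is there because math.isnan raises a TypeError when given a string.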

bbzzzz commented 7 years ago

Thank you for your quick reply!