jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

UnsupportedOperationException when exporting StringIndexer with LogisticRegression #29

Closed: vikatskhay closed this issue 7 years ago

vikatskhay commented 7 years ago

Hi,

I'm testing a very simple case just to evaluate the library and ran into an issue. Here's the code:

        // Load training data
        Dataset<Row> training = getTrainingData(jsc, sqlContext);
        StructType schema = training.schema();

        // Define the pipeline
        StringIndexer countryIndexer = new StringIndexer()
                .setInputCol("country")
                .setOutputCol("country_index");

        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{"country_index", "a", "b"})
                .setOutputCol("features");

        LogisticRegression lr = new LogisticRegression()
                .setMaxIter(10)
                .setRegParam(0.3)
                .setElasticNetParam(0.8);

        Pipeline pipeline = new Pipeline();
        pipeline.setStages(new PipelineStage[]{countryIndexer, assembler, lr});

        // Fit the model
        PipelineModel pipelineModel = pipeline.fit(training);

        // Predict
        Dataset<Row> testing = getTestingData(jsc, sqlContext);
        Dataset<Row> predictions = pipelineModel.transform(testing);
        predictions.show();

        // Export to PMML
        PMML pmml = ConverterUtil.toPMML(schema, pipelineModel);

Here's a piece of relevant output (predictions.show() and the exception):

+-----+-------+---+----+-------------+--------------+--------------------+--------------------+----------+
|label|country|  a|   b|country_index|      features|       rawPrediction|         probability|prediction|
+-----+-------+---+----+-------------+--------------+--------------------+--------------------+----------+
|  0.0|     FR|1.0|-0.2|          0.0|[0.0,1.0,-0.2]|[0.43756144584300...|[0.60767781895595...|       0.0|
|  1.0|     DE|0.9| 0.5|          1.0| [1.0,0.9,0.5]|[-0.7827870058785...|[0.31371953355157...|       1.0|
+-----+-------+---+----+-------------+--------------+--------------------+--------------------+----------+

Exception in thread "main" java.lang.UnsupportedOperationException
    at org.jpmml.converter.CategoricalFeature.toContinuousFeature(CategoricalFeature.java:63)
    at org.jpmml.converter.regression.RegressionModelUtil.createRegressionTable(RegressionModelUtil.java:232)
    at org.jpmml.converter.regression.RegressionModelUtil.createBinaryLogisticClassification(RegressionModelUtil.java:113)
    at org.jpmml.converter.regression.RegressionModelUtil.createBinaryLogisticClassification(RegressionModelUtil.java:87)
    at org.jpmml.sparkml.model.LogisticRegressionModelConverter.encodeModel(LogisticRegressionModelConverter.java:52)
    at org.jpmml.sparkml.model.LogisticRegressionModelConverter.encodeModel(LogisticRegressionModelConverter.java:39)
    at org.jpmml.sparkml.ModelConverter.registerModel(ModelConverter.java:165)
    at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:81)
    at com.vika.pmml.PmmlExample.run(PmmlExample.java:99)
    at com.vika.pmml.PmmlExample.main(PmmlExample.java:40)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

The training data:

    private static final StructType SCHEMA = new StructType(new StructField[]{
            createStructField("label", DoubleType, false),
            createStructField("country", StringType, false),
            createStructField("a", DoubleType, false),
            createStructField("b", DoubleType, false)
    });

    private Dataset<Row> getTrainingData(JavaSparkContext jsc, SQLContext sqlContext) {

        JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
                RowFactory.create(1.0, "DE", 1.1, 0.1),
                RowFactory.create(0.0, "FR", 1.0, -1.0),
                RowFactory.create(0.0, "FR", 1.3, 1.0),
                RowFactory.create(1.0, "DE", 1.2, -0.5)
        ));
        return sqlContext.createDataFrame(jrdd, SCHEMA);
    }

The exception is thrown when the country feature is handled in RegressionModelUtil.createRegressionTable().

Am I doing something wrong, or does using StringIndexer with LogisticRegression simply not work correctly?

By the way, I also tried the same code with library version 1.0.9 and Spark 1.6, and there the export did succeed:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
    <Header>
        <Application name="JPMML-SparkML" version="1.0.9"/>
        <Timestamp>2017-07-14T16:20:50Z</Timestamp>
    </Header>
    <DataDictionary>
        <DataField name="country" optype="categorical" dataType="string">
            <Value value="FR"/>
            <Value value="DE"/>
        </DataField>
        <DataField name="a" optype="continuous" dataType="double"/>
        <DataField name="b" optype="continuous" dataType="double"/>
        <DataField name="label" optype="categorical" dataType="double">
            <Value value="0"/>
            <Value value="1"/>
        </DataField>
    </DataDictionary>
    <RegressionModel functionName="classification" normalizationMethod="softmax">
        <MiningSchema>
            <MiningField name="label" usageType="target"/>
            <MiningField name="country"/>
            <MiningField name="a"/>
            <MiningField name="b"/>
        </MiningSchema>
        <Output>
            <OutputField name="probability_0" feature="probability" value="0"/>
            <OutputField name="probability_1" feature="probability" value="1"/>
        </Output>
        <RegressionTable intercept="-0.4375614458430096" targetCategory="1">
            <NumericPredictor name="country" coefficient="1.2203484517215881"/>
            <NumericPredictor name="a" coefficient="0.0"/>
            <NumericPredictor name="b" coefficient="0.0"/>
        </RegressionTable>
        <RegressionTable intercept="0.0" targetCategory="0"/>
    </RegressionModel>
</PMML>

However, evaluating this PMML didn't work:

Exception in thread "main" org.jpmml.evaluator.TypeCheckException: Expected DOUBLE, but got STRING (FR)
    at org.jpmml.evaluator.TypeUtil.toDouble(TypeUtil.java:617)
    at org.jpmml.evaluator.TypeUtil.cast(TypeUtil.java:424)
    at org.jpmml.evaluator.FieldValue.getValue(FieldValue.java:320)
    at org.jpmml.evaluator.FieldValue.asNumber(FieldValue.java:269)
    at org.jpmml.evaluator.RegressionModelEvaluator.evaluateRegressionTable(RegressionModelEvaluator.java:194)
    at org.jpmml.evaluator.RegressionModelEvaluator.evaluateClassification(RegressionModelEvaluator.java:146)
    at org.jpmml.evaluator.RegressionModelEvaluator.evaluate(RegressionModelEvaluator.java:70)
    at org.jpmml.evaluator.ModelEvaluator.evaluate(ModelEvaluator.java:346)

Thank you very much in advance!

vruusmann commented 7 years ago

The StringIndexer transformation translates string labels to indexes. The indexes are assigned by "popularity", so the most frequent string label will be mapped to 0, the second most frequent string label will be mapped to 1, and so on.
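The frequency-based assignment described above can be sketched in plain Java. This is only an illustration, not Spark's implementation: the class `FrequencyIndexSketch` and its `fit` method are hypothetical names, and the alphabetical tie-breaking is an assumption made here to keep the result deterministic (Spark's behaviour for equally frequent labels may differ).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Spark ML or JPMML-SparkML.
final class FrequencyIndexSketch {

    // Maps each distinct label to an index, with the most frequent label getting 0.
    static Map<String, Double> fit(List<String> labels) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Sort by descending count.
        // Assumption: ties are broken alphabetically, only to make this sketch deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        Map<String, Double> index = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : entries) {
            index.put(entry.getKey(), (double) index.size());
        }
        return index;
    }

    public static void main(String[] args) {
        // Made-up sample without ties: FR occurs 3 times, DE twice, US once.
        List<String> countries = Arrays.asList("FR", "DE", "FR", "US", "FR", "DE");
        System.out.println(fit(countries)); // prints {FR=0.0, DE=1.0, US=2.0}
    }
}
```

The resulting numbers only encode rank by frequency; nothing about them reflects a property of the countries themselves.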

The output of a StringIndexer transformation is a numeric column, but the numbers carry no meaning of their own (e.g. if your example has the mapping FR -> 0 and DE -> 1, what would that mean: that France is "less than" Germany?).

You can give this numeric column meaning by binarizing it in a "one-vs-rest" fashion using the OneHotEncoder transformation:

StringIndexer countryIndexer = new StringIndexer()
    .setInputCol("country")
    .setOutputCol("country_index");

// THIS!
OneHotEncoder countryBinarizer = new OneHotEncoder()
    .setInputCol("country_index")
    .setOutputCol("country_bitvector");

VectorAssembler assembler = new VectorAssembler()
    .setInputCols(new String[]{"country_bitvector", "a", "b"})
    .setOutputCol("features");
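To make the effect of the binarization concrete, here is a minimal plain-Java sketch of what a one-hot encoding of an index column produces. It is not Spark code: `OneHotSketch.encode` is a hypothetical helper, and it assumes the last category is dropped and represented by the all-zeros vector (Spark's OneHotEncoder default, `dropLast = true`).

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Spark ML or JPMML-SparkML.
final class OneHotSketch {

    // Encodes a category index as a 0/1 vector of length (numCategories - 1).
    // Assumption: the last category is dropped and shows up as the all-zeros vector.
    static double[] encode(int index, int numCategories) {
        if (index < 0 || index >= numCategories) {
            throw new IllegalArgumentException("Index out of range: " + index);
        }
        double[] bits = new double[numCategories - 1];
        if (index < bits.length) {
            bits[index] = 1.0;
        }
        return bits;
    }

    public static void main(String[] args) {
        // Two categories as in this thread: FR -> index 0, DE -> index 1.
        System.out.println("FR -> " + Arrays.toString(encode(0, 2))); // FR -> [1.0]
        System.out.println("DE -> " + Arrays.toString(encode(1, 2))); // DE -> [0.0]
    }
}
```

Each element of the vector answers a yes/no question ("is the country FR?"), so a coefficient attached to it has a clear interpretation, unlike a coefficient attached to the raw index.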

Exception in thread "main" java.lang.UnsupportedOperationException
    at org.jpmml.converter.CategoricalFeature.toContinuousFeature(CategoricalFeature.java:63)

Basically, the JPMML-SparkML library has "detected" that you're trying to use a categorical feature in a context that requires a continuous feature.

It's a valid exception, because you should never pass a "raw" StringIndexer output column to any ML algorithm (not just LogisticRegression). That said, to avoid confusion, the exception type should be something other than java.lang.UnsupportedOperationException, and it should carry a proper message (e.g. java.lang.IllegalArgumentException("Cannot cast a feature from categorical operational type to continuous operational type")).
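A rough sketch of what such a more descriptive failure could look like is given below. The classes here are hypothetical stand-ins written for this illustration, not the actual org.jpmml.converter classes, and appending the feature name to the message is an extra assumption beyond the wording suggested above.

```java
// Hypothetical stand-ins for illustration; not the actual org.jpmml.converter classes.
final class FeatureCastSketch {

    static final class ContinuousFeature {
        final String name;

        ContinuousFeature(String name) {
            this.name = name;
        }
    }

    static final class CategoricalFeature {
        final String name;

        CategoricalFeature(String name) {
            this.name = name;
        }

        // A categorical feature has no sensible numeric interpretation, so the cast
        // is rejected with a descriptive message instead of a bare exception.
        ContinuousFeature toContinuousFeature() {
            throw new IllegalArgumentException(
                "Cannot cast a feature from categorical operational type to continuous operational type"
                    + " (feature: " + name + ")");
        }
    }

    public static void main(String[] args) {
        try {
            new CategoricalFeature("country").toContinuousFeature();
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```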

"By the way, I also tried the same code with the library version 1.0.9 and Spark 1.6, it did get exported."

Apache Spark 1.6.X and JPMML-SparkML 1.0.X are no longer supported.

The export operation succeeds, but the resulting PMML document is nonsensical: it contains an instruction to multiply the name of the country by 1.2203484517215881.
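The arithmetic behind that regression table can be replayed in a few lines of Java. The sketch below uses the intercept and coefficient from the exported document and assumes the coefficient was fitted against country_index (FR -> 0.0, DE -> 1.0, as in the predictions table); the class and method names are hypothetical. It is not how the JPMML evaluator is implemented, only an illustration of why a string input cannot be scored.

```java
// Hypothetical illustration; values are copied from the exported PMML document.
final class RegressionTableSketch {

    static final double INTERCEPT = -0.4375614458430096;
    static final double COUNTRY_COEFFICIENT = 1.2203484517215881;

    // Linear predictor for target category "1".
    // The coefficients of "a" and "b" are 0.0 in the document, so they are omitted.
    static double margin(double countryIndex) {
        return INTERCEPT + COUNTRY_COEFFICIENT * countryIndex;
    }

    // Softmax over (margin, 0.0) reduces to the logistic function.
    static double probabilityOfOne(double countryIndex) {
        return 1.0 / (1.0 + Math.exp(-margin(countryIndex)));
    }

    public static void main(String[] args) {
        // With the numeric index, the numbers agree with Spark's own predictions.
        System.out.println("FR (index 0.0): P(label=1) = " + probabilityOfOne(0.0));
        System.out.println("DE (index 1.0): P(label=1) = " + probabilityOfOne(1.0));

        // The PMML document, however, declares "country" as a string field,
        // so the evaluator is handed "FR" and there is no number to multiply.
        try {
            margin(Double.parseDouble("FR"));
        } catch (NumberFormatException e) {
            System.out.println("Cannot score a country name: " + e.getMessage());
        }
    }
}
```

For the index values the results line up with the rawPrediction and probability columns shown earlier (about 0.39 for FR and 0.69 for DE as the probability of label 1), which suggests the coefficient belongs to the index column rather than to the string field named in the document.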

vikatskhay commented 7 years ago

Thanks a lot @vruusmann! That's really helpful.