jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

Interaction Stage fails when we perform model stacking. #64

Closed SleepyThread closed 5 years ago

SleepyThread commented 5 years ago

Currently I am unable to serialize a Spark pipeline to PMML, as it fails with the following error:

Exception in thread "main" java.lang.IllegalArgumentException: Expected 3 features, got 6 features
    at org.jpmml.sparkml.ModelConverter.encodeSchema(ModelConverter.java:156)
    at org.jpmml.sparkml.ModelConverter.registerModel(ModelConverter.java:170)
    at org.jpmml.sparkml.PMMLBuilder.build(PMMLBuilder.java:116)
    at org.jpmml.sparkml.PMMLBuilder.buildByteArray(PMMLBuilder.java:249)
    at org.jpmml.sparkml.PMMLBuilder.buildByteArray(PMMLBuilder.java:245)

This is because of a difference in how the number of features is calculated in the JPMML-SparkML Interaction code.

Here is sample code to reproduce this scenario: the first, simple pipeline with an Interaction stage serializes fine, but the pipeline with multiple models fails.

Version details: jpmml-sparkml 1.4.7, spark 2.3.0

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{Interaction, VectorAssembler}
import org.apache.spark.sql.SparkSession
import org.jpmml.sparkml.PMMLBuilder

val spark = SparkSession.builder.appName("tst").master("local").getOrCreate

  val df = spark.createDataFrame(Seq(
    (1, 1, 2, 3, 8, 4, 5, 0),
    (2, 4, 3, 8, 7, 9, 8, 1),
    (3, 6, 1, 9, 2, 3, 6, 0),
    (4, 10, 8, 6, 9, 4, 5, 1),
    (5, 9, 2, 7, 10, 7, 3, 0),
    (6, 1, 1, 4, 2, 8, 4, 0)
  )).toDF("id1", "id2", "id3", "id4", "id5", "id6", "id7", "label")

  // First model
  val vector1 = new VectorAssembler().
    setInputCols(Array("id2", "id3", "id4")).
    setOutputCol("vec1")

  val vector2 = new VectorAssembler().
    setInputCols(Array("id1", "id6", "id7")).
    setOutputCol("vec2")

  // This interaction serializes fine.
  val interaction1 = new Interaction()
    .setInputCols(Array("vec1", "vec2"))
    .setOutputCol("int_output")

  val rfc = new RandomForestClassifier()
    .setNumTrees(10)
    .setFeaturesCol("int_output")
    .setLabelCol("label")
    .setPredictionCol("rfc_output")
    .setRawPredictionCol("rfc_raw_prediction")
    .setProbabilityCol("rfc_probability")

  val workingPipeline = new Pipeline().setStages(Array(vector1, vector2, interaction1, rfc))

  private val workingPipelineModel: PipelineModel = workingPipeline.fit(df)

  val bytes = new PMMLBuilder(df.schema, workingPipelineModel).buildByteArray

  // Works fine.

  // Taking Result of 1st model
  val vectorAssembler2 = new VectorAssembler().setInputCols(Array("rfc_output")).setOutputCol("rfc_input_vec")

  // Creating a new Interaction with vector 1 and output of first model
  val interaction = new Interaction()
    .setInputCols(Array("vec1", "rfc_input_vec"))
    .setOutputCol("features")

  val metaRfc = new RandomForestClassifier()
    .setNumTrees(10)
    .setFeaturesCol("features")
    .setLabelCol("label")
    .setPredictionCol("meta_prediction")
    .setRawPredictionCol("meta_raw_prediction")
    .setProbabilityCol("meta_probability")

  val stages = Array(vector1, vector2, interaction1, rfc, vectorAssembler2, interaction, metaRfc)
  val metaPipeline = new Pipeline().setStages(stages)

  private val metaPipelineModel: PipelineModel = metaPipeline.fit(df)

  // This fails with:
  // Exception in thread "main" java.lang.IllegalArgumentException: Expected 3 features, got 6 features
  val failedBytes = new PMMLBuilder(df.schema, metaPipelineModel).buildByteArray

  println(failedBytes.size)

I understand that the Interaction code generates all combinations of the two input vectors involved (which comes to 6), but the Spark model has hardcoded the number of features as 3.
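The mismatch can be sketched with plain arithmetic. One plausible reading (an assumption on my part, not confirmed in this thread) is that Spark stores the interacted feature count treating rfc_output as a single continuous value, while the converter expands it as a binary categorical field:

```scala
// Sketch of the feature-count arithmetic behind the error. The "binary
// categorical expansion" reading of the converter's behaviour is an
// assumption, not something confirmed in this thread.
// Spark's Interaction output size is the product of the input sizes.
val vec1Size = 3          // VectorAssembler over id2, id3, id4
val rfcInputVecSize = 1   // rfc_output wrapped in a single-element vector

// Treating rfc_output as one continuous value (what the Spark model stores):
val sparkCount = vec1Size * rfcInputVecSize // 3 "expected" features

// Treating rfc_output as a binary categorical field expanded into two
// indicator columns:
val numClasses = 2
val jpmmlCount = vec1Size * (rfcInputVecSize * numClasses) // 6 "got" features

println(s"expected $sparkCount, got $jpmmlCount")
```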

Can the team explain why this is happening and how we might fix the issue?

I am able to serialize the model via Spark and MLeap, but not JPMML.

Thanks.

vruusmann commented 5 years ago

jpmml-sparkml : 1.4.7

Have you tested your code with the latest 1.4.9 version? There have been some changes (between 1.4.7 and 1.4.9), which affect the "PMML definition" of model output columns.

// Taking Result of 1st model
val vectorAssembler2 = new VectorAssembler().setInputCols(Array("rfc_output")).setOutputCol("rfc_input_vec")

The vectorAssembler2 step seems completely unnecessary here. Why don't you interact vec1 and rfc_output columns directly?
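The suggested simplification might look like this (an untested sketch that reuses the vector1, vector2, interaction1, rfc, and metaRfc definitions from the reproduction code above; Interaction accepts numeric columns directly, so the single-column assembler can be dropped):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.Interaction

// Interact vec1 with the first model's prediction column directly,
// skipping the intermediate single-column VectorAssembler.
val directInteraction = new Interaction()
  .setInputCols(Array("vec1", "rfc_output"))
  .setOutputCol("features")

val simplifiedStages = Array(vector1, vector2, interaction1, rfc, directInteraction, metaRfc)
val simplifiedPipeline = new Pipeline().setStages(simplifiedStages)
```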

SleepyThread commented 5 years ago

jpmml-sparkml : 1.4.7

Have you tested your code with the latest 1.4.9 version? There have been some changes (between 1.4.7 and 1.4.9), which affect the "PMML definition" of model output columns.

Does not work with jpmml-sparkml 1.4.9 either. Still getting the same error.

SleepyThread commented 5 years ago

Have you tested your code with the latest 1.4.9 version? There have been some changes (between 1.4.7 and 1.4.9), which affect the "PMML definition" of model output columns.

When you say a minor version upgrade leads to an update in the "PMML definition": will this cause incompatibility between minor versions?

Do we have to re-serialize everything after a minor version upgrade?

vruusmann commented 5 years ago

.. minor version upgrade will lead to update in "PMML definition"

In this context, "PMML definition" means the information that is extracted from the Apache Spark pipeline, and how it's mapped in the JPMML-SparkML -> JPMML-Converter -> JPMML-Model stack. For example, the following commit changed the "PMML definition" of the result field(s) of clustering models: https://github.com/jpmml/jpmml-sparkml/commit/fbeaadcbf0e1519effaf4f3cadfd95f0751b950a

Do we have to re-serialize everything after a minor version upgrade?

Absolutely not. All JPMML-SparkML 1.4.X versions generate PMML documents that conform to PMML schema version 4.3.
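One way to sanity-check which schema version a generated document declares is to read the version attribute off the root PMML element. A minimal pure-Scala sketch (the sample document string here is made up for illustration):

```scala
// Check which PMML schema version a document declares by extracting the
// root element's "version" attribute. The sample string is hypothetical;
// in practice this would be the output of PMMLBuilder.buildByteArray.
val pmml =
  """<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3"></PMML>"""

val versionPattern = """version="([^"]+)"""".r
val declaredVersion = versionPattern.findFirstMatchIn(pmml).map(_.group(1))

println(declaredVersion) // prints Some(4.3)
```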

SleepyThread commented 5 years ago

@vruusmann Do you have any idea regarding the issue or a fix?

vruusmann commented 5 years ago

@SleepyThread It's probably something very trivial. However, I'm currently unable to experiment with the JPMML-SparkML codebase locally (due to uncommitted changes related to JPMML-Model and JPMML-Converter library updates).

SleepyThread commented 5 years ago

@vruusmann any updates?

PowerToThePeople111 commented 3 years ago

I have a similar issue. I trained 2 base models and one meta model that uses the probability columns of the first 2 models in order to make a final prediction.

Let's say the base models are named A and B, and the meta model is C. It might also be important to note that I trained A, B, and C separately, in different pipelines, to keep the cross-validations from exploding in complexity (there would have been many more combinations to test than when tuning them separately). While A, B, and C all had different probability output columns (prob1, prob2, and prob respectively), they still had the same rawPrediction column name, which was of course a problem for Spark. So I added SQLTransformers after A and B which just selected the relevant features: namely prob1 (after A), and prob1 and prob2 (after B). This fixed the issue for Spark, and I could create a working PipelineModel that uses A, B, and C.
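The column-pruning workaround described above could be sketched with SQLTransformer roughly as follows (all column names here are hypothetical placeholders, since the actual schema is not shown in this thread):

```scala
import org.apache.spark.ml.feature.SQLTransformer

// After base model A: keep only the columns downstream stages need,
// dropping A's clashing rawPrediction column. "__THIS__" is the
// SQLTransformer placeholder for the input DataFrame.
val selectAfterA = new SQLTransformer()
  .setStatement("SELECT label, features, prob1 FROM __THIS__")

// After base model B: carry both base probabilities forward.
val selectAfterB = new SQLTransformer()
  .setStatement("SELECT label, features, prob1, prob2 FROM __THIS__")
```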

But when trying to convert it to PMML, I got an error stating that prob1 cannot be found. The issue could potentially be related.

vruusmann commented 3 years ago

@PowerToThePeople111 Please move your comment to a new issue. I don't want to mix unrelated technical content.