Closed: SleepyThread closed this issue 5 years ago
jpmml-sparkml : 1.4.7
Have you tested your code with the latest 1.4.9 version? There have been some changes (between 1.4.7 and 1.4.9), which affect the "PMML definition" of model output columns.
// Taking Result of 1st model
val vectorAssembler2 = new VectorAssembler()
  .setInputCols(Array("rfc_output"))
  .setOutputCol("rfc_input_vec")
The vectorAssembler2 step seems completely unnecessary here. Why don't you interact the vec1 and rfc_output columns directly?
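For illustration, a minimal sketch of what that could look like (the column names vec1 and rfc_output are taken from the quoted snippet; the output column name is an assumption):

```scala
import org.apache.spark.ml.feature.Interaction

// Feed both columns into Interaction directly, skipping the
// intermediate VectorAssembler ("interacted" is a hypothetical name).
val interaction = new Interaction()
  .setInputCols(Array("vec1", "rfc_output"))
  .setOutputCol("interacted")
```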
jpmml-sparkml : 1.4.7
Have you tested your code with the latest 1.4.9 version? There have been some changes (between 1.4.7 and 1.4.9), which affect the "PMML definition" of model output columns.
It does not work with jpmml-sparkml 1.4.9 either. I am still getting the same error.
Have you tested your code with the latest 1.4.9 version? There have been some changes (between 1.4.7 and 1.4.9), which affect the "PMML definition" of model output columns.
When you say that a minor version upgrade will lead to an update in the "PMML definition", will this lead to incompatibility between minor version upgrades?
Do we have to re-serialize everything after a minor version upgrade?
.. a minor version upgrade will lead to an update in the "PMML definition"
In this context, "PMML definition" means the information that is extracted from the Apache Spark pipeline, and how it's mapped in the JPMML-SparkML -> JPMML-Converter -> JPMML-Model stack. For example, the following commit changed the "PMML definition" of the result field(s) of clustering models: https://github.com/jpmml/jpmml-sparkml/commit/fbeaadcbf0e1519effaf4f3cadfd95f0751b950a
Do we have to re-serialize everything after a minor version upgrade?
Absolutely not. All JPMML-SparkML 1.4.X versions generate PMML documents that conform to PMML schema version 4.3.
@vruusmann Do you have any idea regarding the issue/fix?
@SleepyThread It's probably something very trivial. However, I'm currently unable to experiment with the JPMML-SparkML codebase locally (due to uncommitted changes related to JPMML-Model and JPMML-Converter library updates).
@vruusmann Any updates?
I have a similar issue. I trained 2 base models and one meta model that uses the probability columns of the first 2 models in order to make a final prediction.

Let's say the base models are named A and B, and the meta model is C. It might also be important to note that I trained A, B and C separately, in different pipelines, so that the CrossValidations would not explode in complexity; there would have been many more combinations to test compared to doing it separately.

While A, B and C all had different probability output columns (prob1, prob2 and prob, respectively), they still had the same rawPrediction column name, which was of course a problem for Spark. So I added SQLTransformers after A and B which just selected the relevant features: namely prob1 (after A), and prob1 and prob2 (after B), as sketched below. This fixed the issue for Spark, and so I could create a working PipelineModel that uses A, B and C.

But when trying to convert it to PMML, I got an error stating that prob1 cannot be found. The issue could potentially be related.
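For concreteness, a sketch of the workaround described above (the prob1/prob2 column names are from the description; the other columns in the SELECT lists are assumptions):

```scala
import org.apache.spark.ml.feature.SQLTransformer

// After model A: keep only the columns the downstream stages need
// ("label" and "features" are assumed carry-through columns).
val selectAfterA = new SQLTransformer()
  .setStatement("SELECT label, features, prob1 FROM __THIS__")

// After model B: also keep B's probability column.
val selectAfterB = new SQLTransformer()
  .setStatement("SELECT label, features, prob1, prob2 FROM __THIS__")
```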
@PowerToThePeople111 Please move your comment to a new issue. I don't want to mix unrelated technical content.
Currently I am unable to serialize a Spark pipeline to PMML, as it fails with an error.
This is because of a difference in the way the number of features is calculated in the JPMML-SparkML Interaction conversion code.
Here is sample code to reproduce this scenario: the first, simple pipeline with an Interaction stage serializes fine, but the pipeline with multiple models fails.
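Since the original sample code is not preserved here, the following is a minimal sketch of the kind of failing pipeline described (the column names vec1, rfc_output and rfc_input_vec follow the snippet quoted earlier in the thread; all other names, and the choice of RandomForestClassifier, are assumptions):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{Interaction, VectorAssembler}

// Assemble the raw feature columns (names here are assumptions).
val vectorAssembler1 = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3"))
  .setOutputCol("vec1")

// 1st model; its output columns are renamed so they don't clash
// with the 2nd model's default column names.
val rfc1 = new RandomForestClassifier()
  .setFeaturesCol("vec1")
  .setLabelCol("label")
  .setPredictionCol("rfc_output")
  .setRawPredictionCol("rfc_raw")
  .setProbabilityCol("rfc_prob")

// Taking result of 1st model (as in the quoted snippet).
val vectorAssembler2 = new VectorAssembler()
  .setInputCols(Array("rfc_output"))
  .setOutputCol("rfc_input_vec")

// Interact the original feature vector with the 1st model's output.
val interaction = new Interaction()
  .setInputCols(Array("vec1", "rfc_input_vec"))
  .setOutputCol("interacted")

// 2nd model, trained on the interaction features.
val rfc2 = new RandomForestClassifier()
  .setFeaturesCol("interacted")
  .setLabelCol("label")

val pipeline = new Pipeline()
  .setStages(Array(vectorAssembler1, rfc1, vectorAssembler2, interaction, rfc2))
```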
Version details: jpmml-sparkml 1.4.7, Spark 2.3.0
I understand that the Interaction code generates all the permutations and combinations of the two features involved (which comes to 6), but the Spark model has hard-coded the number of features as 3 (see the note below).
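As a hedged guess (not confirmed anywhere in this thread), the mismatch would be explained if the two sides disagree on how to count the 1st model's prediction column. A sketch of that arithmetic, assuming the original feature vector has 3 features and the 1st model is a binary classifier:

```scala
// Both sizes are the product of the input dimensions; the assumed
// difference is how the prediction column is counted.
val vec1Size = 3 // assumption: original feature vector has 3 features

// Spark side: prediction counted as 1 continuous feature.
val sparkInteractionSize = vec1Size * 1 // = 3, what the Spark model records

// Converter side (assumption): prediction expanded as a categorical
// field with 2 categories, one interaction term per category.
val pmmlInteractionSize = vec1Size * 2 // = 6
```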
Can the team explain why this is happening, and how we might fix the issue?
I am able to serialize the model via Spark and MLeap, but not JPMML.
Thanks.