Closed xiaoSUM closed 6 months ago
Pay attention to the exception message!
It complains about the OneHotEncoder class, which is the initial/unfitted state of the transformer! As such, it does not contain any fitted state (for example, the mapping between class labels and their one-hot encoded indices).
When you fit your pipeline, the OneHotEncoder step is replaced with a OneHotEncoderModel step (note the "Model" suffix in the class name). This class is recognized and supported by the JPMML-SparkML library.
For a list of supported transformers, see this mappings file: https://github.com/jpmml/jpmml-sparkml/blob/2.0.1/pmml-sparkml/src/main/resources/META-INF/sparkml2pmml.properties
The mapping for OneHotEncoderModel is located on line 14 there.
TLDR: It appears to me that you are attempting to convert a pipeline object which contains one or more unfitted steps. For example, there is an unfitted OneHotEncoder step in there.
The JPMML-SparkML library assumes that all steps have been fitted.
Feel free to reopen this issue, if my initial instinct/reaction about the nature of this problem turns out to be incorrect.
Two questions to you:
- Can you paste the full exception stack trace here (full depth of the exception stack)?
- Why don't you upgrade from JPMML-SparkML 2.0.1 to 2.0.3 (the latest 2.0.X release version as of today) and see if anything changes?
Thank you for your guidance! I simply reproduced the problem.
<dependency>
    <groupId>org.jpmml</groupId>
    <artifactId>pmml-sparkml</artifactId>
    <version>2.0.3</version>
</dependency>
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.RFormula

object test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").getOrCreate()
    val irisData = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("E:\\code\\20230201git\\data_mining_arithmetic\\src\\main\\resources\\data\\iris.csv")
    val irisSchema = irisData.schema
    val rFormula = new RFormula().setFormula("species ~ .")
    val dtClassifier = new DecisionTreeClassifier().setLabelCol(rFormula.getLabelCol).setFeaturesCol(rFormula.getFeaturesCol)
    val pipeline = new Pipeline().setStages(Array(rFormula, dtClassifier))
    val pipelineModel = pipeline.fit(irisData)

    import org.jpmml.sparkml.PMMLBuilder
    val pmml = new PMMLBuilder(irisSchema, pipelineModel).build()
  }
}
Very interesting! Especially the fact that your example Scala code doesn't contain any references to the OneHotEncoder in any shape or form (whether initial/unfitted or final/fitted).
Another interesting thing is that the exception is thrown on line 192 of ConverterFactory.java. This agrees with JPMML-SparkML version 2.0.0, but it doesn't agree with JPMML-SparkML versions 2.0.1 through 2.0.3, where it should be line 193.
https://github.com/jpmml/jpmml-sparkml/blob/2.0.0/pmml-sparkml/src/main/java/org/jpmml/sparkml/ConverterFactory.java#L192
https://github.com/jpmml/jpmml-sparkml/blob/2.0.1/pmml-sparkml/src/main/java/org/jpmml/sparkml/ConverterFactory.java#L192-L193
It still looks like some kind of JPMML-SparkML library configuration problem on your computer. Basically, your application classpath contains some custom sparkml2pmml.properties file(s), where there is a mapping from the org.apache.spark.ml.feature.OneHotEncoder transformer class to some transformer converter class. The correct mapping would have org.apache.spark.ml.feature.OneHotEncoderModel on its left side (aka key).
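For illustration, a correct sparkml2pmml.properties entry keys on the fitted "Model" class. The converter class name below is a plausible sketch based on the JPMML-SparkML package layout, so verify it against the mappings file linked above:

```properties
# Correct: the key is the fitted model class
org.apache.spark.ml.feature.OneHotEncoderModel = org.jpmml.sparkml.feature.OneHotEncoderModelConverter

# Incorrect: keying on the unfitted transformer class
# (the misconfiguration suspected in this thread)
#org.apache.spark.ml.feature.OneHotEncoder = ...
```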
This would be such a fundamental JPMML-SparkML library configuration issue that it should make the library unusable everywhere, for everybody. Yet, all its GitHub Actions CI integration tests are passing cleanly, and there are no other people complaining about it. This suggests that the misconfiguration is specific to your computer/production environment.
Anyway, will try to run your example on my computer (with the latest JPMML-SparkML 2.0.3), and see what happens.
> Anyway, will try to run your example on my computer (with the latest JPMML-SparkML 2.0.3), and see what happens.
Started my local Apache Spark instance like this (I'm pulling JPMML-SparkML version 2.0.3 from the official repository, there is no chance of local application classpath contamination):
$ export SPARK_HOME=/opt/spark-3.0.3
$ $SPARK_HOME/bin/spark-shell --packages org.jpmml:pmml-sparkml:2.0.3
Then, I copy-pasted your example code into the console. Everything was successful:
scala> import org.jpmml.sparkml.PMMLBuilder
import org.jpmml.sparkml.PMMLBuilder
scala> val pmml = new PMMLBuilder(irisSchema, pipelineModel).build()
pmml: org.dmg.pmml.PMML = org.dmg.pmml.PMML@6089c37c
Just to be sure, dumped the PMML document into a file in a local filesystem:
new PMMLBuilder(irisSchema, pipelineModel).buildFile(new java.io.File("iris.pmml.txt"))
See for yourself: iris.pmml.txt
Closing this issue again as "not reproducible" aka "everything works as advertised".
The issue happens because of a JPMML-SparkML library mis-configuration on your computer. Specifically, there must be some extra sparkml2pmml.properties file(s) on your system or application classpath, which contain an illegal mapping whose left side is OneHotEncoder (when it should be OneHotEncoderModel).
TLDR: Run a file search on your computer, looking for files that are named "sparkml2pmml.properties" and that contain "OneHotEncoder" in them.
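On Linux/macOS, that file search could be sketched like this (the search roots are illustrative; on Windows, use the file explorer search or findstr instead):

```shell
# List sparkml2pmml.properties files under the local Maven repository and
# the current project, then keep only those that mention OneHotEncoder
find "$HOME/.m2" . -name "sparkml2pmml.properties" 2>/dev/null \
  | xargs -r grep -l "OneHotEncoder" 2>/dev/null || true
```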
Thank you very much for your guidance. I set up a container environment following your method and encountered another problem. Can you give me some ideas? (java.lang.ClassNotFoundException: jakarta.xml.bind.JAXBException)
> I ran a container environment according to your method and encountered other problems.
There are some dependencies missing (JPMML-Model library, and above).
It appears to me that Apache Spark's --packages command-line option is not deterministic. It exhibits different behaviour depending on the composition of the local Apache Maven repository: sometimes it downloads missing transitive dependencies, sometimes it doesn't.
On my computer, I have all transitive dependencies available in my local Apache Maven repository. Therefore, I can use the --packages org.jpmml:pmml-sparkml:2.0.3 shortcut.
On your computer (new container environment), this local Apache Maven repository is empty. Therefore, Apache Spark downloads some JAR files (eg. the ones containing org.jpmml.sparkml classes), but does not download some other JAR files (eg. the ones containing jakarta.xml.bind classes).
When I run the $SPARK_HOME/bin/spark-shell --packages org.jpmml:pmml-sparkml:2.0.3 command, I see the following dependency resolution log:
org.jpmml#pmml-sparkml added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-6a81ae05-cf2b-468f-ba1a-73463e866c87;1.0
confs: [default]
found org.jpmml#pmml-sparkml;2.0.3 in local-m2-cache
found org.jpmml#pmml-converter;1.5.5 in local-m2-cache
found org.jpmml#pmml-model-metro;1.6.4 in local-m2-cache
found org.jpmml#pmml-model;1.6.4 in local-m2-cache
found com.fasterxml.jackson.core#jackson-annotations;2.13.3 in local-m2-cache
[2.13.3] com.fasterxml.jackson.core#jackson-annotations;[2.11.0, 2.13.3]
found jakarta.xml.bind#jakarta.xml.bind-api;3.0.1 in local-m2-cache
found org.glassfish.jaxb#jaxb-runtime;3.0.2 in local-m2-cache
found com.sun.activation#jakarta.activation;2.0.1 in local-m2-cache
found org.glassfish.jaxb#jaxb-core;3.0.2 in local-m2-cache
found com.sun.istack#istack-commons-runtime;4.0.1 in local-m2-cache
found com.google.guava#guava;32.1.1-jre in local-m2-cache
found com.google.guava#failureaccess;1.0.1 in local-m2-cache
found com.google.guava#listenablefuture;9999.0-empty-to-avoid-conflict-with-guava in local-m2-cache
found com.google.code.findbugs#jsr305;3.0.2 in local-m2-cache
found org.checkerframework#checker-qual;3.33.0 in local-m2-cache
found com.google.errorprone#error_prone_annotations;2.18.0 in local-m2-cache
found com.google.j2objc#j2objc-annotations;2.8 in local-m2-cache
found org.jpmml#pmml-converter-testing;1.5.5 in local-m2-cache
:: resolution report :: resolve 3286ms :: artifacts dl 20ms
:: modules in use:
com.fasterxml.jackson.core#jackson-annotations;2.13.3 from local-m2-cache in [default]
com.google.code.findbugs#jsr305;3.0.2 from local-m2-cache in [default]
com.google.errorprone#error_prone_annotations;2.18.0 from local-m2-cache in [default]
com.google.guava#failureaccess;1.0.1 from local-m2-cache in [default]
com.google.guava#guava;32.1.1-jre from local-m2-cache in [default]
com.google.guava#listenablefuture;9999.0-empty-to-avoid-conflict-with-guava from local-m2-cache in [default]
com.google.j2objc#j2objc-annotations;2.8 from local-m2-cache in [default]
com.sun.activation#jakarta.activation;2.0.1 from local-m2-cache in [default]
com.sun.istack#istack-commons-runtime;4.0.1 from local-m2-cache in [default]
jakarta.xml.bind#jakarta.xml.bind-api;3.0.1 from local-m2-cache in [default]
org.checkerframework#checker-qual;3.33.0 from local-m2-cache in [default]
org.glassfish.jaxb#jaxb-core;3.0.2 from local-m2-cache in [default]
org.glassfish.jaxb#jaxb-runtime;3.0.2 from local-m2-cache in [default]
org.jpmml#pmml-converter;1.5.5 from local-m2-cache in [default]
org.jpmml#pmml-converter-testing;1.5.5 from local-m2-cache in [default]
org.jpmml#pmml-model;1.6.4 from local-m2-cache in [default]
org.jpmml#pmml-model-metro;1.6.4 from local-m2-cache in [default]
org.jpmml#pmml-sparkml;2.0.3 from local-m2-cache in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 18 | 1 | 0 | 0 || 18 | 0 |
Compare your dependency resolution log against it! See any differences? What are the names of "not found" or "cannot be downloaded", etc. artifacts?
In your case, the missing jakarta.xml.bind.JAXBException class is probably located inside the org.glassfish.jaxb:jaxb-core:3.0.2 or org.glassfish.jaxb:jaxb-runtime:3.0.2 artifacts. If you append both of them manually to the --packages command, the problem should resolve itself.
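In practice, that would mean extending the --packages list with the two coordinates named above. An untested sketch of the invocation:

```shell
$SPARK_HOME/bin/spark-shell --packages org.jpmml:pmml-sparkml:2.0.3,org.glassfish.jaxb:jaxb-runtime:3.0.2,org.glassfish.jaxb:jaxb-core:3.0.2
```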
TLDR: It's an Apache Spark package management issue. I can't fix this.
> It appears to me that Apache Spark's --packages command-line option is not deterministic. It exhibits different behaviour depending on the composition of the local Apache Maven repository...
The simplest way to ensure that the local Apache Maven repository contains all the required dependencies is to build the JPMML-SparkML library locally from source checkout:
$ git clone https://github.com/jpmml/jpmml-sparkml.git
$ cd jpmml-sparkml
$ git checkout 2.0.3
$ mvn clean install
After that, the $SPARK_HOME/bin/spark-shell --packages org.jpmml:pmml-sparkml:2.0.3 command should succeed as-is.
> There are some dependencies missing (JPMML-Model library, and above).
Thank you very much for your idea. Through the https://github.com/jpmml/jpmml-sparkml/blob/2.0.X/pom.xml file, I found the dependency problem. After excluding org.glassfish.jaxb from the spark-mllib 3.0.3 dependency, running val pmml = new PMMLBuilder(irisSchema, pipelineModel).build() works normally:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.12</artifactId>
    <version>3.0.3</version>
    <exclusions>
        <exclusion>
            <groupId>org.glassfish.jaxb</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>
> Through the https://github.com/jpmml/jpmml-sparkml/blob/2.0.X/pom.xml file, I found out the dependency problem, and it worked normally after excluding spark-mllib 3.0.3 org.glassfish.jaxb
Interesting observation - do I understand correctly that the latest classpath problem (https://github.com/jpmml/jpmml-sparkml/issues/137#issuecomment-1999193096) went away after tweaking the default pom.xml file? Specifically, did you delete the exclusion tag?
This exclusion was put there because Apache Spark includes a not-so-up-to-date JAXB version. My idea was to exclude that outdated version, and bring in the very latest version via the org.jpmml:pmml-model-metro dependency chain.
It could be the case that the --packages command-line option does not pay attention to this "forced JAXB update", and proceeds to use its own bundled outdated version, which then remains incomplete/conflicting, leading to the classpath error.
The safest option would be to replace the --packages command-line option with the --jars command-line option, and provide a filesystem path to the pre-built pmml-sparkml-example-executable-${version}.jar JAR file there (available under the JPMML-SparkML releases section).
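A sketch of that suggested invocation (the local JAR path is illustrative, and the version is substituted by hand):

```shell
$SPARK_HOME/bin/spark-shell --jars /opt/jars/pmml-sparkml-example-executable-2.0.3.jar
```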
@xiaoSUM Most importantly - after you got the classpath issue sorted out, does the conversion succeed now? Did you find out where the invalid OneHotEncoder mapping was coming from?
> @xiaoSUM Most importantly - after you got the classpath issue sorted out, does the conversion succeed now? Did you find out, where was the invalid OneHotEncoder mapping coming in from?
Thank you for providing a safer solution.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.eye</groupId>
    <artifactId>data_mining_arithmetic</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>
    <name>data mining</name>
    <description>data mining arithmetic</description>
    <properties>
        <java.version>1.8</java.version>
        <scala.version>2.12.12</scala.version>
        <scala.tools.version>2.12</scala.tools.version>
        <spark.version>3.0.3</spark.version>
        <mysql.connector.version>8.0.16</mysql.connector.version>
        <postgresql.version>42.2.5</postgresql.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.0.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.12</artifactId>
            <version>3.0.3</version>
            <exclusions>
                <exclusion>
                    <groupId>org.glassfish.jaxb</groupId>
                    <artifactId>*</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.jpmml</groupId>
            <artifactId>pmml-sparkml</artifactId>
            <version>2.0.3</version>
        </dependency>
    </dependencies>
    <build>
        <resources>
            <resource>
                <directory>src/main/java</directory>
                <includes>
                    <include>**/*.xml</include>
                </includes>
            </resource>
            <resource>
                <directory>src/main/resources</directory>
            </resource>
        </resources>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <id>compile-scala</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>compile</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
> See for yourself: iris.pmml.txt
Hello, I found that the generated PMML file is missing content; it contains only the "species" and "petal_length" columns:
<MiningSchema>
    <MiningField name="species" usageType="target"/>
    <MiningField name="petal_length"/>
</MiningSchema>
But the data I entered has sepal_length, sepal_width, petal_length, petal_width, species columns.
> I found that the content of the generated pmml file was missing, only the "species" and "petal_length" columns:
The PMML representation of the model is correct - the prediction is made based on one input feature only ("petal_length"); the other three input features ("petal_width", "sepal_length" and "sepal_width") do not participate in the "decisioning process" in any way, and are therefore pruned.
There is no point in keeping around unused input features.
If you use this PMML document for prediction, then you'll see that the predicted classes and class probability distributions match 100% between PMML and Apache Spark.
> The PMML representation of the model is correct - the prediction is made based on one input feature only ("petal_length")
If you increase the complexity of the model, more input features will be used in the decisioning process, and they will also show up in the PMML representation. For example, replace DecisionTreeClassifier with RandomForestClassifier(n_estimators = 7).
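In the Apache Spark ML API used in the example above, that swap would look roughly like this. Note that n_estimators is the scikit-learn spelling; Spark's equivalent parameter is numTrees. (A sketch, reusing the rFormula stage from the earlier example.)

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

// Drop-in replacement for the DecisionTreeClassifier stage in the example pipeline
val rfClassifier = new RandomForestClassifier()
  .setNumTrees(7)
  .setLabelCol(rFormula.getLabelCol)
  .setFeaturesCol(rFormula.getFeaturesCol)
```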
> I found that the content of the generated pmml file was missing, only the "species" and "petal_length" columns:

> The PMML representation of the model is correct - the prediction is made based on one input feature only ("petal_length"); the other three input features ("petal_width", "sepal_length" and "sepal_width") do not participate in the "decisioning process" in any way, and are therefore pruned.
> There is no point in keeping around unused input features.
> If you use this PMML document for prediction, then you'll see that the predicted classes and class probability distributions match 100% between PMML and Apache Spark.
How should I deal with the error when I input the sepal_length, sepal_width and petal_width columns at inference time? The model cannot recognize sepal_length, sepal_width and petal_width.
> How to deal with the error when I input data sepal_length, sepal_width, petal_width columns when using inference?
What error? Do you have a stack trace or some other tangible evidence?
When making predictions, then the PMML engine should be querying its "inference context" only for the value of the "petal_length" input feature.
It should not ask for any other input feature value (because there are no other input features declared in the PMML document).
The PMML engine should not care about the contents of its "inference context" beyond this one single mapping. If your PMML engine is sensitive to the composition of its "inference context" (eg. whether the "sepal_length" feature value is present or not), then it is broken/stupid.
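The name-based lookup contract can be illustrated with a plain Python sketch. This stands in for a PMML engine; the single-split tree and its threshold are made up for illustration only:

```python
# Toy "inference context": a dict mapping feature names to values.
# A well-behaved PMML engine only looks up the features declared in the
# PMML document ("petal_length" here); extra keys are simply ignored.
def predict(context):
    # Hypothetical single-split decision tree (threshold is illustrative)
    petal_length = context["petal_length"]  # the ONLY lookup performed
    return "Iris-setosa" if petal_length <= 2.45 else "Iris-versicolor"

# Extra columns (sepal_length, sepal_width, petal_width) do no harm:
row = {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2}
print(predict(row))  # Iris-setosa

# Column order / position is irrelevant, too:
print(predict({"petal_width": 1.4, "petal_length": 4.7}))  # Iris-versicolor
```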
> How to deal with the error when I input data sepal_length, sepal_width, petal_width columns when using inference?

> What error? Do you have a stack trace or some other tangible evidence?
There is such a usage scenario: when a user inputs original data to train the model, the PMML file converted from the PipelineModel using that original data should be directly callable through the output interface. Does the interface called after each training run need to adjust the input features to adapt to the PMML document?
Traceback (most recent call last):
File "e:\aml-cube\algo-framework-2\aml-scheduler\job-template\job\automl\test.py", line 34, in <module>
pred = model.predict(df_1)
File "D:\anaconda3\envs\fs\lib\site-packages\pypmml\model.py", line 163, in predict
return self.call('predict', data)
File "D:\anaconda3\envs\fs\lib\site-packages\pypmml\base.py", line 134, in call
return call_java_func(getattr(self._java_model, name), *a)
File "D:\anaconda3\envs\fs\lib\site-packages\pypmml\base.py", line 41, in call_java_func
return _java2py(func(*args))
File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1314, in __call__
args_command, temp_args = self._build_args(*args)
File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1277, in _build_args
(new_args, temp_args) = self._get_args(args)
File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1264, in _get_args
temp_arg = converter.convert(arg, self.gateway_client)
File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_collections.py", line 511, in convert
java_list.add(element)
File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1314, in __call__
args_command, temp_args = self._build_args(*args)
File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1277, in _build_args
(new_args, temp_args) = self._get_args(args)
File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1264, in _get_args
temp_arg = converter.convert(arg, self.gateway_client)
File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_collections.py", line 523, in convert
java_map[key] = object[key]
File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_collections.py", line 82, in __setitem__
self.put(key, value)
File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1314, in __call__
args_command, temp_args = self._build_args(*args)
File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1277, in _build_args
(new_args, temp_args) = self._get_args(args)
File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1264, in _get_args
temp_arg = converter.convert(arg, self.gateway_client)
File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_collections.py", line 523, in convert
java_map[key] = object[key]
File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_collections.py", line 82, in __setitem__
self.put(key, value)
File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1314, in __call__
args_command, temp_args = self._build_args(*args)
File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1277, in _build_args
(new_args, temp_args) = self._get_args(args)
File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1264, in _get_args
temp_arg = converter.convert(arg, self.gateway_client)
File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_collections.py", line 511, in convert
java_list.add(element)
File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1314, in __call__
args_command, temp_args = self._build_args(*args)
File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1283, in _build_args
[get_command_part(arg, self.pool) for arg in new_args])
File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1283, in <listcomp>
[get_command_part(arg, self.pool) for arg in new_args])
File "D:\anaconda3\envs\fs\lib\site-packages\py4j\protocol.py", line 298, in get_command_part
command_part = REFERENCE_TYPE + parameter._get_object_id()
AttributeError: 'numpy.int32' object has no attribute '_get_object_id'
> The interface called after each training needs to adjust the input features to adapt to the pmml document?
PMML identifies features by name and name only.
So, the assumption is that your data container must support input feature identification by name. In the Python world, it means that pandas.DataFrame is suitable, but numpy.ndarray is not.
When performing input feature lookups by name, then it absolutely doesn't matter if the targeted input column has changed its physical position between workflow runs or not. For example, it is perfectly OK, that "petal_length" is column number one in training data frame, and column number five or seven in validation/testing data frame.
Right now, your issue is caused by the fact that you're using PyPMML package for making prediction, which is known not to respect this one fundamental assumption - it does not check column names; it feeds data into models using column indices.
Please switch from PyPMML to JPMML-Evaluator-Python, and everything will be okay.
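An unverified sketch of what the switch might look like, assuming the JPMML-Evaluator-Python package's make_evaluator/evaluateAll entry points (check the package's README for the exact API; it also requires a Java runtime):

```python
import pandas as pd
from jpmml_evaluator import make_evaluator  # requires a Java runtime

evaluator = make_evaluator("iris.pmml").verify()

# Columns are matched by NAME; position and extra columns don't matter
arguments_df = pd.DataFrame({
    "sepal_length": [5.1],  # unused by this model, safely ignored
    "petal_length": [1.4],
})
results_df = evaluator.evaluateAll(arguments_df)
print(results_df)
```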
Spark 3.0.3 with JPMML-SparkML 2.0.1 gives the error: Expected org.apache.spark.ml.Transformer subclass, got org.apache.spark.ml.feature.OneHotEncoder.
thanks!