jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

Cannot convert (partially-) unfitted pipelines #137

Closed xiaoSUM closed 6 months ago

xiaoSUM commented 6 months ago

Spark 3.0.3, JPMML-SparkML 2.0.1. I get the following error: Expected org.apache.spark.ml.Transformer subclass, got org.apache.spark.ml.feature.OneHotEncoder.

thanks!

vruusmann commented 6 months ago

Two questions to you:

  1. Can you paste the full exception stack trace here (full depth of the exception stack)?
  2. Why don't you upgrade from JPMML-SparkML 2.0.1 to 2.0.3 (this is the latest 2.0.X release version as of today), and see if anything changes.
vruusmann commented 6 months ago

Pay attention to the exception message!

It complains about the OneHotEncoder class, which is the initial/unfitted state of the transformer! As such, it does not contain any fitted state (for example, the mapping between class labels and their one-hot encoded indices).

When you fit your pipeline, the OneHotEncoder step is replaced with a OneHotEncoderModel step (note the "Model" suffix to the class name). This class is recognized and supported by the JPMML-SparkML library.
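
For illustration, here is a minimal sketch of that difference (the "color" column, the training DataFrame and the variable names are hypothetical, not taken from your code):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// A hypothetical categorical column "color" in the training DataFrame
val indexer = new StringIndexer().setInputCol("color").setOutputCol("colorIndex")
val encoder = new OneHotEncoder().setInputCols(Array("colorIndex")).setOutputCols(Array("colorVec"))
val pipeline = new Pipeline().setStages(Array(indexer, encoder))

// Unfitted pipeline: the stage is an org.apache.spark.ml.feature.OneHotEncoder
pipeline.getStages.foreach(stage => println(stage.getClass.getName))

// Fitted pipeline: the stage has become an org.apache.spark.ml.feature.OneHotEncoderModel,
// which is what the JPMML-SparkML library expects to see
val pipelineModel = pipeline.fit(trainingData) // trainingData: a DataFrame with a string-valued "color" column
pipelineModel.stages.foreach(stage => println(stage.getClass.getName))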

For a list of supported transformers, see this mappings file: https://github.com/jpmml/jpmml-sparkml/blob/2.0.1/pmml-sparkml/src/main/resources/META-INF/sparkml2pmml.properties

The mapping for OneHotEncoderModel is located on line 14 there.

vruusmann commented 6 months ago

TLDR: It appears to me that you are attempting to convert a pipeline object which contains one or more unfitted steps. For example, there is an unfitted OneHotEncoder step in there.

The JPMML-SparkML library assumes that all steps have been fitted.

Feel free to reopen this issue, if my initial instinct/reaction about the nature of this problem turns out to be incorrect.

xiaoSUM commented 6 months ago

Two questions to you:

  1. Can you paste the full exception stack trace here (full depth of the exception stack)?
  2. Why don't you upgrade from JPMML-SparkML 2.0.1 to 2.0.3 (this is the latest 2.0.X release version as of today), and see if anything changes.

Thank you for your guidance! Here is a simple reproduction of the problem.

pom.xml:

<dependency>
    <groupId>org.jpmml</groupId>
    <artifactId>pmml-sparkml</artifactId>
    <version>2.0.3</version>
</dependency>

data:

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor

scala code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.RFormula

object test {
    def main(args: Array[String]): Unit = {
        val spark = SparkSession
          .builder().master("local").getOrCreate()
      val irisData = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("E:\\code\\20230201git\\data_mining_arithmetic\\src\\main\\resources\\data\\iris.csv")
      val irisSchema = irisData.schema
      val rFormula = new RFormula().setFormula("species ~ .")
      val dtClassifier = new DecisionTreeClassifier().setLabelCol(rFormula.getLabelCol).setFeaturesCol(rFormula.getFeaturesCol)
      val pipeline = new Pipeline().setStages(Array(rFormula, dtClassifier))
      val pipelineModel = pipeline.fit(irisData)
      import org.jpmml.sparkml.PMMLBuilder
      val pmml = new PMMLBuilder(irisSchema, pipelineModel).build()
    }
}

error:

(screenshot of the exception stack trace; the exception is thrown from ConverterFactory.java line 192)

vruusmann commented 6 months ago

Very interesting! Especially the fact that your example Scala code doesn't contain any references to the OneHotEncoder in any shape or form (whether initial/unfitted or final/fitted).

Another interesting thing is that the exception is thrown on line 192 of ConverterFactory.java. This agrees with JPMML-SparkML version 2.0.0, but it doesn't agree with JPMML-SparkML versions 2.0.1 through 2.0.3, where it should be line 193.
https://github.com/jpmml/jpmml-sparkml/blob/2.0.0/pmml-sparkml/src/main/java/org/jpmml/sparkml/ConverterFactory.java#L192
https://github.com/jpmml/jpmml-sparkml/blob/2.0.1/pmml-sparkml/src/main/java/org/jpmml/sparkml/ConverterFactory.java#L192-L193

It still looks like some kind of JPMML-SparkML library configuration problem on your computer. Basically, your application classpath contains some custom sparkml2pmml.properties file(s), where there is a mapping from the org.apache.spark.ml.feature.OneHotEncoder transformer class to some transformer converter class.

The correct mapping would have org.apache.spark.ml.feature.OneHotEncoderModel on its left side (aka key).
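
For illustration, a valid mapping line in sparkml2pmml.properties would look roughly like this (the converter class name on the right-hand side is an assumption here; check the linked mappings file for the exact entry):

org.apache.spark.ml.feature.OneHotEncoderModel = org.jpmml.sparkml.feature.OneHotEncoderModelConverter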

If this were a fundamental JPMML-SparkML library configuration issue, it would make the library unusable everywhere, for everybody. Yet all its GitHub Actions CI integration tests are passing cleanly, and no other people are complaining about it. Which kind of suggests that the misconfiguration is specific to your computer/production environment.

Anyway, will try to run your example on my computer (with the latest JPMML-SparkML 2.0.3), and see what happens.

vruusmann commented 6 months ago

Anyway, will try to run your example on my computer (with the latest JPMML-SparkML 2.0.3), and see what happens.

Started my local Apache Spark instance like this (I'm pulling JPMML-SparkML version 2.0.3 from the official repository, there is no chance of local application classpath contamination):

$ export SPARK_HOME=/opt/spark-3.0.3
$ $SPARK_HOME/bin/spark-shell --packages org.jpmml:pmml-sparkml:2.0.3

Then, I copy-pasted your example code into the console. Everything was successful:

scala> import org.jpmml.sparkml.PMMLBuilder
import org.jpmml.sparkml.PMMLBuilder

scala> val pmml = new PMMLBuilder(irisSchema, pipelineModel).build()
pmml: org.dmg.pmml.PMML = org.dmg.pmml.PMML@6089c37c

Just to be sure, dumped the PMML document into a file in a local filesystem:

new PMMLBuilder(irisSchema, pipelineModel).buildFile(new java.io.File("iris.pmml.txt"))

See for yourself: iris.pmml.txt

vruusmann commented 6 months ago

Closing this issue again as "not reproducible" aka "everything works as advertised".

The issue happens because of a JPMML-SparkML library mis-configuration on your computer. Specifically, there must be some extra sparkml2pmml.properties file(s) on your system or application classpath, which contain an illegal mapping, where the left side is OneHotEncoder (when it should be OneHotEncoderModel).

TLDR: Run a file search on your computer, looking for files that are named "sparkml2pmml.properties" and that contain "OneHotEncoder" in them.
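
For example, something along these lines on a Linux-style system (a sketch; adjust the search root and tooling to your environment):

$ find / -name "sparkml2pmml.properties" 2>/dev/null | xargs grep -l "OneHotEncoder"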

xiaoSUM commented 6 months ago

Anyway, will try to run your example on my computer (with the latest JPMML-SparkML 2.0.3), and see what happens.

Started my local Apache Spark instance like this (I'm pulling JPMML-SparkML version 2.0.3 from the official repository, there is no chance of local application classpath contamination):

$ export SPARK_HOME=/opt/spark-3.0.3
$ $SPARK_HOME/bin/spark-shell --packages org.jpmml:pmml-sparkml:2.0.3

Then, I copy-pasted your example code into the console. Everything was successful:

scala> import org.jpmml.sparkml.PMMLBuilder
import org.jpmml.sparkml.PMMLBuilder

scala> val pmml = new PMMLBuilder(irisSchema, pipelineModel).build()
pmml: org.dmg.pmml.PMML = org.dmg.pmml.PMML@6089c37c

Just to be sure, dumped the PMML document into a file in a local filesystem:

new PMMLBuilder(irisSchema, pipelineModel).buildFile(new java.io.File("iris.pmml.txt"))

See for yourself: iris.pmml.txt

Thank you very much for your guidance. I set up a container environment following your method and encountered another problem. Can you give me some ideas? (java.lang.ClassNotFoundException: jakarta.xml.bind.JAXBException)

(screenshot of the java.lang.ClassNotFoundException: jakarta.xml.bind.JAXBException stack trace)

vruusmann commented 6 months ago

I ran a container environment according to your method and encountered other problems.

There are some dependencies missing (the JPMML-Model library and the libraries above it in the dependency chain, such as the JAXB artifacts).

It appears to me that Apache Spark's --packages command-line option is not deterministic. It exhibits different behaviour depending on the composition of the local Apache Maven repository - sometimes it downloads missing transitive dependencies, sometimes it doesn't.

On my computer, I have all transitive dependencies available in my local Apache Maven repository. Therefore I can use the --packages org.jpmml:pmml-sparkml:2.0.3 shortcut.

On your computer (new container environment) this local Apache Maven repository is empty. Therefore, Apache Spark downloads some JAR files (eg. the ones containing org.jpmml.sparkml classes), but does not download some other JAR files (eg. the ones containing jakarta.xml.bind classes).

vruusmann commented 6 months ago

When I run the $SPARK_HOME/bin/spark-shell --packages org.jpmml:pmml-sparkml:2.0.3 command, then I see the following dependency resolution log:

org.jpmml#pmml-sparkml added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-6a81ae05-cf2b-468f-ba1a-73463e866c87;1.0
        confs: [default]
        found org.jpmml#pmml-sparkml;2.0.3 in local-m2-cache
        found org.jpmml#pmml-converter;1.5.5 in local-m2-cache
        found org.jpmml#pmml-model-metro;1.6.4 in local-m2-cache
        found org.jpmml#pmml-model;1.6.4 in local-m2-cache
        found com.fasterxml.jackson.core#jackson-annotations;2.13.3 in local-m2-cache
        [2.13.3] com.fasterxml.jackson.core#jackson-annotations;[2.11.0, 2.13.3]
        found jakarta.xml.bind#jakarta.xml.bind-api;3.0.1 in local-m2-cache
        found org.glassfish.jaxb#jaxb-runtime;3.0.2 in local-m2-cache
        found com.sun.activation#jakarta.activation;2.0.1 in local-m2-cache
        found org.glassfish.jaxb#jaxb-core;3.0.2 in local-m2-cache
        found com.sun.istack#istack-commons-runtime;4.0.1 in local-m2-cache
        found com.google.guava#guava;32.1.1-jre in local-m2-cache
        found com.google.guava#failureaccess;1.0.1 in local-m2-cache
        found com.google.guava#listenablefuture;9999.0-empty-to-avoid-conflict-with-guava in local-m2-cache
        found com.google.code.findbugs#jsr305;3.0.2 in local-m2-cache
        found org.checkerframework#checker-qual;3.33.0 in local-m2-cache
        found com.google.errorprone#error_prone_annotations;2.18.0 in local-m2-cache
        found com.google.j2objc#j2objc-annotations;2.8 in local-m2-cache
        found org.jpmml#pmml-converter-testing;1.5.5 in local-m2-cache
:: resolution report :: resolve 3286ms :: artifacts dl 20ms
        :: modules in use:
        com.fasterxml.jackson.core#jackson-annotations;2.13.3 from local-m2-cache in [default]
        com.google.code.findbugs#jsr305;3.0.2 from local-m2-cache in [default]
        com.google.errorprone#error_prone_annotations;2.18.0 from local-m2-cache in [default]
        com.google.guava#failureaccess;1.0.1 from local-m2-cache in [default]
        com.google.guava#guava;32.1.1-jre from local-m2-cache in [default]
        com.google.guava#listenablefuture;9999.0-empty-to-avoid-conflict-with-guava from local-m2-cache in [default]
        com.google.j2objc#j2objc-annotations;2.8 from local-m2-cache in [default]
        com.sun.activation#jakarta.activation;2.0.1 from local-m2-cache in [default]
        com.sun.istack#istack-commons-runtime;4.0.1 from local-m2-cache in [default]
        jakarta.xml.bind#jakarta.xml.bind-api;3.0.1 from local-m2-cache in [default]
        org.checkerframework#checker-qual;3.33.0 from local-m2-cache in [default]
        org.glassfish.jaxb#jaxb-core;3.0.2 from local-m2-cache in [default]
        org.glassfish.jaxb#jaxb-runtime;3.0.2 from local-m2-cache in [default]
        org.jpmml#pmml-converter;1.5.5 from local-m2-cache in [default]
        org.jpmml#pmml-converter-testing;1.5.5 from local-m2-cache in [default]
        org.jpmml#pmml-model;1.6.4 from local-m2-cache in [default]
        org.jpmml#pmml-model-metro;1.6.4 from local-m2-cache in [default]
        org.jpmml#pmml-sparkml;2.0.3 from local-m2-cache in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   18  |   1   |   0   |   0   ||   18  |   0   |

Compare your dependency resolution log against it! Do you see any differences? Which artifacts are reported as "not found", "cannot be downloaded", etc.?

In your case, the missing jakarta.xml.bind.JAXBException class is probably located inside the org.glassfish.jaxb:jaxb-core:3.0.2 or org.glassfish.jaxb:jaxb-runtime:3.0.2 artifacts. If you append both of them manually to the --packages option, the problem should go away.
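
For example (a sketch, reusing the artifact coordinates from the resolution log above):

$ $SPARK_HOME/bin/spark-shell --packages org.jpmml:pmml-sparkml:2.0.3,org.glassfish.jaxb:jaxb-core:3.0.2,org.glassfish.jaxb:jaxb-runtime:3.0.2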

TLDR: It's an Apache Spark package management issue. I can't fix this.

vruusmann commented 6 months ago

It appears to me that Apache Spark's --packages command-line option is not deterministic. It exhibits different behaviour depending on the composition of the local Apache Maven repository...

The simplest way to ensure that the local Apache Maven repository contains all the required dependencies is to build the JPMML-SparkML library locally from a source checkout:

$ git clone https://github.com/jpmml/jpmml-sparkml.git
$ cd jpmml-sparkml
$ git checkout 2.0.3
$ mvn clean install

After that, the $SPARK_HOME/bin/spark-shell --packages org.jpmml:pmml-sparkml:2.0.3 command should succeed as-is.

xiaoSUM commented 6 months ago

There are some dependencies missing (JPMML-Model library, and above).

Thank you very much for your idea. Looking at the https://github.com/jpmml/jpmml-sparkml/blob/2.0.X/pom.xml file, I found the dependency problem. After excluding org.glassfish.jaxb from spark-mllib 3.0.3, val pmml = new PMMLBuilder(irisSchema, pipelineModel).build() runs normally.

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.12</artifactId>
            <version>3.0.3</version>
            <exclusions>
                <exclusion>
                    <groupId>org.glassfish.jaxb</groupId>
                    <artifactId>*</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
vruusmann commented 6 months ago

Looking at the https://github.com/jpmml/jpmml-sparkml/blob/2.0.X/pom.xml file, I found the dependency problem. After excluding org.glassfish.jaxb from spark-mllib 3.0.3, it runs normally.

Interesting observation - do I understand correctly, that the latest classpath problem (https://github.com/jpmml/jpmml-sparkml/issues/137#issuecomment-1999193096) went away after tweaking the default pom.xml file? Specifically, did you delete the exclusion tag?

This exclusion was put there because Apache Spark bundles a not-so-up-to-date JAXB version. My idea was to exclude that outdated version, and bring in the very latest version via the org.jpmml:pmml-model-metro dependency chain.

It could be the case that the --packages command-line option does not pay attention to this "forced JAXB update", and proceeds to use its own bundled outdated version, which then remains incomplete/conflicting, leading to the classpath error.

The safest option would be to replace the --packages command-line option with the --jars command-line option, and provide there a filesystem path to the pre-built pmml-sparkml-example-executable-${version}.jar file (available under the JPMML-SparkML releases section).
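
For example (a sketch; the actual filesystem path and release file name depend on what you download):

$ $SPARK_HOME/bin/spark-shell --jars /path/to/pmml-sparkml-example-executable-2.0.3.jar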

vruusmann commented 6 months ago

@xiaoSUM Most importantly - after you got the classpath issue sorted out, does the conversion succeed now? Did you find out, where was the invalid OneHotEncoder mapping coming in from?

xiaoSUM commented 6 months ago

@xiaoSUM Most importantly - after you got the classpath issue sorted out, does the conversion succeed now? Did you find out, where was the invalid OneHotEncoder mapping coming in from?

Thank you for providing a safer solution.

1. I deleted all the dependencies in my local Maven repository; the invalid OneHotEncoder problem is gone.

2. Then I updated the pom file as follows. Now the conversion is successful.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.eye</groupId>
    <artifactId>data_mining_arithmetic</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>
    <name>data mining</name>
    <description>data mining arithmetic</description>
    <properties>
        <java.version>1.8</java.version>
        <scala.version>2.12.12</scala.version>
        <scala.tools.version>2.12</scala.tools.version>
        <spark.version>3.0.3</spark.version>
        <mysql.connector.version>8.0.16</mysql.connector.version>
        <postgresql.version>42.2.5</postgresql.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.0.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.12</artifactId>
            <version>3.0.3</version>
            <exclusions>
                <exclusion>
                    <groupId>org.glassfish.jaxb</groupId>
                    <artifactId>*</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.jpmml</groupId>
            <artifactId>pmml-sparkml</artifactId>
            <version>2.0.3</version>
        </dependency>
    </dependencies>

    <build>
        <resources>
            <resource>
                <directory>src/main/java</directory>
                <includes>
                    <include>**/*.xml</include>
                </includes>
            </resource>
            <resource>
                <directory>src/main/resources</directory>
            </resource>
        </resources>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <id>compile-scala</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>compile</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
xiaoSUM commented 6 months ago

See for yourself: iris.pmml.txt

Hello, I found that the generated PMML file is missing content; it contains only the "species" and "petal_length" columns:

        <MiningSchema>
            <MiningField name="species" usageType="target"/>
            <MiningField name="petal_length"/>
        </MiningSchema>

But the data I entered has the sepal_length, sepal_width, petal_length, petal_width and species columns.

vruusmann commented 6 months ago

I found that the generated PMML file is missing content; it contains only the "species" and "petal_length" columns:

The PMML representation of the model is correct - the prediction is made based on one input feature only ("petal_length"); the other three input features ("petal_width", "sepal_length" and "sepal_width") do not participate in the "decisioning process" in any way, and are therefore pruned.

There is no point in keeping around unused input features.

If you use this PMML document for prediction, then you'll see that the predicted classes and class probability distributions match 100% between PMML and Apache Spark.

vruusmann commented 6 months ago

The PMML representation of the model is correct - the prediction is made based on one input feature only ("petal_length")

If you increase the complexity of the model, more input features will be used in the decisioning process, and they will also show up in the PMML representation.

For example, replace the DecisionTreeClassifier with a RandomForestClassifier with several trees (e.g. numTrees = 7).
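
A minimal sketch of that swap, reusing the variable names from your code above:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.jpmml.sparkml.PMMLBuilder

// A forest of 7 trees will typically split on more than one input feature,
// so more MiningField elements should show up in the generated PMML document
val rfClassifier = new RandomForestClassifier()
  .setNumTrees(7)
  .setLabelCol(rFormula.getLabelCol)
  .setFeaturesCol(rFormula.getFeaturesCol)
val rfPipelineModel = new Pipeline().setStages(Array(rFormula, rfClassifier)).fit(irisData)
val rfPmml = new PMMLBuilder(irisSchema, rfPipelineModel).build()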

xiaoSUM commented 6 months ago

I found that the generated PMML file is missing content; it contains only the "species" and "petal_length" columns:

The PMML representation of the model is correct - the prediction is made based on one input feature only ("petal_length"); the other three input features ("petal_width", "sepal_length" and "sepal_width") do not participate in the "decisioning process" in any way, and are therefore pruned.

There is no point in keeping around unused input features.

If you use this PMML document for prediction, then you'll see that the predicted classes and class probability distributions match 100% between PMML and Apache Spark.

How do I deal with the error when I provide the sepal_length, sepal_width and petal_width columns as input at inference time? The model does not recognize sepal_length, sepal_width and petal_width.

vruusmann commented 6 months ago

How do I deal with the error when I provide the sepal_length, sepal_width and petal_width columns as input at inference time?

What error? Do you have a stack trace or some other tangible evidence?

When making predictions, the PMML engine should be querying its "inference context" only for the value of the "petal_length" input feature.

It should not ask for any other input feature value (because there are no other input features declared in the PMML document).

The PMML engine should not care about the contents of its "inference context" beyond this one single mapping. If your PMML engine is sensitive to the composition of its "inference context" (eg. whether the "sepal_length" feature value is present or not), then it is broken/stupid.

xiaoSUM commented 6 months ago

How do I deal with the error when I provide the sepal_length, sepal_width and petal_width columns as input at inference time?

What error? Do you have a stack trace or some other tangible evidence?

Here is the usage scenario: the user provides the original data to train the model, and the PMML file converted from the PipelineModel (fitted on that original data) should be directly callable from the serving interface. Does the interface really have to adjust the input features after each training run to adapt to the PMML document?

Traceback (most recent call last):
  File "e:\aml-cube\algo-framework-2\aml-scheduler\job-template\job\automl\test.py", line 34, in <module>
    pred = model.predict(df_1)
  File "D:\anaconda3\envs\fs\lib\site-packages\pypmml\model.py", line 163, in predict
    return self.call('predict', data)
  File "D:\anaconda3\envs\fs\lib\site-packages\pypmml\base.py", line 134, in call
    return call_java_func(getattr(self._java_model, name), *a)
  File "D:\anaconda3\envs\fs\lib\site-packages\pypmml\base.py", line 41, in call_java_func
    return _java2py(func(*args))
  File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1314, in __call__
    args_command, temp_args = self._build_args(*args)
  File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1277, in _build_args
    (new_args, temp_args) = self._get_args(args)
  File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1264, in _get_args
    temp_arg = converter.convert(arg, self.gateway_client)
  File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_collections.py", line 511, in convert
    java_list.add(element)
  File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1314, in __call__
    args_command, temp_args = self._build_args(*args)
  File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1277, in _build_args
    (new_args, temp_args) = self._get_args(args)
  File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1264, in _get_args
    temp_arg = converter.convert(arg, self.gateway_client)
  File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_collections.py", line 523, in convert
    java_map[key] = object[key]
  File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_collections.py", line 82, in __setitem__
    self.put(key, value)
  File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1314, in __call__
    args_command, temp_args = self._build_args(*args)
  File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1277, in _build_args
    (new_args, temp_args) = self._get_args(args)
  File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1264, in _get_args
    temp_arg = converter.convert(arg, self.gateway_client)
  File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_collections.py", line 523, in convert
    java_map[key] = object[key]
  File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_collections.py", line 82, in __setitem__
    self.put(key, value)
  File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1314, in __call__
    args_command, temp_args = self._build_args(*args)
  File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1277, in _build_args
    (new_args, temp_args) = self._get_args(args)
  File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1264, in _get_args
    temp_arg = converter.convert(arg, self.gateway_client)
  File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_collections.py", line 511, in convert
    java_list.add(element)
  File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1314, in __call__
    args_command, temp_args = self._build_args(*args)
  File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1283, in _build_args
    [get_command_part(arg, self.pool) for arg in new_args])
  File "D:\anaconda3\envs\fs\lib\site-packages\py4j\java_gateway.py", line 1283, in <listcomp>
    [get_command_part(arg, self.pool) for arg in new_args])
  File "D:\anaconda3\envs\fs\lib\site-packages\py4j\protocol.py", line 298, in get_command_part
    command_part = REFERENCE_TYPE + parameter._get_object_id()
AttributeError: 'numpy.int32' object has no attribute '_get_object_id'
vruusmann commented 6 months ago

Does the interface really have to adjust the input features after each training run to adapt to the PMML document?

PMML identifies features by name and name only.

So, the assumption is that your data container must support input feature identification by name. In the Python world, this means that pandas.DataFrame is suitable, but numpy.ndarray is not.

When input features are looked up by name, it does not matter at all whether the targeted input column has changed its physical position between workflow runs. For example, it is perfectly OK for "petal_length" to be column number one in the training data frame, and column number five or seven in the validation/testing data frame.

Right now, your issue is caused by the fact that you are using the PyPMML package for making predictions, which is known not to respect this fundamental assumption: it does not check column names; it feeds data into models by column index.

Please switch from PyPMML to JPMML-Evaluator-Python, and everything will be okay.