jpmml / jpmml-evaluator-spark

PMML evaluator library for the Apache Spark cluster computing system (http://spark.apache.org/)
GNU Affero General Public License v3.0

org.jpmml.evaluator.UnsupportedFeatureException: TreeModel #15

Closed akari0725 closed 6 years ago

akari0725 commented 6 years ago

Hello!

My colleague gave me a PMML file like this:

<?xml version="1.0" encoding="UTF-8"?>
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header copyright="rosy">
    <Application name="KNIME" version="2.11.0"/>
  </Header>
  <DataDictionary numberOfFields="21">
...
  <TreeModel modelName="DecisionTree" functionName="classification" splitCharacteristic="binarySplit" missingValueStrategy="lastPrediction" noTrueChildStrategy="returnNullPrediction">
    <MiningSchema>
      <MiningField name="VMail Message" invalidValueTreatment="asIs"/>
      <MiningField name="Day Mins" invalidValueTreatment="asIs"/>
...

I built a transformer:

        InputStream is = MyModel.class.getResourceAsStream("/model.pmml");
        Evaluator evaluator = EvaluatorUtil.createEvaluator(is);

        TransformerBuilder modelBuilder = new TransformerBuilder(evaluator)
                .withOutputCols()
                .withTargetCols()
                .exploded(false);

        Transformer transformer = modelBuilder.build();

When I run it in Spark local mode there is no problem: the job appends the prediction column to the DataFrame.

But when I run it on an AWS EMR cluster with 1 master node and 2 worker nodes, the Java code cannot load the PMML file:

Exception in thread "main" org.jpmml.evaluator.UnsupportedFeatureException: TreeModel
    at org.jpmml.evaluator.ModelEvaluatorFactory.createModelEvaluator(ModelEvaluatorFactory.java:134)
    at org.jpmml.evaluator.ModelEvaluatorFactory.newModelEvaluator(ModelEvaluatorFactory.java:74)
    at org.jpmml.evaluator.ModelEvaluatorFactory.newModelEvaluator(ModelEvaluatorFactory.java:70)
    at org.jpmml.evaluator.spark.EvaluatorUtil.createEvaluator(EvaluatorUtil.java:63)
    at javaCode.mlModel.MyModel.getClassifier(MyModel.java:21)
    at testpackage.test_s3$.main(test_s3.scala:17)

I don't know why.

My environment: Java 7, Scala 2.11.8, Spark 2.0.2 (AWS EMR 5.2.1)

        <dependency>
            <groupId>org.jpmml</groupId>
            <artifactId>pmml-model</artifactId>
            <version>1.3.8</version>
        </dependency>
        <dependency>
            <groupId>org.jpmml</groupId>
            <artifactId>pmml-evaluator</artifactId>
            <version>1.3.10</version>
        </dependency>
        <dependency>
            <groupId>org.jpmml</groupId>
            <artifactId>jpmml-evaluator-spark</artifactId>
            <version>1.1-SNAPSHOT</version>
        </dependency>

Last time, in order to run a PMML 4.3 file exported from sklearn, I used jpmml-evaluator-spark 1.1. This time the PMML version is 4.2, but when I use jpmml-evaluator-spark 1.0.0 I get the same problem.

Please forgive my English...

Thank you!

vruusmann commented 6 years ago

This looks like a versioning/shading problem: the method ModelEvaluatorFactory#createModelEvaluator(...) tries to identify the model type using instanceof checks, and your model object does not match any of them. Specifically, decision tree models should match the check if(model instanceof org.dmg.pmml.tree.TreeModel), but the Java class of your model object is something different: https://github.com/jpmml/jpmml-evaluator/blob/master/pmml-evaluator/src/main/java/org/jpmml/evaluator/ModelEvaluatorFactory.java#L126-L128

If you're mixing different JPMML-Model/JPMML-Evaluator versions, then it may be the case that your model object is actually of type org.dmg.pmml.TreeModel (note the missing tree package component). However, if you've performed bad class name shading, then it may be the case that it is of type org.shaded.pmml.tree.TreeModel (note the added shaded package component).

To figure out what's going on, simply print the class name of your model object:

System.out.println(model.getClass().getName());
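A slightly extended sketch of the same check, which also reports the JAR or directory the class was loaded from. It uses only the JDK; `ClassOrigin` and `describe` are hypothetical names for illustration, not part of any JPMML API, and the object passed in `main` is a stand-in for the real model object:

```java
import java.net.URL;
import java.security.CodeSource;

public class ClassOrigin {

    /** Returns "fully.qualified.Name <- location" for the class of the given object. */
    public static String describe(Object object) {
        Class<?> clazz = object.getClass();
        CodeSource codeSource = clazz.getProtectionDomain().getCodeSource();
        // JDK bootstrap classes typically have no code source
        URL location = (codeSource != null) ? codeSource.getLocation() : null;
        return clazz.getName() + " <- " + (location != null ? location.toString() : "(bootstrap or unknown)");
    }

    public static void main(String[] args) {
        // Stand-in object; in the application, pass the unmarshalled model object instead
        System.out.println(describe(new ClassOrigin()));
    }
}
```

Running this on both the local machine and the cluster driver should make it visible when the two environments load the model class from different JAR files.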

Additionally, when using the org.jpmml:jpmml-evaluator-spark dependency in your project, then you don't need to manually declare org.jpmml:pmml-model and/or org.jpmml:pmml-evaluator dependencies in your pom.xml file. This "parent dependency" will automatically bring in all required "child dependencies".

akari0725 commented 6 years ago

I tried not declaring org.jpmml:pmml-model and org.jpmml:pmml-evaluator, but then it does not run in Spark local mode.

I obtained the org.jpmml:jpmml-evaluator-spark 1.1 dependency by downloading the source code and running mvn clean install into my local .m2 repository.

I tried copying your snippet into a local Java class, and the getName() result is org.dmg.pmml.tree.TreeModel (in Spark local mode).

I don't know which Maven dependencies I need to declare; pmml-model 1.3.8 and pmml-evaluator are the latest versions on maven.apache.org.

I also don't know why it runs successfully locally but does not work on the cluster. I put my PMML file into the resources folder and load it through the ClassLoader, then send the Java object (Transformer) to the worker nodes with the sparkSession.sparkContext.broadcast method.

Thank you!

DataAndModel.zip

akari0725 commented 6 years ago

Another question,

I tried running another PMML model, exported from sklearn. It worked fine in local mode, but on the cluster:

Exception in thread "main" java.lang.IllegalArgumentException: http://www.dmg.org/PMML-4_3
    at org.jpmml.schema.Version.forNamespaceURI(Version.java:61)
    at org.jpmml.model.PMMLFilter.updateSource(PMMLFilter.java:121)
    at org.jpmml.model.PMMLFilter.startPrefixMapping(PMMLFilter.java:43)
    at org.apache.xerces.parsers.AbstractSAXParser.startNamespaceMapping(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
    at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
    at org.apache.xerces.impl.XMLNSDocumentScannerImpl$NSContentDispatcher.scanRootElementHook(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.xml.sax.helpers.XMLFilterImpl.parse(XMLFilterImpl.java:357)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:243)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:214)
    at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:140)
    at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:123)
    at org.jpmml.model.JAXBUtil.unmarshal(JAXBUtil.java:78)
    at org.jpmml.model.JAXBUtil.unmarshalPMML(JAXBUtil.java:64)
    at org.jpmml.evaluator.spark.EvaluatorUtil.createEvaluator(EvaluatorUtil.java:55)
    at javaCode.mlModel.LRClassifier_s3.getClassifier(LRClassifier_s3.java:21)

I wonder whether this is the same problem in cluster mode.

Could you show me a demo that imports a KNIME/sklearn PMML file and runs on a Spark cluster (Scala)?

vruusmann commented 6 years ago

> I tried copying your snippet into a local Java class, and the getName() result is org.dmg.pmml.tree.TreeModel (in Spark local mode).

Your application is using JPMML-Model 1.3.8 in local mode (in that case the TreeModel element is mapped to the org.dmg.pmml.tree.TreeModel class), but Apache Spark ML's built-in JPMML-Model 1.2.15 in cluster mode (mapped to the org.dmg.pmml.TreeModel class).
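One way to see which of the two class layouts a given environment exposes is to ask the class loader for the class files directly, without loading them. This is a minimal sketch using only the JDK; `TreeModelProbe` and `locate` are hypothetical names, and the two class names are the ones mentioned in this thread:

```java
import java.net.URL;

public class TreeModelProbe {

    /** Returns the URL of the class file as seen by the given class loader, or null if it is absent. */
    public static URL locate(String className, ClassLoader classLoader) {
        String resource = className.replace('.', '/') + ".class";
        return classLoader.getResource(resource);
    }

    public static void main(String[] args) {
        ClassLoader classLoader = Thread.currentThread().getContextClassLoader();
        if (classLoader == null) {
            classLoader = ClassLoader.getSystemClassLoader();
        }
        String[] candidates = {
            "org.dmg.pmml.tree.TreeModel", // mapping reported for JPMML-Model 1.3.8 in this thread
            "org.dmg.pmml.TreeModel"       // mapping reported for JPMML-Model 1.2.15 in this thread
        };
        for (String candidate : candidates) {
            // A non-null URL usually names the JAR file that contains the class
            System.out.println(candidate + " -> " + locate(candidate, classLoader));
        }
    }
}
```

If both names resolve on the cluster, both library versions are on the classpath there, which would be consistent with the conflict described above.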

> I tried not declaring org.jpmml:pmml-model and org.jpmml:pmml-evaluator, but then it does not run in Spark local mode.

What's the exception/problem then?

You can solve classpath conflicts by "shading" (ie. renaming and/or relocating) your application classes. See the example here: https://github.com/jpmml/jpmml-sparkml#run-time-conflict-resolution

Personally, I would suggest that you remove those legacy JPMML-Model library JAR files from your local and cluster environments altogether: https://github.com/jpmml/jpmml-sparkml#modifying-apache-spark-installation

> Exception in thread "main" java.lang.IllegalArgumentException: http://www.dmg.org/PMML-4_3 at org.jpmml.schema.Version.forNamespaceURI(Version.java:61)

This is further proof that your cluster "sees" the legacy JPMML-Model version, which supports PMML schema versions 3.0 through 4.2, but not 4.3.
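To check which schema version a PMML file declares before handing it to the library, the namespace of the root element can be read with the JDK's StAX parser. This is a minimal sketch and not part of JPMML; `PmmlVersionProbe` and `schemaVersion` are hypothetical names, and the sketch assumes the namespace follows the http://www.dmg.org/PMML-X_Y pattern seen in this thread:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class PmmlVersionProbe {

    private static final String NAMESPACE_PREFIX = "http://www.dmg.org/PMML-";

    /** Returns the version encoded in the root element's namespace URI, e.g. "4.3". */
    public static String schemaVersion(InputStream is) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The probe only needs the root element, so DTD processing is switched off
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(is);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        String namespaceURI = reader.getNamespaceURI();
                        if (namespaceURI == null || !namespaceURI.startsWith(NAMESPACE_PREFIX)) {
                            throw new IllegalArgumentException("Not a PMML namespace: " + namespaceURI);
                        }
                        // "4_3" becomes "4.3"
                        return namespaceURI.substring(NAMESPACE_PREFIX.length()).replace('_', '.');
                    }
                }
                throw new IllegalArgumentException("No root element found");
            } finally {
                reader.close();
            }
        } catch (XMLStreamException xse) {
            throw new IllegalStateException("Could not parse the document", xse);
        }
    }

    public static void main(String[] args) {
        // Inline stand-in for a real file; in the application, pass the resource stream instead
        String xml = "<PMML version=\"4.3\" xmlns=\"http://www.dmg.org/PMML-4_3\"/>";
        InputStream is = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(schemaVersion(is)); // prints 4.3
    }
}
```

A file that reports 4.3 here but fails with the IllegalArgumentException above points at the library version on the classpath rather than at the file.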

akari0725 commented 6 years ago

I modified my pom file:

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
            <exclusions>
                <exclusion>
                    <groupId>org.jpmml</groupId>
                    <artifactId>pmml-model</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

and

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.3</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <filters>
                    <filter>
                        <artifact>*:*</artifact>
                        <excludes>
                            <exclude>META-INF/*.SF</exclude>
                            <exclude>META-INF/*.DSA</exclude>
                            <exclude>META-INF/*.RSA</exclude>
                        </excludes>
                    </filter>
                </filters>
                <transformers>
                    <transformer
                            implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                        <resource>META-INF/spring.handlers</resource>
                    </transformer>
                    <transformer
                            implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                        <mainClass>com.fxc.rpc.impl.member.MemberProvider</mainClass>
                    </transformer>
                    <transformer
                            implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                        <resource>META-INF/spring.schemas</resource>
                    </transformer>
                </transformers>
                <relocations>
                    <relocation>
                        <pattern>org.dmg.pmml</pattern>
                        <shadedPattern>org.shaded.dmg.pmml</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>org.jpmml</pattern>
                        <shadedPattern>org.shaded.jpmml</shadedPattern>
                    </relocation>
                </relocations>
            </configuration>
        </execution>
    </executions>
</plugin>

and deleted:

        <!-- https://mvnrepository.com/artifact/org.jpmml/pmml-model -->
        <dependency>
            <groupId>org.jpmml</groupId>
            <artifactId>pmml-model</artifactId>
            <version>1.3.8</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.jpmml/pmml-evaluator -->
        <dependency>
            <groupId>org.jpmml</groupId>
            <artifactId>pmml-evaluator</artifactId>
            <version>1.3.10</version>
        </dependency>

This time it ran successfully in cluster mode!

I did not delete these two files on the cluster:

$SPARK_HOME/jars/pmml-model-1.2.15.jar
$SPARK_HOME/jars/pmml-schema-1.2.15.jar

However, whether or not I add these two dependencies, it cannot run in local mode:

        <!-- https://mvnrepository.com/artifact/org.jpmml/pmml-model -->
        <dependency>
            <groupId>org.jpmml</groupId>
            <artifactId>pmml-model</artifactId>
            <version>1.3.8</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.jpmml/pmml-evaluator -->
        <dependency>
            <groupId>org.jpmml</groupId>
            <artifactId>pmml-evaluator</artifactId>
            <version>1.3.10</version>
        </dependency>

Error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/ml/Transformer
    at javaCode.mlModel.BaixiaoeModel.getClassifier

(Maybe this is proof that I don't need to add the two dependencies.)

I tried removing the two .jar files from my local $SPARK_HOME, and it does not work either.

Could you tell me how I can run this program successfully in both local and cluster mode?

Thanks for your answers and your patience with my bad English!