
JPMML-Evaluator-Spark

PMML evaluator library for the Apache Spark cluster computing system (https://spark.apache.org/).

Features

Prerequisites

Installation

The JPMML-Evaluator-Spark library JAR file (together with the accompanying Java source and Javadoc JAR files) is released via the Maven Central Repository.

The current version is 1.3.0 (2 April, 2022).

<dependency>
    <groupId>org.jpmml</groupId>
    <artifactId>jpmml-evaluator-spark</artifactId>
    <version>1.3.0</version>
</dependency>

Usage

Building a generic transformer based on a PMML byte stream:

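import java.io.InputStream;

import org.apache.spark.ml.Transformer;
import org.jpmml.evaluator.Evaluator;
import org.jpmml.evaluator.EvaluatorBuilder;
import org.jpmml.evaluator.LoadingModelEvaluatorBuilder;
import org.jpmml.evaluator.spark.TransformerBuilder;

InputStream pmmlIs = ...;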

EvaluatorBuilder evaluatorBuilder = new LoadingModelEvaluatorBuilder()
    .setLocatable(false)
    .load(pmmlIs);

Evaluator evaluator = evaluatorBuilder.build();

// Performing a self-check (doubles as a warm-up)
evaluator.verify();

TransformerBuilder pmmlTransformerBuilder = new TransformerBuilder(evaluator)
    .withTargetCols()
    .withOutputCols()
    .exploded(false);

Transformer pmmlTransformer = pmmlTransformerBuilder.build();

Building an Apache Spark ML-style regressor when the PMML document is known to contain a regression model (e.g. the auto-mpg dataset):

TransformerBuilder pmmlTransformerBuilder = new TransformerBuilder(evaluator)
    .withLabelCol("MPG") // Double column
    .exploded(true);

Building an Apache Spark ML-style classifier when the PMML document is known to contain a classification model (e.g. the iris-species dataset):

TransformerBuilder pmmlTransformerBuilder = new TransformerBuilder(evaluator)
    .withLabelCol("Species") // String column
    .withProbabilityCol("Species_probability", Arrays.asList("setosa", "versicolor", "virginica")) // Vector column
    .exploded(true);
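
The label list passed to withProbabilityCol presumably fixes the order of the elements in the resulting vector column. The following plain-Java sketch illustrates that idea only; it does not use the library, and the class ProbabilityVectorSketch, the helper toProbabilityArray, the probability values, and the 0.0 fallback for a missing label are all assumptions made for illustration:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ProbabilityVectorSketch {

    // Hypothetical helper (not part of JPMML-Evaluator-Spark): arranges
    // per-class probabilities in the order given by the label list.
    public static double[] toProbabilityArray(Map<String, Double> probabilities, List<String> labels) {
        double[] result = new double[labels.size()];
        for (int i = 0; i < labels.size(); i++) {
            // Assumption: a label missing from the prediction counts as 0.0
            result[i] = probabilities.getOrDefault(labels.get(i), 0.0);
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative prediction; the map order differs from the label order on purpose
        Map<String, Double> prediction = new LinkedHashMap<>();
        prediction.put("virginica", 0.1);
        prediction.put("setosa", 0.7);
        prediction.put("versicolor", 0.2);

        List<String> labels = Arrays.asList("setosa", "versicolor", "virginica");

        // prints [0.7, 0.2, 0.1]
        System.out.println(Arrays.toString(toProbabilityArray(prediction, labels)));
    }
}
```

Under this reading, element 0 of "Species_probability" corresponds to "setosa", element 1 to "versicolor" and element 2 to "virginica", whatever order the model itself reports them in.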

Scoring data:

Dataset<?> inputDs = ...;

Dataset<?> resultDs = pmmlTransformer.transform(inputDs);

In the default (non-exploded) mode, the transformation appends a single intermediary "pmml" struct column to the data frame, which holds all the requested result columns:

root
 |-- Sepal_Length: double (nullable = true)
 |-- Sepal_Width: double (nullable = true)
 |-- Petal_Length: double (nullable = true)
 |-- Petal_Width: double (nullable = true)
 |-- pmml: struct (nullable = true)
 |    |-- Species: string (nullable = false)
 |    |-- Species_probability: vector (nullable = false)

In exploded mode, the transformation appends each requested result column directly to the data frame as a top-level column:

root
 |-- Sepal_Length: double (nullable = true)
 |-- Sepal_Width: double (nullable = true)
 |-- Petal_Length: double (nullable = true)
 |-- Petal_Width: double (nullable = true)
 |-- Species: string (nullable = false)
 |-- Species_probability: vector (nullable = false)
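
The difference between the two layouts can be modelled in plain Java, with a row represented as an ordered map. This is only a conceptual sketch, not the library's implementation: the class ResultLayoutSketch, its methods nested and exploded, and the sample values are hypothetical, and no Spark types are involved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ResultLayoutSketch {

    // Default mode: all results are grouped under one struct-like "pmml" entry.
    public static Map<String, Object> nested(Map<String, Object> input, Map<String, Object> results) {
        Map<String, Object> row = new LinkedHashMap<>(input);
        row.put("pmml", new LinkedHashMap<>(results));
        return row;
    }

    // Exploded mode: each result becomes a top-level entry of the row.
    public static Map<String, Object> exploded(Map<String, Object> input, Map<String, Object> results) {
        Map<String, Object> row = new LinkedHashMap<>(input);
        row.putAll(results);
        return row;
    }

    public static void main(String[] args) {
        // Illustrative input row (values are made up for the example)
        Map<String, Object> input = new LinkedHashMap<>();
        input.put("Sepal_Length", 5.1);
        input.put("Sepal_Width", 3.5);
        input.put("Petal_Length", 1.4);
        input.put("Petal_Width", 0.2);

        // Illustrative result columns
        Map<String, Object> results = new LinkedHashMap<>();
        results.put("Species", "setosa");
        results.put("Species_probability", new double[] {1.0, 0.0, 0.0});

        System.out.println(nested(input, results).keySet());
        System.out.println(exploded(input, results).keySet());
    }
}
```

The nested form keeps the result columns from colliding with input columns of the same name, while the exploded form matches the flat layout that Apache Spark ML-style regressors and classifiers conventionally produce.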

License

JPMML-Evaluator-Spark is dual-licensed under the GNU Affero General Public License (AGPL) version 3.0, and a commercial license.

Additional information

JPMML-Evaluator-Spark is developed and maintained by Openscoring Ltd, Estonia.

Interested in using JPMML software in your application? Please contact info@openscoring.io.