jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

How to import the training data schema in libsvm format #116

Open sunxiaolongsf opened 3 years ago

sunxiaolongsf commented 3 years ago

I want to export the model with jpmml-sparkml. If the training data is in another format, I know I can use dataframe.schema; but my training data is in LibSVM format, so what should I pass to this function:

val pmml = new PMMLBuilder(schema, pipelineModel).build()

vruusmann commented 3 years ago

If the training data is in another format

Have you verified that other parts of the PMML conversion workflow work as expected?

For example, if the training data contains vector columns, then the JPMML-SparkML library will probably raise an error about it.

but my training data is in LibSVM format, so what should I pass to this function

Convert the training data to proper Dataset<Row> representation, and then proceed as usual?

Alternatively, you may construct the schema descriptor (a StructType object) yourself. It's easiest to get it via Dataset#schema(), but if that's not an option, you can always build it by hand.
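For example, a minimal sketch of building a StructType by hand (the label/x1/x2 column names are illustrative, not prescribed by JPMML-SparkML):

    import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

    // A hand-built schema: one string label plus two scalar double features
    val schema = StructType(Seq(
      StructField("label", StringType, nullable = false),
      StructField("x1", DoubleType, nullable = false),
      StructField("x2", DoubleType, nullable = false)
    ))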

vruusmann commented 3 years ago

Anyhow, if you want me to look deeper into this, then please provide a small self-contained & fully reproducible example.

You could convert the Iris dataset to LibSVM data format, and then report everything that's going wrong with it.

Otherwise, I simply won't have the time.

sunxiaolongsf commented 3 years ago

OK, thanks for your reply!

1. Load and parse the LibSVM data file, converting it to a DataFrame.

    val data = spark.read.format("libsvm").load("./src/main/resources/data/sample_libsvm_data.txt")
    println("read libsvm first:" + data.first())
    data.show()
    // first is (1.0,(438,[7,53,101,166,250,312,412],[4.0,2156.0,1927.0,73.0,804.0,477.0,415.0]))
    //
    // +-----+--------------------+
    // |label|            features|
    // +-----+--------------------+
    // |  1.0|(438,[7,53,101,16...|
    // |  0.0|(438,[59,124,191,...|
    // |  0.0|(438,[5,17,91,192...|

2.1 labelIndexer

    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(data)

2.2 featureIndexer

    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(4)
      .fit(data)

    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
3. Train a GBT model.

3.1 Set up the model.

    val gbt = new GBTClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("indexedFeatures")
      .setMaxIter(10)

3.2 Convert indexed labels back to original labels.

    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

3.3 Chain indexers and GBT in a Pipeline.

    val pipeline = new Pipeline()
      .setStages(Array(labelIndexer, featureIndexer, gbt, labelConverter))

3.4 Train model. This also runs the indexers.

    val model = pipeline.fit(trainingData)
    println(model.getClass)
4. Save the model to PMML.

4.1 Using the LibSVM data's Dataset#schema raises the error "Expected string, integral, double or boolean type, got vector type".

    println("save model to pmml")
    val pmmlPath = "./src/main/resources/data/spark2pmml.pmml"
    val pmml = new PMMLBuilder(data.schema, model).build()
    JAXBUtil.marshalPMML(pmml, new StreamResult(new FileOutputStream(pmmlPath)))

4.2 Constructing the schema manually raises the error "Field "features" does not exist.".

    val newSchema = getLibsvmSchema(8)
    // StructType(StructField(label,DoubleType,true), StructField(col1,DoubleType,true),
    //   StructField(col2,DoubleType,true), StructField(col3,DoubleType,true), StructField(col4,DoubleType,true),
    //   StructField(col5,DoubleType,true), StructField(col6,DoubleType,true), StructField(col7,DoubleType,true))
    savePmml(newSchema, model, "./src/main/resources/data/spark2pmml.pmml")

So, I want to know how to save the model to PMML when the training data is in LibSVM format like this. Thanks!

vruusmann commented 3 years ago

4.1 Using the LibSVM data's Dataset#schema raises the error "Expected string, integral, double or boolean type, got vector type".

I told you something like this would happen.

See https://github.com/jpmml/jpmml-sparkml/issues/26 and friends.

val newSchema = getLibsvmSchema(8)

You're misrepresenting the data here, aren't you?

Your data frame contains a single n-element vector column, but the newSchema claims that there are n separate scalar (double) columns.

vruusmann commented 3 years ago

The fix would be to add support for the ArrayType column type.

I've refused to do it in earlier years, but maybe I'll do it this year.
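For reference, such a column is declared with Apache Spark's ArrayType; a minimal sketch of what a supported schema could look like (the column names follow the example above):

    import org.apache.spark.sql.types.{ArrayType, DoubleType, StructField, StructType}

    // "features" as a plain array of doubles, instead of a vector (VectorUDT) column
    val arraySchema = StructType(Seq(
      StructField("label", DoubleType, nullable = false),
      StructField("features", ArrayType(DoubleType, containsNull = false), nullable = false)
    ))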

vruusmann commented 3 years ago

@sunxiaolongsf To answer your original question ("how to handle LibSVM data"), then you'd still need to manually unpack all array/vector columns to scalar columns before performing the conversion to PMML.

Rough outline:

  1. Load dataset in LibSVM format. It gives you vector columns.
  2. Unpack each and every n-element vector column to n scalar columns (typically double columns).
  3. Fit the Apache Spark ML pipeline on the step 2 data frame.
  4. Get the schema of the step 2 data frame, and perform the conversion to PMML.

The resulting PMML will contain information about step 2 onward. It knows nothing about the vector columns of step 1.
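A minimal sketch of steps 1 and 2, assuming Apache Spark 3.0+ (for org.apache.spark.ml.functions.vector_to_array); the f_{n} column names are illustrative:

    import org.apache.spark.ml.functions.vector_to_array
    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.functions.col

    // Step 1: load the LibSVM dataset; this gives a "features" vector column
    val raw = spark.read.format("libsvm").load("sample_libsvm_data.txt")

    // Step 2: unpack the n-element vector column into n scalar double columns
    val numFeatures = raw.first().getAs[Vector]("features").size
    val unpacked = raw
      .withColumn("features_arr", vector_to_array(col("features")))
      .select(col("label") +: (0 until numFeatures).map(i => col("features_arr").getItem(i).alias(s"f_${i + 1}")): _*)

    // Steps 3 and 4: fit the pipeline on the unpacked data frame, then convert using its schema
    // val pipelineModel = pipeline.fit(unpacked)
    // val pmml = new PMMLBuilder(unpacked.schema, pipelineModel).build()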

githubthunder commented 3 months ago

HI, @vruusmann

Does jpmml-sparkml support the LibSVM format or the vector data type now?

vruusmann commented 3 months ago

@githubthunder The issue is still in "open" state, meaning that there hasn't been any major work done towards addressing it.

Anyway, what's wrong with the workflow suggested in my earlier comment (https://github.com/jpmml/jpmml-sparkml/issues/116#issuecomment-863460805)? It lets you use a LibSVM dataset, if you're willing to throw in a couple of lines of data manipulation code.

The main issue with the LibSVM data format (and the vector data/column type) is that it is effectively schema-less.

The PMML standard is about structured ML applications (think: statistics). And structured ML applications require basic information/understanding about the underlying data, such as column names, column types, etc.

githubthunder commented 3 months ago

HI, @vruusmann thanks for your replies

If n is very large (meaning there are many features), the dataframe will have numerous columns, which could lead to excessive use of space.

Can jpmml-sparkml provide an interface that accepts a schema of 'label: DOUBLE, features: vector,' where the vector may be in sparse format, and automatically generates feature names based on the vector order, similar to 'f_1, f_2, ...'?
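For concreteness, the requested schema is already expressible on the Apache Spark side; a sketch using org.apache.spark.ml.linalg.SQLDataTypes.VectorType (what's missing is JPMML-SparkML support for converting it):

    import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
    import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

    // The schema that a LibSVM load produces: a double label plus one vector column
    val libsvmSchema = StructType(Seq(
      StructField("label", DoubleType, nullable = false),
      StructField("features", VectorType, nullable = false)
    ))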

vruusmann commented 3 months ago

Older discussion(s) regarding vector columns: https://github.com/jpmml/jpmml-sparkml/issues/21

vruusmann commented 3 months ago

Can jpmml-sparkml automatically generate feature names based on the vector order, similar to 'f_1, f_2, ...'?

Something like that. When the converter comes across a vector column, it would automatically expand it into a list of org.jpmml.converter.ContinuousFeature objects, one for each vector element. The name would be synthetic (x_{n} or f_{n}, or whatever), and the data type would be inherited from the vector's element type (can you have a float vector in Apache Spark these days, or are they all double vectors?).

The biggest obstacle in implementing vector column support is that the VectorUDT type does not carry any information about the "vector size". That is, there is no VectorUDT#getSize() method.

Without this information, the converter does not know how many ContinuousFeature objects to create.

vruusmann commented 3 months ago

The biggest obstacle in implementing vector column support is that the VectorUDT type does not carry any information about the "vector size". That is, there is no VectorUDT#getSize() method.

The workaround would be to require that the pipeline contain a VectorSizeHint transformer.
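For reference, attaching a VectorSizeHint to the example pipeline would look roughly like this (the column name and size are taken from the LibSVM example above):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.VectorSizeHint

    // Declares statically that the "features" vector column has 438 elements,
    // making the vector size known to consumers of the pipeline
    val sizeHint = new VectorSizeHint()
      .setInputCol("features")
      .setSize(438)

    val pipeline = new Pipeline()
      .setStages(Array(sizeHint, labelIndexer, featureIndexer, gbt, labelConverter))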

githubthunder commented 3 months ago

HI, @vruusmann Thanks again for your replies and work

If the VectorUDT type does not carry any information about the "vector size", maybe the interface could accept an input parameter "numFeatures", provided by the user. If numFeatures > 0, the interface would use this value for its calculations; otherwise it would follow the current processing logic.

The code might look as follows:

// get the number of features
val data = spark.read.format("libsvm").load("sample_libsvm_data.txt")
val vector = data.first().getAs[org.apache.spark.ml.linalg.Vector]("features")
val numFeatures = vector.size

// train the machine learning model
......

// export the model in PMML format
val pmml = new PMMLBuilder(training.schema, pipelineModel, numFeatures).build()
// or
val pmml = new PMMLBuilder(training.schema, pipelineModel).build(numFeatures)

JAXBUtil.marshalPMML(pmml, new StreamResult(System.out))

vruusmann commented 3 months ago

It's not permitted to change the PMMLBuilder constructor, or the build method (because those are guaranteed to stay stable for 5+ years).

But since we're dealing with a builder pattern here, it's possible to add more "configuration" methods.

For example, the PMMLBuilder could allow you to manually specify the Apache-Spark-to-(J)PMML mapping for individual columns:

List<Feature> listOfScalarFeatures = new ArrayList<>();
for(int i = 0; i < numFeatures; i++){
  // One synthetic scalar feature per vector element: f_1, f_2, ...
  listOfScalarFeatures.add(new ContinuousFeature(encoder, "f_" + String.valueOf(i + 1), DataType.DOUBLE));
}

PMMLBuilder pmmlBuilder = new PMMLBuilder(training.schema, pipelineModel)
  // THIS!
  .defineColumn("features", listOfScalarFeatures);

PMML pmml = pmmlBuilder.build();

githubthunder commented 3 months ago

HI, @vruusmann thanks again

Maybe the "defineColumn" method is a simple and effective solution for supporting the sparse format. I'm really looking forward to your work. Also, could you add the "defineColumn" feature to the older versions as well?