sunxiaolongsf opened this issue 3 years ago

I want to export the model in jpmml-sparkml. If the training data is in another format, I know to use dataframe.schema, but my training data is in LibSVM format. What should I pass in to this function?
> If the training data is in another format
Have you verified that other parts of the PMML conversion workflow work as expected?
For example, if the training data contains vector columns, then the JPMML-SparkML library will probably raise an error about it.
> but my training data is in LibSVM format, what should I pass in to this function
Convert the training data to a proper `Dataset<Row>` representation, and then proceed as usual?

Alternatively, you may construct the schema descriptor (a `StructType` object) manually. It's easiest to get it via `Dataset#schema()`, but if that's not an option, you can always build it by hand.
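For reference, a minimal sketch of what such a hand-built schema descriptor could look like for LibSVM data (it mirrors what `spark.read.format("libsvm")` reports via `Dataset#schema()`):

```scala
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// The LibSVM reader yields a double "label" column plus a single
// "features" vector column.
val libsvmSchema = StructType(Seq(
  StructField("label", DoubleType, nullable = true),
  StructField("features", VectorType, nullable = true)
))
```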
Anyhow, if you want me to look deeper into this, then please provide a small self-contained & fully reproducible example.
You could convert the Iris dataset to LibSVM data format, and then report everything that's going wrong with it.
Otherwise, I simply won't have the time.
OK, thanks for your reply!
val data = spark.read.format("libsvm").load("./src/main/resources/data/sample_libsvm_data.txt")
println("read libsvm first:" + data.first())
data.show()
// first row is (1.0,(438,[7,53,101,166,250,312,412],[4.0,2156.0,1927.0,73.0,804.0,477.0,415.0]))
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  1.0|(438,[7,53,101,16...|
|  0.0|(438,[59,124,191,...|
|  0.0|(438,[5,17,91,192...|
+-----+--------------------+

2.1 labelIndexer
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)
2.2 featureIndexer
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
val gbt = new GBTClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setMaxIter(10)
3.2 Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)
3.3 Chain indexers and GBT in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, gbt, labelConverter))
3.4 Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)
println(model.getClass)
4.1 Using the LibSVM data's Dataset#schema() raises the error "Expected string, integral, double or boolean type, got vector type"
println("save model to pmml")
val pmmlPath = "./src/main/resources/data/spark2pmml.pmml"
val pmml = new PMMLBuilder(data.schema, model).build()
JAXBUtil.marshalPMML(pmml, new StreamResult(new FileOutputStream(pmmlPath)))
4.2 Constructing the schema manually raises the error "Field "features" does not exist."
val newSchema = getLibsvmSchema(8)
// StructType(StructField(label,DoubleType,true), StructField(col1,DoubleType,true),
// StructField(col2,DoubleType,true), StructField(col3,DoubleType,true), StructField(col4,DoubleType,true),
// StructField(col5,DoubleType,true), StructField(col6,DoubleType,true), StructField(col7,DoubleType,true))
savePmml(newSchema,model,"./src/main/resources/data/spark2pmml.pmml")
So, I want to know how to save the model to PMML when the training data is in LibSVM format like this. Thanks!
> 4.1 Using the LibSVM data's Dataset#schema() raises the error "Expected string, integral, double or boolean type, got vector type"
I told you that something like this would happen.
See https://github.com/jpmml/jpmml-sparkml/issues/26 and friends.
> val newSchema = getLibsvmSchema(8)
You're misrepresenting the data here, aren't you? Your data frame contains a single n-element vector column, but the `newSchema` claims that there are n separate scalar (double) columns.
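For illustration, the schema that the LibSVM reader actually produces has just two columns, regardless of the number of features:

```scala
data.printSchema()
// root
//  |-- label: double (nullable = true)
//  |-- features: vector (nullable = true)
```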
The fix would be to add support for the `ArrayType` column type.
I've refused to do it in earlier years, but maybe I'll do it this year.
@sunxiaolongsf To answer your original question ("how to handle LibSVM data"), then you'd still need to manually unpack all array/vector columns to scalar columns before performing the conversion to PMML.
Rough outline:
1. Load the LibSVM data file into a data frame (a label column plus a single vector column).
2. Manually unpack the vector column into scalar (double) columns (see the sketch below).
3. Fit the pipeline on those scalar columns, and convert it to PMML.

The resulting PMML will contain information about step 2 onward. It knows nothing about the vector columns of step 1.
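A minimal sketch of step 2, assuming Spark 3.0+ for the `vector_to_array` function; the `f_1 .. f_n` column names are arbitrary:

```scala
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.col

val data = spark.read.format("libsvm").load("sample_libsvm_data.txt")

// Determine the vector size from the first row.
val numFeatures = data.first().getAs[Vector]("features").size

// Expand the "features" vector column into scalar (double) columns f_1 .. f_n.
val scalarCols = (0 until numFeatures)
  .map(i => vector_to_array(col("features"))(i).alias(s"f_${i + 1}"))
val unpacked = data.select((col("label") +: scalarCols): _*)
```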
Hi, @vruusmann
Does jpmml-sparkml support the LibSVM format or the vector data type now?
@githubthunder The issue is still in "open" state, meaning that there hasn't been any major work done towards addressing it.
Anyway, what's wrong with the workflow suggested in my earlier comment (https://github.com/jpmml/jpmml-sparkml/issues/116#issuecomment-863460805)? It lets you use a LibSVM dataset, if you're willing to throw in a couple of lines of data manipulation code.
The main issue with the LibSVM data format (and the vector data/column type) is that it is effectively schema-less.
The PMML standard is about structured ML applications (think: statistics). And structured ML applications require basic information/understanding about the underlying data, such as column names, column types, etc.
Hi, @vruusmann, thanks for your replies.
If n is very large (meaning there are many features), the data frame will have numerous columns, which could lead to excessive use of space.
Can jpmml-sparkml provide an interface that accepts a schema of `label: DOUBLE, features: vector`, where the vector may be in sparse format, and automatically generates feature names based on the vector order, similar to `f_1, f_2, ...`?
Older discussion(s) regarding vector columns: https://github.com/jpmml/jpmml-sparkml/issues/21
> Can jpmml-sparkml automatically generate feature names based on the vector order, similar to 'f_1, f_2, ...'?
Something like that. When the converter comes across a vector column, it would automatically expand it into a list of `org.jpmml.converter.ContinuousFeature` objects, one for each vector element. The name would be synthetic (`x_{n}` or `f_{n}`, or whatever), and the data type would be inherited from the vector's element type (can you have `float` vectors in Apache Spark these days, or are they all `double` vectors?).
The biggest obstacle in implementing vector column support is that the `VectorUDT` type does not carry any information about the "vector size". That is, there is no `VectorUDT#getSize()` method. Without this information, the converter does not know how many `ContinuousFeature` objects to create.
> The biggest obstacle in implementing vector column support is that the VectorUDT type does not carry any information about the "vector size". That is, there is no VectorUDT#getSize() method.
The workaround would be to require that the pipeline contains a `VectorSizeHint` transformer.
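If such a requirement were adopted, the front of the pipeline might look like this. `VectorSizeHint` is a standard Apache Spark transformer; its role as a size carrier for the converter is the hypothetical part (438 is the vector size of the example dataset above):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorSizeHint

// Declare the fixed size of the "features" vector up front.
val sizeHint = new VectorSizeHint()
  .setInputCol("features")
  .setSize(438)
  .setHandleInvalid("error")

// The converter could then read the size from the fitted pipeline.
val pipeline = new Pipeline()
  .setStages(Array(sizeHint, labelIndexer, featureIndexer, gbt, labelConverter))
```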
Hi, @vruusmann, thanks again for your replies and work.
If the VectorUDT type does not carry any information about the "vector size", maybe the interface could accept a "numFeatures" parameter provided by the user. If numFeatures > 0, the interface would use this value for its calculations; otherwise it would follow the current processing logic.
The code might look as follows:
// get the number of features
val data = spark.read.format("libsvm").load("sample_libsvm_data.txt")
val vector = data.first().getAs[org.apache.spark.ml.linalg.Vector]("features")
val numFeatures = vector.size
// train the machine learning model
......
// export model with pmml format
val pmml = new PMMLBuilder(training.schema, pipelineModel, numFeatures).build()
or
val pmml = new PMMLBuilder(training.schema, pipelineModel).build(numFeatures)
JAXBUtil.marshalPMML(pmml, new StreamResult(System.out))
It's not permitted to change the `PMMLBuilder` constructor, or the `build` method (because that's what is guaranteed to stay stable for 5+ years). But since we're dealing with a builder pattern here, it's possible to add more "configuration" methods.

For example, the `PMMLBuilder` could allow you to manually specify the Apache-Spark-to-(J)PMML mapping for individual columns:
List<Feature> listOfScalarFeatures = new ArrayList<>();
for(int i = 0; i < numFeatures; i++){
    listOfScalarFeatures.add(new ContinuousFeature(encoder, "f_" + String.valueOf(i + 1), DataType.DOUBLE));
}

PMMLBuilder pmmlBuilder = new PMMLBuilder(training.schema, pipelineModel)
    // THIS!
    .defineColumn("features", listOfScalarFeatures);

PMML pmml = pmmlBuilder.build();
Hi, @vruusmann, thanks again.
Maybe the "defineColumn" method is the simple and effective solution to support the sparse format. I am really looking forward to your work. Also, could you add the "defineColumn" feature to the older versions as well?