Closed: obones closed this issue 7 years ago
Duplicate of https://github.com/jpmml/jpmml-sparkml/issues/18 and https://github.com/jpmml/jpmml-sparkml/issues/2 (and probably some others)
I see that it comes from the fact that jpmml-sparkml does not know how to handle our dataframe. What metadata are we missing so that our models can be exported to PMML?
The VectorUDT data type does not provide an adequate description of your dataframe. At minimum, it would be necessary to know the number of columns in your dataframe, but there is no method VectorUDT#numDimensions (or similar).
Perhaps it will be possible to create a subclass of VectorUDT that does so.
As a workaround, we can have "split" data and use a VectorAssembler but it uses some computation time that we feel is a bit wasted.
You can waste computation time, or you can waste your own time.
If you think that your time is more abundant than computer time, then you can try creating a synthetic dataframe schema definition, as explained here: https://github.com/jpmml/jpmml-sparkml/issues/18#issuecomment-310727514
Thanks, I'll see what I can do with the "synthetic definition", as using a VectorAssembler adds a time penalty of anywhere from 1 to 10%.
You don't need to embed and execute the VectorAssembler transformation in your actual data pipeline.
The idea is to create a pair of "synthetic" StructType and PipelineModel objects based on the actual schema and the fitted pipeline model objects. This synthetic PipelineModel object contains a synthetic VectorAssembler stage in the first position, which references columns in your synthetic StructType object. The important point is that VectorAssembler makes the number of columns in your dataframe known to JPMML-SparkML via the VectorAssembler#inputCols() parameter.
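A minimal PySpark sketch of that idea follows. It is an illustration under stated assumptions, not the code from the linked comment: the helper names, the x0, x1, ... column naming, the double-typed label column, and the "label"/"features" column names are all hypothetical. It also assumes an active Spark session and a fitted model whose estimator reads its input from the features column.

```python
def synthetic_column_names(num_features, prefix="x"):
    """Return placeholder scalar column names, one per vector element.

    The naming scheme is hypothetical; any unique names would do.
    """
    return [f"{prefix}{i}" for i in range(num_features)]


def build_synthetic_pair(fitted_model, num_features,
                         label_col="label", features_col="features"):
    """Build a (schema, pipeline model) pair that is used for export only.

    No real data is transformed here; the training pipeline stays unchanged.
    """
    # PySpark is imported lazily so the naming helper above works without it.
    from pyspark.ml import PipelineModel
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql.types import DoubleType, StructField, StructType

    names = synthetic_column_names(num_features)

    # Synthetic schema: the label plus one scalar column per vector element
    # (assumption: both the label and the features are doubles).
    schema = StructType(
        [StructField(label_col, DoubleType(), False)]
        + [StructField(name, DoubleType(), False) for name in names]
    )

    # Synthetic first stage: it only declares the input columns, so the
    # exporter can learn how many features the vector column holds.
    assembler = VectorAssembler(inputCols=names, outputCol=features_col)

    # Prepend the assembler to the stages of the already fitted model.
    synthetic_model = PipelineModel(stages=[assembler] + list(fitted_model.stages))
    return schema, synthetic_model
```

The returned pair would then be handed to the JPMML-SparkML converter in place of the real schema and fitted model. The num_features argument must match the length of the vectors your datasource produces.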
Anyway, if a 10% time penalty is such a huge deal for your use case, then you should probably be avoiding the PMML approach.
Hello,
We are using Spark with a custom datasource that directly gives a label, vector(features) dataframe, which saves us from using a VectorAssembler in the pipeline. While this works just fine to train ML models, we can't export them to PMML using jpmml-sparkml because we receive this error:
java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got vector type
Looking around on various sites, I see that it comes from the fact that jpmml-sparkml does not know how to handle our dataframe. What metadata are we missing so that our models can be exported to PMML?
As a workaround, we can have "split" data and use a VectorAssembler, but it uses some computation time that we feel is a bit wasted.