jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

Unsupported vector type on datasource that provides it #21

Closed obones closed 7 years ago

obones commented 7 years ago

Hello,

We are using Spark with a custom datasource that directly provides a (label, features vector) dataframe, which saves us from using a VectorAssembler in the pipeline. While this works just fine for training ML models, we can't export them to PMML using jpmml-sparkml because we receive this error: java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got vector type

Looking around on various sites, I see that it comes from the fact that jpmml-sparkml does not know how to handle our dataframe. What metadata are we missing so that our models can be exported to PMML?

As a workaround, we can have "split" data and use a VectorAssembler but it uses some computation time that we feel is a bit wasted.

vruusmann commented 7 years ago

Duplicate of https://github.com/jpmml/jpmml-sparkml/issues/18 and https://github.com/jpmml/jpmml-sparkml/issues/2 (and probably some others)

> I see that it comes from the fact that jpmml-sparkml does not know how to handle our dataframe. What metadata are we missing so that our models can be exported to PMML?

The VectorUDT data type does not provide an adequate description of your dataframe. At minimum, it would be necessary to know the number of elements in your vector column (that is, the number of feature columns it stands for), but there is no method VectorUDT#numDimensions (or similar).

Perhaps it will be possible to create a subclass of VectorUDT that does so.

> As a workaround, we can have "split" data and use a VectorAssembler but it uses some computation time that we feel is a bit wasted.

You can waste computation time, or you can waste your own time.

If you think that your time is more abundant than computer time, then you can try creating a synthetic dataframe schema definition, as explained here: https://github.com/jpmml/jpmml-sparkml/issues/18#issuecomment-310727514

obones commented 7 years ago

Thanks, I'll see what I can do with the "synthetic definition" as using a VectorAssembler adds anywhere from 1 to 10% time penalty.

vruusmann commented 7 years ago

You don't need to embed and execute the VectorAssembler transformation in your actual data pipeline.

The idea is to create a pair of "synthetic" StructType and PipelineModel objects based on actual schema and fitted pipeline model objects. This synthetic PipelineModel object contains a synthetic VectorAssembler stage in the first position, which references columns in your synthetic StructType object. The important point is that VectorAssembler makes the number of columns in your dataframe known to JPMML-SparkML via the VectorAssembler#inputCols() parameter.
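A minimal PySpark sketch of that idea follows; it is not taken from JPMML-SparkML itself. The helper names `synthetic_column_names` and `build_synthetic`, the `x1..xN` naming scheme, and the choice to type every synthetic column as a non-nullable double are illustrative assumptions. The sketch also assumes that no existing column already uses one of the generated names and that an active Spark session exists when `build_synthetic` is called.

```python
from typing import List


def synthetic_column_names(num_features: int, prefix: str = "x") -> List[str]:
    """Return one placeholder name per element of the features vector."""
    if num_features < 1:
        raise ValueError("num_features must be positive")
    return [f"{prefix}{i + 1}" for i in range(num_features)]


def build_synthetic(schema, pipeline_model, features_col, num_features):
    """Return a (synthetic StructType, synthetic PipelineModel) pair.

    schema         -- StructType of the real dataframe (has the vector column)
    pipeline_model -- PipelineModel fitted on the real dataframe
    features_col   -- name of the vector column, e.g. "features"
    num_features   -- length of each feature vector (must be known by the caller)
    """
    # Imported lazily so the naming helper above works without Spark installed.
    from pyspark.ml import PipelineModel
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql.types import DoubleType, StructField, StructType

    names = synthetic_column_names(num_features)

    # Keep every non-vector column (such as the label) and replace the
    # vector column with one scalar column per vector element.
    kept = [f for f in schema.fields if f.name != features_col]
    scalars = [StructField(n, DoubleType(), False) for n in names]
    synthetic_schema = StructType(kept + scalars)

    # This assembler is never run on data; it only records the input
    # column names (and thus their count) for the converter to read.
    assembler = VectorAssembler(inputCols=names, outputCol=features_col)
    stages = [assembler] + list(pipeline_model.stages)
    return synthetic_schema, PipelineModel(stages=stages)
```

The synthetic pair, rather than the real schema and fitted model, is what would then be handed to the PMML converter. The training and scoring pipeline itself stays unchanged, so no VectorAssembler cost is paid at runtime.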

Anyway, if a 10% time penalty is such a huge deal for your use case, then you should probably avoid the PMML approach.