jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

How can I write a converter for my custom transformer in Scala? #68

Closed rollingdeep closed 5 years ago

rollingdeep commented 5 years ago

This may be an unsolved problem in jpmml-sparkml. I know of a workaround that explodes the vector into separate columns (f1, f2, ..., fn), but I think it is wasteful. I'm trying to write a transformer and a converter to fix this. First, I want to read libsvm format data from Hive. I have 3 columns: id, libsvm, dt. Second, I tried to transform the libsvm string column into two columns: features and label. Third, I trained a pipeline model with the transformer and a logistic regression model, and it succeeded! But exporting the PMML file fails. I have read your source code and know that I need to write a converter and add a config entry, but I still don't know how to write a converter for my transformer. If you have no time to fix this, can you give me some instructions for implementing it?

# hive data examples
gid      libsvm             dt
1         1 1:1 3:1 7:1    20190601
2         0 1:1 2:1  3:1   20190601
3         1 1:1 5:1  7:1   20190601
# my transformer transform
gid      libsvm             dt                label          features
1         1 1:1 3:1 7:1    20190601    1                [1, 0, 1, 0, 0, 0, 1]
2         0 1:1 2:1  3:1   20190601    0                [1, 1, 1, 0, 0, 0, 0]
3         1 1:1 5:1  7:1   20190601    1                [1, 0, 0, 0, 1, 0, 1]
# then feed into the logistic regression model
# val pipeline = new Pipeline().setStages(Array(mytransformer, lr))
#   # this ran and gave the right result.
# when exporting PMML, an exception was thrown from the encodeFeatures function in my converter class.
public List<Feature> encodeFeatures(SparkMLEncoder encoder){
        Interaction transformer = getTransformer();
        // TODO
        // libsvm is my inputCol, and I can get it through the getInputCol() function.
        // I need to break it into label (double type) and features (Vector or Array type),
        // then generate the feature fields and return a List<Feature>.

I hope for your reply. Feel free to contact me by email if you would rather discuss this privately.

vruusmann commented 5 years ago

> Second, I tried to transform the libsvm string column into two columns: features and label.

You should split this single libsvm column into separate features and label columns using regular Apache Spark APIs.
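The split itself is simple string parsing. Below is a minimal, library-free Java sketch of that logic, assuming 1-based libsvm indices and a known feature count; the class name `LibsvmLine` and its `parse` method are hypothetical and are not part of Apache Spark or JPMML-SparkML. In a real pipeline the same logic would be applied to the column with ordinary Spark column functions or a UDF.

```java
import java.util.Arrays;

// Hypothetical helper: holds the two parts of one libsvm-formatted row.
class LibsvmLine {
    final double label;
    final double[] features;

    LibsvmLine(double label, double[] features) {
        this.label = label;
        this.features = features;
    }

    // Parses "label idx:val idx:val ..." into a label and a dense feature array.
    static LibsvmLine parse(String line, int numFeatures) {
        String[] tokens = line.trim().split("\\s+");
        double label = Double.parseDouble(tokens[0]);
        double[] features = new double[numFeatures];
        for (int i = 1; i < tokens.length; i++) {
            String[] pair = tokens[i].split(":", 2);
            int index = Integer.parseInt(pair[0]) - 1; // libsvm indices are 1-based
            features[index] = Double.parseDouble(pair[1]);
        }
        return new LibsvmLine(label, features);
    }

    public static void main(String[] args) {
        LibsvmLine row = parse("1 1:1 3:1 7:1", 7);
        // prints: 1.0 [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
        System.out.println(row.label + " " + Arrays.toString(row.features));
    }
}
```

Once label and features live in separate columns, only the features column needs to pass through feature transformers.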

A transformer should only act on the features column (independent variables). This is reflected in the JPMML-SparkML library, where the method org.jpmml.sparkml.FeatureConverter#encodeFeatures(SparkMLEncoder) only deals with the features part (the label part is handled by o.j.s.ModelConverter).

> But I still don't know how to write a converter for my transformer.

If you have a standalone features column, then you probably don't need your custom transformer anymore? Or if you think you do, then perhaps you could use standard tools instead.

Code-wise, the JPMML-SparkML implementation of your custom transformer should take inspiration from o.j.s.feature.VectorSlicerConverter and/or o.j.s.feature.VectorAssemblerConverter classes.
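As a rough illustration of what those two converters do conceptually: as far as I understand them, they do not compute anything new but rearrange lists of already-encoded features, one by picking entries by index and the other by concatenating per-column lists. The sketch below models this with plain Java lists, using strings as stand-ins for feature objects. The class `FeatureListSketch` and its methods are hypothetical and are not JPMML-SparkML API; a real converter would extend `FeatureConverter` and do this work inside `encodeFeatures(SparkMLEncoder)`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, library-free model of feature list rearrangement.
class FeatureListSketch {
    // Slicer idea: keep only the input features at the given positions, in that order.
    static <T> List<T> slice(List<T> features, int[] indices) {
        List<T> result = new ArrayList<>();
        for (int index : indices) {
            result.add(features.get(index));
        }
        return result;
    }

    // Assembler idea: concatenate the feature lists of several input columns.
    static <T> List<T> assemble(List<List<T>> columns) {
        List<T> result = new ArrayList<>();
        for (List<T> column : columns) {
            result.addAll(column);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("f1", "f2", "f3");
        List<String> b = Arrays.asList("g1");
        List<String> all = assemble(Arrays.asList(a, b));
        System.out.println(all);                          // [f1, f2, f3, g1]
        System.out.println(slice(all, new int[]{0, 3}));  // [f1, g1]
    }
}
```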

rollingdeep commented 5 years ago

Yes, I took inspiration from the classes you mentioned. I have finished the converter and built a jar, plus Python transformer wrappers for use in PySpark. I don't know yet whether it is right or wrong, so next I will run some tests. Thank you! Your work is awesome!

rollingdeep commented 5 years ago

I have also pushed a PR to your project. I am now using the version I pushed.