jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

How could I write a converter for my custom transformer in Scala? #68

Closed rollingdeep closed 5 years ago

rollingdeep commented 5 years ago

This may be an unsolved problem in jpmml-sparkml. I know of a workaround that explodes the vector into pieces (f1, f2, ..., fn), but I think that is wasteful. I'm trying to write a transformer and a converter to fix it. First, I want to read libsvm format data from Hive. I have 3 columns: id, libsvm, dt. Second, I tried to transform the libsvm string column into 2 columns: features, label. Third, I trained a pipeline model with the transformer and a logistic regression model, and it succeeded! But exporting the PMML file fails. I have read your source code, and I know that I need to write a converter and add a config entry to sparkml2pmml.properties. But I still don't know how to write a converter for my transformer. If you have no time to fix this, can you give me some instructions for implementing it?

# hive data examples
gid      libsvm             dt
1         1 1:1 3:1 7:1    20190601
2         0 1:1 2:1  3:1   20190601
3         1 1:1 5:1  7:1   20190601
# output of my transformer
gid      libsvm             dt                label          features
1         1 1:1 3:1 7:1    20190601    1                [1, 0, 1, 0, 0, 0, 1]
2         0 1:1 2:1  3:1   20190601    0                [1, 1, 1, 0, 0, 0, 0]
3         1 1:1 5:1  7:1   20190601    1                [1, 0, 0, 0, 1, 0, 1]
# then feed into the logistic regression model
# val pipeline = new Pipeline().setStages(Array(mytransformer, lr))
# pipeline.fit(df)   # this passed and gave the right result.
# when exporting PMML, an exception was thrown from the encodeFeatures function in my converter class.
public List<Feature> encodeFeatures(SparkMLEncoder encoder){
        Interaction transformer = getTransformer();
        // TODO
        // libsvm is my inputCol, and I can get it through the getInputCol() function.
        // I need to break it into label (double type) and features (Vector or Array type),
        // then generate the feature fields and return a List<Feature>.
}

Hoping for your reply. You can contact me by email at rollingdeep@yeah.net if you would rather keep this private.

vruusmann commented 5 years ago

> Second, I tried to transform libsvm string column to 2 columns: features, label.

You should split this single libsvm column into two columns, features and label, using regular Apache Spark APIs.
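The split itself is plain string parsing. Below is a minimal sketch of that logic in standalone Java, with no Spark dependency, so it only illustrates what a Spark-side step (for example a UDF) would need to compute. The class name `LibsvmSplit`, the `Parsed` holder, and the `numFeatures` parameter are hypothetical names introduced for illustration; the vector size has to be supplied by the caller because a libsvm line does not carry it.

```java
import java.util.Arrays;

public class LibsvmSplit {

    /** Holds the two parts of one libsvm record (hypothetical helper type). */
    public static final class Parsed {
        public final double label;
        public final double[] features;

        Parsed(double label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    /**
     * Parses "label idx:val idx:val ..." into a label and a dense vector.
     * numFeatures is an assumption supplied by the caller.
     */
    public static Parsed parse(String line, int numFeatures) {
        String[] tokens = line.trim().split("\\s+");
        double label = Double.parseDouble(tokens[0]);
        double[] features = new double[numFeatures]; // unset entries stay 0.0
        for (int i = 1; i < tokens.length; i++) {
            String[] pair = tokens[i].split(":", 2);
            if (pair.length != 2) {
                throw new IllegalArgumentException("Malformed entry: " + tokens[i]);
            }
            int index = Integer.parseInt(pair[0]) - 1; // libsvm indices are 1-based
            if (index < 0 || index >= numFeatures) {
                throw new IllegalArgumentException("Index out of range: " + tokens[i]);
            }
            features[index] = Double.parseDouble(pair[1]);
        }
        return new Parsed(label, features);
    }

    public static void main(String[] args) {
        // First row of the example data from the question.
        Parsed row = parse("1 1:1 3:1 7:1", 7);
        System.out.println(row.label + " " + Arrays.toString(row.features));
    }
}
```

In Spark, the same logic would typically run as a preprocessing step before the pipeline, so that the pipeline (and therefore the PMML export) only ever sees the ready-made features and label columns.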

A transformer should only act on the features column (independent variables). This is reflected in the JPMML-SparkML library, where the method org.jpmml.sparkml.FeatureConverter#encodeFeatures(SparkMLEncoder) only deals with the features part (the label part is handled by o.j.s.ModelConverter).

> But I still don't know how to write a converter for my transformer.

If you have a standalone features column, then you probably don't need your custom transformer anymore. Or if you think you do, then perhaps you could use standard tools like org.apache.spark.ml.feature.VectorSlicer or org.apache.spark.ml.feature.VectorAssembler instead.

Code-wise, the JPMML-SparkML implementation of your custom transformer should take inspiration from o.j.s.feature.VectorSlicerConverter and/or o.j.s.feature.VectorAssemblerConverter classes.

rollingdeep commented 5 years ago

Yes, I took inspiration from the classes you pointed to. I have finished the converter, built a jar, and written Python wrappers for the transformers for use in PySpark. I don't know yet whether it is right or wrong, so next I will run some tests. Thank you! Your work is awesome!

rollingdeep commented 5 years ago

I have also opened a pull request at https://github.com/jpmml/jpmml-sparkml-xgboost/pull/11. For now I am using the version I pushed.