jpmml / pyspark2pmml

Python library for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

AttributeError: 'ColumnX' object has no attribute '_to_java' #16

Closed yairdata closed 5 years ago

yairdata commented 5 years ago

Hi,

I am using the JPMML 1.5.0 uber JAR. I was able to export a simple pipeline model to a PMML file. Now I am trying to use a custom transformer as well:

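from pyspark.ml import Transformer
from pyspark.sql.functions import when

class WeightColumn(Transformer):
  def __init__(self):
    super(WeightColumn, self).__init__()

  def _transform(self, df):
    # ColumnX is 4 where id == 6, and id + 1 otherwise
    df = df.withColumn('ColumnX', when(df.id == 6, 4).otherwise(df.id + 1))
    return df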

It fails with the following message:

  File "test_pmml_export.py", line 95, in <module>
    pmmlBuilder = PMMLBuilder(spark.sparkContext, sparkTransformed, model) \
  File "/opt/pyspark2pmml/__init__.py", line 13, in __init__
    javaPipelineModel = pipelineModel._to_java()
  File "/opt/spark-2.4.0-bin-hadoop2.7/python/pyspark/ml/pipeline.py", line 316, in _to_java
    java_stages[idx] = stage._to_java()
AttributeError: 'ColumnX' object has no attribute '_to_java'
Failed with exit code: 1
Checking for local changes...

vruusmann commented 5 years ago

> AttributeError: 'ColumnX' object has no attribute '_to_java'

If you want to use a custom transformer, then you need to implement it throughout the Python -> PySpark -> PySpark2PMML -> JPMML-SparkML stack.

You've currently only implemented the first layer: Python. The other three layers are missing. The message "'$object' object has no attribute '_to_java'" is actually raised by the PySpark layer; execution hasn't even reached the PySpark2PMML layer yet.

Alternatively, your custom transformer appears to implement very simple "if-else" logic. This kind of logic can very likely be expressed using the standard SQLTransformer transformation class, so there is no need to mess with custom transformer classes at all.
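As a rough sketch of that SQLTransformer route, the "if-else" from the question could be written as a SQL CASE expression. The names STATEMENT and make_weight_stage are hypothetical, introduced only for this illustration, and the sketch assumes the input DataFrame has a numeric id column. Whether the resulting stage converts cleanly to PMML has not been checked here.

```python
# SQL equivalent of when(df.id == 6, 4).otherwise(df.id + 1).
# __THIS__ is the placeholder SQLTransformer substitutes with the input DataFrame.
STATEMENT = (
    "SELECT *, CASE WHEN id = 6 THEN 4 ELSE id + 1 END AS ColumnX "
    "FROM __THIS__"
)


def make_weight_stage():
    """Build the pipeline stage (hypothetical helper).

    pyspark is imported lazily so the statement can be inspected
    without a Spark installation.
    """
    from pyspark.ml.feature import SQLTransformer

    return SQLTransformer(statement=STATEMENT)
```

The returned stage would take the place of the custom WeightColumn stage in the Pipeline's stages list; since SQLTransformer is a built-in PySpark class, it does not hit the missing "_to_java" problem.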

yairdata commented 5 years ago

@vruusmann thanks for the answer, but what do you mean by implementing it throughout the Python -> PySpark -> PySpark2PMML -> JPMML-SparkML stack? I extended pyspark.ml.Transformer as in https://stackoverflow.com/questions/51415784/how-to-add-my-own-function-as-a-custom-stage-in-a-ml-pyspark-pipeline. Can you give an example of such an implementation?

vruusmann commented 5 years ago

> what do you mean by implementing it throughout the Python -> PySpark -> PySpark2PMML -> JPMML-SparkML stack?

You need to move step by step. Currently, your transformer is not even supported/recognized by the PySpark layer (i.e., if you ignore/delete everything about PMML conversion, your pipeline still wouldn't work).
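To see where the error comes from, here is a minimal pure-Python sketch of the failure mode. It does not use PySpark at all: JavaBackedStage, PurePythonStage and pipeline_to_java are hypothetical stand-ins that only mimic the loop shown in the traceback (java_stages[idx] = stage._to_java()).

```python
class JavaBackedStage:
    """Stand-in for a built-in stage that has a JVM counterpart."""

    def _to_java(self):
        return "java-stage"


class PurePythonStage:
    """Stand-in for a custom stage that only defines Python-side logic."""

    def _transform(self, df):
        return df


def pipeline_to_java(stages):
    # Mimics the traceback: every stage is asked for its Java counterpart.
    return [stage._to_java() for stage in stages]


try:
    pipeline_to_java([JavaBackedStage(), PurePythonStage()])
    error = None
except AttributeError as exc:
    # The stage without _to_java stops the whole conversion.
    error = str(exc)

print(error)
```

In this sketch the conversion fails as soon as it reaches the stage without a _to_java method, before any PMML-specific code would get a chance to run.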