jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

Add support for `cast` SQL function #63

Closed fttt closed 5 years ago

fttt commented 5 years ago

Version info: Spark 2.4.0, JPMML 1.5.0

I want to change a column's type inside a PipelineModel. The cast succeeds when the PipelineModel transforms the data, but the PMML build fails. Can anyone help me? Thanks!

When I run this code:

import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.jpmml.sparkml.PMMLBuilder

import scala.collection.mutable.ListBuffer

import org.apache.spark.ml.feature.SQLTransformer
import org.apache.spark.sql.SparkSession

object aaaa {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .master("local")
      .enableHiveSupport()
      .getOrCreate()

    val df = spark.createDataFrame(
      Seq((0, 1.0, 3.0), (1, 2.0, 1.0))).toDF("id", "v1", "v2")
    df.printSchema()
    val sqlTrans = new SQLTransformer()
      .setStatement(
      "SELECT cast(id as double),v1,v2 FROM __THIS__")

    val stagesArray = new ListBuffer[PipelineStage]()

    stagesArray.append(sqlTrans)
    val ar = Array("v1","v2")
    val assembler = new VectorAssembler().setInputCols(ar).setOutputCol("features")
    stagesArray.append(assembler)

    val gbt = new GBTClassifier()
      .setLabelCol("id")
      .setFeaturesCol("features")
      .setMaxDepth(2)
      .setMinInstancesPerNode(1)
      .setSeed(2)
      .setMaxIter(3)
      .setStepSize(0.01)
      .setSubsamplingRate(1)

    stagesArray.append(gbt)

    val pp = new Pipeline().setStages(stagesArray.toArray)
    val ppmodel = pp.fit(df)
    ppmodel.transform(df).show()
    ppmodel.stages(0).transform(df).show()
    println(ppmodel.stages.size)
    val schema = df.schema
    new PMMLBuilder(schema, ppmodel).build()

  }

}

It returns this error:

Exception in thread "main" java.lang.IllegalArgumentException: cast(id#54 as double)
at org.jpmml.sparkml.ExpressionTranslator.translateInternal(ExpressionTranslator.java:229)
at org.jpmml.sparkml.ExpressionTranslator.translate(ExpressionTranslator.java:72)
at org.jpmml.sparkml.ExpressionTranslator.translate(ExpressionTranslator.java:67)
at org.jpmml.sparkml.feature.SQLTransformerConverter.encodeFeatures(SQLTransformerConverter.java:110)
at org.jpmml.sparkml.feature.SQLTransformerConverter.registerFeatures(SQLTransformerConverter.java:141)
at org.jpmml.sparkml.PMMLBuilder.build(PMMLBuilder.java:110)
at aaaa$.main(aaaa.scala:33)
at aaaa.main(aaaa.scala)

vruusmann commented 5 years ago

The cast SQL function is not yet implemented.

In the meantime, the SQL-to-PMML translation component should throw a more meaningful exception here (e.g. "function XYZ is not yet implemented").
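A minimal sketch of what such an error could look like, written in Python for brevity rather than in the library's Java. The function set, the `UnsupportedFunctionError` class, and `check_supported` are hypothetical names used for illustration only; none of them are part of JPMML-SparkML.

```python
# Hypothetical, illustrative subset of SQL functions a translator might handle.
SUPPORTED_FUNCTIONS = {"substring", "concat", "lower", "upper"}


class UnsupportedFunctionError(ValueError):
    """Raised when an expression uses a function the translator cannot convert."""


def check_supported(function_name, expression_text):
    """Fail early, naming both the function and the offending expression."""
    if function_name.lower() not in SUPPORTED_FUNCTIONS:
        raise UnsupportedFunctionError(
            f"Spark SQL function '{function_name}' in expression "
            f"'{expression_text}' is not yet implemented"
        )


try:
    check_supported("cast", "cast(id#54 as double)")
except UnsupportedFunctionError as err:
    error_message = str(err)
```

Naming the function and the full expression in the message lets the user see immediately which part of the SQL statement blocked the conversion.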

psxmc6 commented 3 years ago

Hi Villu,

First of all, thank you for such a useful suite of libraries!

I would like to ask whether support for the above-mentioned SQL cast will become available anytime soon. It would greatly expand the data pre-processing capabilities of the Spark SQL layer.

Currently, any implicit or explicit cast yields the following error when PMMLBuilder is invoked, even though the data frame is transformed correctly in PySpark:

PMMLBuilder(spark, df, pipelineModel).buildFile("export.pmml")

pyspark.sql.utils.IllegalArgumentException: Spark SQL function 'cast(substring(DATE#1153, 1, 4) as double)' (class org.apache.spark.sql.catalyst.expressions.Cast) is not supported

I believe the issue is closely related to the following: https://github.com/jpmml/jpmml-sparkml/issues/66 and https://github.com/jpmml/jpmml-sparkml/issues/62

Kind regards

vruusmann commented 3 years ago

@psxmc6 This issue has been closed with a commit (more than two years ago!), which means that the cast function is conceptually supported.

However, looking into the JPMML-SparkML library code, there is a small restriction: the cast function must be used in a context which supports setting the PMML dataType attribute. This is what's causing problems for you.

vruusmann commented 3 years ago

> the cast function must be used in a context which supports setting the PMML dataType attribute.

Your expression cast(substring(DATE#1153, 1, 4) as double) is parsed so that the substring function becomes the following PMML Apply element:

<Apply function="substring">
  <!-- omitted for brevity -->
</Apply>

Casting would mean setting Apply@dataType=double. However, the Apply element does not define this attribute, and the DMG.org (maintainer of the PMML specification) refuses to add it.

Maybe I'll go my own way, and implement Apply@(x-)dataType attribute as a vendor extension.

psxmc6 commented 3 years ago

Hi Villu,

Thank you for the prompt reply.

Yes, I tried to navigate through the source code, and I think I found the part you are referring to: https://github.com/jpmml/jpmml-sparkml/blob/master/src/main/java/org/jpmml/sparkml/ExpressionTranslator.java#L281

Could you please provide a small example of how to use the CAST function within a SQL statement? I don't fully understand the dataType constraint.

My use case is that I have a date in the string format yyyymmdd, and I would like to extract a component from it and perform a mathematical operation (e.g. multiply by some number) on, let's say, the extracted year.

Would that be possible?

Many thanks

psxmc6 commented 3 years ago

> the cast function must be used in a context which supports setting the PMML dataType attribute.
>
> Your expression cast(substring(DATE#1153, 1, 4) as double) is parsed so that the substring function becomes the following PMML Apply element:
>
> <Apply function="substring">
>   <!-- omitted for brevity -->
> </Apply>
>
> Casting would mean setting Apply@dataType=double. However, the Apply element does not define this attribute, and the DMG.org (maintainer of the PMML specification) refuses to add it.
>
> Maybe I'll go my own way, and implement Apply@(x-)dataType attribute as a vendor extension.

I understand, but what I am really aiming for is to end up with the below structure, where substring's output is implicitly converted to integer via DerivedField which has dataType attribute:

<DerivedField name="derived_DATE_FIELD_year" dataType="integer" optype="continuous">
  <Apply function="substring">
    <FieldRef field="DATE_FIELD"/>
    <Constant dataType="double">1</Constant>
    <Constant dataType="double">4</Constant>
  </Apply>
</DerivedField>

vruusmann commented 3 years ago

> what I am really aiming for is to end up with the below structure, where substring's output is implicitly converted to integer via DerivedField which has dataType attribute

Yes, wrapping the expression into a DerivedField element would be a viable workaround. Viable, but not elegant.

The technical limitation here is that the org.jpmml.sparkml.ExpressionTranslator#translate(org.apache.spark.sql.catalyst.expressions.Expression) method does not keep track of the PMML creation context (in the form of org.jpmml.sparkml.SparkMLEncoder reference), so it cannot define new derived fields.
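To make the limitation concrete, here is a rough sketch of what a context-aware translation could look like, in Python rather than the library's Java. The `Encoder` class is a hypothetical stand-in for the PMML creation context, and `translate_cast` is an invented helper; neither reflects the actual JPMML-SparkML API. The idea is that a translator holding a reference to the context can register a new DerivedField for the cast and hand back a FieldRef to it.

```python
import xml.etree.ElementTree as ET


class Encoder:
    """Hypothetical stand-in for the PMML creation context."""

    def __init__(self):
        self.derived_fields = []

    def create_derived_field(self, name, data_type, expression):
        # Assumption for this sketch: strings are categorical, all else continuous.
        optype = "categorical" if data_type == "string" else "continuous"
        field = ET.Element(
            "DerivedField",
            {"name": name, "optype": optype, "dataType": data_type},
        )
        field.append(expression)
        self.derived_fields.append(field)
        return field


def translate_cast(encoder, name, data_type, expression):
    """Wrap the expression in a new DerivedField and return a reference to it."""
    encoder.create_derived_field(name, data_type, expression)
    return ET.Element("FieldRef", {"field": name})


encoder = Encoder()

# substring(DATE_FIELD, 1, 4): PMML positions are 1-based integers.
substring = ET.Element("Apply", {"function": "substring"})
ET.SubElement(substring, "FieldRef", {"field": "DATE_FIELD"})
ET.SubElement(substring, "Constant", {"dataType": "integer"}).text = "1"
ET.SubElement(substring, "Constant", {"dataType": "integer"}).text = "4"

year_ref = translate_cast(encoder, "derived_DATE_FIELD_year", "integer", substring)
pmml_snippet = ET.tostring(encoder.derived_fields[0], encoding="unicode")
```

Without the `encoder` argument, `translate_cast` would have nowhere to put the new field, which is the situation described above.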

psxmc6 commented 3 years ago

Please correct me if I am wrong, but wouldn't it be sensible if the effect of applying CAST to an expression/variable would be applied to the first supported element?

What I mean is that, in the above case, Apply does not support the dataType attribute, but the enclosing DerivedField does.

This was my intuition behind CAST, I thought that with the following expression:

SELECT
  CAST(SOME_NUMERIC_COLUMN AS STRING) AS NUM_AS_STRING
FROM
__THIS__

would yield the following PMML snippet:

<DerivedField name="NUM_AS_STRING" optype="categorical" dataType="string">
    <FieldRef field="SOME_NUMERIC_COLUMN"/>
</DerivedField>

Is there any alternative way of allowing PMMLBuilder to convert such transformations?

Thanks for your insights

vruusmann commented 3 years ago

> but wouldn't it be sensible if the effect of applying CAST to an expression/variable would be applied to the first supported element?

If the current PMML expression element ("child") does not support the dataType attribute, but this element is contained in another PMML expression element ("parent") that does, then it would be OK to define the data type change there.

However, in the current case, the topmost element is Apply@function="substring".
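The parent/child rule can be sketched as follows, in Python for brevity. The helper name `cast_at_nearest_supported` is hypothetical, and the set of elements is only an illustrative subset of those that define a dataType attribute; this is not how JPMML-SparkML is implemented.

```python
import xml.etree.ElementTree as ET

# Illustrative subset of PMML elements that define a dataType attribute.
# Apply and FieldRef are absent, as noted in this thread.
ELEMENTS_WITH_DATA_TYPE = {"DerivedField", "Constant"}


def cast_at_nearest_supported(root, target, data_type):
    """Set dataType on target, or on its closest ancestor that supports it."""
    # ElementTree has no parent pointers, so build a child-to-parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    element = target
    while element is not None:
        if element.tag in ELEMENTS_WITH_DATA_TYPE:
            element.set("dataType", data_type)
            return element
        element = parents.get(element)
    raise ValueError(
        f"no element at or above '{target.tag}' supports the dataType attribute"
    )


# Case 1: the Apply element has a DerivedField parent, so the cast lands there.
derived = ET.Element("DerivedField", {"name": "year", "optype": "continuous"})
wrapped_apply = ET.SubElement(derived, "Apply", {"function": "substring"})
cast_target = cast_at_nearest_supported(derived, wrapped_apply, "integer")

# Case 2: the Apply element is topmost, so there is nowhere to record the cast.
bare_apply = ET.Element("Apply", {"function": "substring"})
try:
    cast_at_nearest_supported(bare_apply, bare_apply, "double")
    bare_cast_failed = False
except ValueError:
    bare_cast_failed = True
```

Case 2 mirrors the situation in this issue: with Apply as the topmost element, the walk up the tree finds no element that can carry the data type.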

> I thought that with the following expression .. would yield the following PMML snippet

The FieldRef expression element does not support the dataType attribute.

It's kind of stupid to create a DerivedField element for (data-) type casting, when we could have:

<FieldRef field="SOME_NUMERIC_COLUMN" dataType="string"/>

vruusmann commented 3 years ago

@psxmc6 Anyway, you have full access to the JPMML-SparkML source code, so you can change it to do anything you want.