jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

SQLTransformerConverter bug #53

Closed alex-krash closed 5 years ago

alex-krash commented 5 years ago

Hello! I hit a bug when using SQLTransformer (trying to implement engineered features with it):

        SQLTransformer sqlTrans = new SQLTransformer().setStatement(
                "SELECT *, (a / b) as eng FROM __THIS__");

An error:

Exception in thread "main" org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'a
    at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
    at org.jpmml.sparkml.ExpressionTranslator.translate(ExpressionTranslator.java:125)
    at org.jpmml.sparkml.feature.SQLTransformerConverter.encodeFeatures(SQLTransformerConverter.java:122)
    at org.jpmml.sparkml.feature.SQLTransformerConverter.registerFeatures(SQLTransformerConverter.java:173)
    at org.jpmml.sparkml.PMMLBuilder.build(PMMLBuilder.java:112)
    at org.jpmml.sparkml.PMMLBuilder.buildFile(PMMLBuilder.java:290)
    at Main.main(Main.java:77)

Spark 2.3.0 (I also tried 2.3.2, with the same result). A full example is here: https://gist.github.com/alex-krash/46ec1947e9ea4f0d7ce9acb63c512e09 Is there a workaround for dealing with this?

vruusmann commented 5 years ago

Support for the SQLTransformer class was introduced very recently, so some bumps in the road are to be expected.

This issue is specifically about "validating" data types in arithmetic expressions. It's probably safe to comment out/remove this validation logic altogether. However, the right thing to do would be to perform the "resolution" of the parsed Apache Spark SQL statement.

I couldn't find the right Apache Spark API/entry point for that. Basically, I expect there to be some org.apache.spark.sql.catalyst.plans.logical.LogicalPlan#resolveAll(StructType) method that visits the AST and adds data type information to individual AST nodes based on the data schema.

@alex-krash Do you know how to turn a "raw" LogicalPlan object into a "resolved" LogicalPlan object?

alex-krash commented 5 years ago

> Do you know how to turn a "raw" LogicalPlan object into a "resolved" LogicalPlan object?

@vruusmann, I am trying to figure out how resolution could be implemented, but no luck yet :( I think the same error will occur when calling dataType on any Unresolved* instance.
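The internal entry points seem to be spark.sessionState.sqlParser and spark.sessionState.analyzer. A rough, untested sketch of what I mean (assuming Spark 2.3 internals; the helper object and temp view name are made up for illustration):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.types.StructType

object ResolvePlanSketch {

  // Parses a SQL statement and runs Spark's analyzer over it, so that
  // UnresolvedAttribute nodes are replaced with resolved attributes that
  // know their dataType. The schema is made visible to the analyzer by
  // registering an empty temp view with that schema, mimicking what
  // SQLTransformer.transformSchema does internally.
  def resolve(spark: SparkSession, statement: String, schema: StructType): LogicalPlan = {
    val tableName = "resolve_tmp_input"  // made-up temp view name
    val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
    emptyDF.createOrReplaceTempView(tableName)
    try {
      val realStatement = statement.replace("__THIS__", tableName)
      val parsed = spark.sessionState.sqlParser.parsePlan(realStatement)
      // Analyzer extends RuleExecutor[LogicalPlan], so execute() returns
      // a plan whose attribute references carry data type information
      spark.sessionState.analyzer.execute(parsed)
    } finally {
      spark.catalog.dropTempView(tableName)
    }
  }
}
```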

alex-krash commented 5 years ago

It looks like the analyzed plan is available within SQLTransformer itself (see transformSchema below), but I don't have an elegant way to get at it for now.

  @Since("1.6.0")
  override def transformSchema(schema: StructType): StructType = {
    val spark = SparkSession.builder().getOrCreate()
    val dummyRDD = spark.sparkContext.parallelize(Seq(Row.empty))
    val dummyDF = spark.createDataFrame(dummyRDD, schema)
    val tableName = Identifiable.randomUID(uid)
    val realStatement = $(statement).replace(tableIdentifier, tableName)
    dummyDF.createOrReplaceTempView(tableName)
    val outputSchema = spark.sql(realStatement).schema
    //  spark.sql(realStatement).queryExecution.analyzed -- here is an analyzed plan
    spark.catalog.dropTempView(tableName)
    outputSchema
  }
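The commented-out line above is probably the simplest hook: run the statement against an empty DataFrame and take the analyzed plan off the QueryExecution. A rough sketch along those lines (untested; the helper name and temp view name are made up):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.types.StructType

// Hypothetical helper: evaluates a SQLTransformer statement against an
// empty DataFrame with the input schema and returns the analyzed
// (i.e. fully resolved) logical plan instead of just the output schema.
def analyzedPlan(spark: SparkSession, statement: String, schema: StructType): LogicalPlan = {
  val tableName = "sqltransformer_input"  // made-up temp view name
  val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  emptyDF.createOrReplaceTempView(tableName)
  try {
    val realStatement = statement.replace("__THIS__", tableName)
    // queryExecution.analyzed is the plan after Spark's analyzer has run,
    // so UnresolvedAttribute nodes are resolved and dataType is safe to call
    spark.sql(realStatement).queryExecution.analyzed
  } finally {
    spark.catalog.dropTempView(tableName)
  }
}
```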

alex-krash commented 5 years ago

#54 will add support here