jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0
267 stars 80 forks source link

Add support for WITH AS/nested SELECT clause in SQLTransformer #114

Closed psxmc6 closed 3 years ago

psxmc6 commented 3 years ago

Hi Villu,

It is often the case that for preprocessing purposes, variables might need to be derived using some other already derived columns. Nevertheless, it is impossible to reference an alias in SELECT clause e.g.:

Won't work

SELECT
    INT(SUBSTRING(COL0, 1, 4)) AS DERIVED_VARIABLE_1,
    COL1 / COL2 AS DERIVED_VARIABLE_2,
    DERIVED_VARIABLE_1*DERIVED_VARIABLE_2 AS DERIVED_VARIABLE_3
FROM
    __THIS__ 

As an alternative, the following SQL will work and transform data frame giving three derived columns:

sql = """
        WITH preprocessing_1 AS (
            SELECT
              INT(SUBSTRING(COL0, 1, 4)) AS DERIVED_VARIABLE_1,
              COL1 / COL2 AS DERIVED_VARIABLE_2
            FROM
            __THIS__ 
        ) 
        SELECT
            *,
            DERIVED_VARIABLE_1*DERIVED_VARIABLE_2 AS DERIVED_VARIABLE_3
        FROM
            preprocessing_1
       """

sqlTransformer = SQLTransformer(statement = sql)
pipeline = Pipeline(stages = [sqlTransformer])
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)

Desired PMML output would map to the following snippet:

<DerivedField name="DERIVED_VARIABLE_1" optype="continuous" dataType="integer">
   <Apply function="substring">
          <FieldRef field="COL0"/>
          <Constant dataType="double">1</Constant>
          <Constant dataType="double">4</Constant>
   </Apply>
</DerivedField>

<DerivedField name="DERIVED_VARIABLE_2" optype="continuous" dataType="double">
   <Apply function="/">
          <FieldRef field="COL1"/>
          <FieldRef field="COL1"/>
   </Apply>
</DerivedField>

<DerivedField name="DERIVED_VARIABLE_3" optype="continuous" dataType="integer">
   <Apply function="*">
          <FieldRef field="DERIVED_VARIABLE_1"/>
          <FieldRef field="DERIVED_VARIABLE_2"/>
   </Apply>
</DerivedField>

Same could also be achieved with nested SELECT clause:

SELECT
   *,
   DERIVED_VARIABLE_1*DERIVED_VARIABLE_2 AS DERIVED_VARIABLE_3
FROM
(
   SELECT
      INT(SUBSTRING(COL0, 1, 4)) AS DERIVED_VARIABLE_1,
      COL1 / COL2 AS DERIVED_VARIABLE_2
   FROM
      __THIS__
)

PMMLBuilder does not seem to accept nested WITH/nested SELECT clause queries. The ability to create DerivedField elements and reference these in other derived columns would be very helpful from the feature engineering point of view.

Please let me know if there is an alternative approach to this issue.

Kind regards

psxmc6 commented 3 years ago

@vruusmann following up on my previous post, could you please reflect on the complete example shown below:

from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer
from pyspark2pmml import PMMLBuilder

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("SQL to PMML") \
    .config("spark.master", "local") \
    .config("spark.jars", "jpmml-sparkml-executable-1.7.1.jar") \
    .getOrCreate()

df = spark.read.option("delimiter", ";").csv("winequality-red.csv", header=True, inferSchema=True)

statement = """
            SELECT 
               *,
               sqrt_test*100 AS subderived_field
            FROM
                (
                    SELECT 
                    *,
                    SQRT(`free sulfur dioxide`) AS `sqrt_test`
                    FROM
                    __THIS__
                )
            """

# SUCCESS: returns data frame with sqrt_test and subderived_field columns appended
sqlTransformer = SQLTransformer(statement=statement)
pipeline = Pipeline(stages=[sqlTransformer])
pipelineModel = pipeline.fit(df).transform(df)
pipelineModel.show()

# FAIL
sqlTransformer = SQLTransformer(statement=statement)
pipeline = Pipeline(stages=[sqlTransformer])
pipelineModel = pipeline.fit(df)

PMMLBuilder(spark, df, pipelineModel) \
    .buildFile("{}WineQuality.pmml".format("red"))

Error:

> Traceback (most recent call last):
>   File "C:\Users\psxmc6\PycharmProjects\pythonProject\main.py", line 34, in <module>
>     PMMLBuilder(spark, df, pipelineModel) \
>   File "C:\Users\psxmc6\PycharmProjects\pythonProject\venv\lib\site-packages\pyspark2pmml\__init__.py", line 27, in buildFile
>     javaFile = self.javaPmmlBuilder.buildFile(javaFile)
>   File "C:\Users\psxmc6\PycharmProjects\pythonProject\venv\lib\site-packages\py4j\java_gateway.py", line 1304, in __call__
>     return_value = get_return_value(
>   File "C:\Users\Maciej\PycharmProjects\pythonProject\venv\lib\site-packages\pyspark\sql\utils.py", line 117, in deco
>     raise converted from None
>
> pyspark.sql.utils.IllegalArgumentException: sqrt_test does not exist. Available: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality
vruusmann commented 3 years ago

@psxmc6 I've got alias expressions and SQL subqueries working in my computer. Will publish soon.

Thanks for the actionable code examples!

psxmc6 commented 3 years ago

Hi Villu, thank you very much for the prompt commit!

Kind regards

vruusmann commented 3 years ago

@psxmc6 Have you tested the new release yet?

Here are my test SQL statements: https://github.com/jpmml/jpmml-sparkml/blob/1.7.2/src/test/java/org/jpmml/sparkml/feature/SQLTransformerConverterTest.java#L48-L67

Can you break the current implementation with real-life SQL statements?

psxmc6 commented 3 years ago

I am still experimenting. One thing I noted is that CAST is now resulting in an additional DerivedField, but I suppose this is the expected behaviour:

select
  *,
  length_ratio*2 AS length_ratio_doubled
from
  (
    select
      *,
      int(sepal_length/petal_length) as length_ratio
    from
      __THIS__
  )

yields

<TransformationDictionary>
  <DerivedField name="integer((sepal_length#178 / petal_length#176))" optype="continuous" dataType="integer">
      <Apply function="/">
          <FieldRef field="sepal_length"/>
          <FieldRef field="petal_length"/>
      </Apply>
  </DerivedField>
  <DerivedField name="length_ratio" optype="continuous" dataType="integer">
      <FieldRef field="integer((sepal_length#178 / petal_length#176))"/> # HERE
  </DerivedField>
  <DerivedField name="length_ratio_doubled" optype="continuous" dataType="integer">
      <Apply function="*">
          <FieldRef field="length_ratio"/>
          <Constant dataType="integer">2</Constant>
      </Apply>
  </DerivedField>
</TransformationDictionary>  

@vruusmann, please note that int() cast has an effect on both DerivedFields and sets dataType="integer" on:

  1. DerivedField name="integer((sepal_length#178 / petal_length#176))"
  2. DerivedField name="length_ratio"

If the 1) would be called length_ratio according to name passed after AS keyword, then 2) could be removed?

Kind regards

vruusmann commented 3 years ago

One thing I noted is that CAST is now resulting in an additional DerivedField, but I suppose this is the required behaviour:

This looks like an opportunity for further optimization.

The int(sepal_length/petal_length) as length_ratio expression should be generating one properly named and properly typed DerivedField element, not two:

<DerivedField name="length_ratio" optype="continuous" dataType="string">
    <Apply function="/">
        <FieldRef field="sepal_length"/>
        <FieldRef field="petal_length"/>
    </Apply>
</DerivedField>

The integer((sepal_length#178 / petal_length#176)) is a "temporary" expression, that cannot be used or referenced by other parts of the SQL statement. Therefore, there really is no need for it to exist separately.

psxmc6 commented 3 years ago

I thought it was a workaround for the issue with lack of dataType attribute for Apply object. Although this temporary field doesn't affect PMML evaluation, I agree that having a clean file without redundant entries would make it more readable and smaller.