Closed psxmc6 closed 3 years ago
@vruusmann following up on my previous post, could you please reflect on the complete example shown below:
from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer
from pyspark2pmml import PMMLBuilder
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("SQL to PMML") \
.config("spark.master", "local") \
.config("spark.jars", "jpmml-sparkml-executable-1.7.1.jar") \
.getOrCreate()
df = spark.read.option("delimiter", ";").csv("winequality-red.csv", header=True, inferSchema=True)
statement = """
SELECT
*,
sqrt_test*100 AS subderived_field
FROM
(
SELECT
*,
SQRT(`free sulfur dioxide`) AS `sqrt_test`
FROM
__THIS__
)
"""
# SUCCESS: returns data frame with sqrt_test and subderived_field columns appended
sqlTransformer = SQLTransformer(statement=statement)
pipeline = Pipeline(stages=[sqlTransformer])
pipelineModel = pipeline.fit(df).transform(df)
pipelineModel.show()
# FAIL
sqlTransformer = SQLTransformer(statement=statement)
pipeline = Pipeline(stages=[sqlTransformer])
pipelineModel = pipeline.fit(df)
PMMLBuilder(spark, df, pipelineModel) \
.buildFile("{}WineQuality.pmml".format("red"))
Error:
> Traceback (most recent call last):
> File "C:\Users\psxmc6\PycharmProjects\pythonProject\main.py", line 34, in <module>
> PMMLBuilder(spark, df, pipelineModel) \
> File "C:\Users\psxmc6\PycharmProjects\pythonProject\venv\lib\site-packages\pyspark2pmml\__init__.py", line 27, in buildFile
> javaFile = self.javaPmmlBuilder.buildFile(javaFile)
> File "C:\Users\psxmc6\PycharmProjects\pythonProject\venv\lib\site-packages\py4j\java_gateway.py", line 1304, in __call__
> return_value = get_return_value(
> File "C:\Users\Maciej\PycharmProjects\pythonProject\venv\lib\site-packages\pyspark\sql\utils.py", line 117, in deco
> raise converted from None
>
> pyspark.sql.utils.IllegalArgumentException: sqrt_test does not exist. Available: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality
@psxmc6 I've got alias expressions and SQL subqueries working in my computer. Will publish soon.
Thanks for the actionable code examples!
Hi Villu, thank you very much for the prompt commit!
Kind regards
@psxmc6 Have you tested the new release yet?
Here are my test SQL statements: https://github.com/jpmml/jpmml-sparkml/blob/1.7.2/src/test/java/org/jpmml/sparkml/feature/SQLTransformerConverterTest.java#L48-L67
Can you break the current implementation with real-life SQL statements?
I am still experimenting. One thing I noted is that CAST is now resulting in an additional DerivedField, but I suppose this is the expected behaviour:
select
*,
length_ratio*2 AS length_ratio_doubled
from
(
select
*,
int(sepal_length/petal_length) as length_ratio
from
__THIS__
)
yields
<TransformationDictionary>
<DerivedField name="integer((sepal_length#178 / petal_length#176))" optype="continuous" dataType="integer">
<Apply function="/">
<FieldRef field="sepal_length"/>
<FieldRef field="petal_length"/>
</Apply>
</DerivedField>
<DerivedField name="length_ratio" optype="continuous" dataType="integer">
<FieldRef field="integer((sepal_length#178 / petal_length#176))"/> # HERE
</DerivedField>
<DerivedField name="length_ratio_doubled" optype="continuous" dataType="integer">
<Apply function="*">
<FieldRef field="length_ratio"/>
<Constant dataType="integer">2</Constant>
</Apply>
</DerivedField>
</TransformationDictionary>
@vruusmann, please note that int() cast has an effect on both DerivedFields and sets dataType="integer" on:
If the 1) would be called length_ratio according to name passed after AS keyword, then 2) could be removed?
Kind regards
One thing I noted is that CAST is now resulting in an additional DerivedField, but I suppose this is the required behaviour:
This looks like an opportunity for further optimization.
The int(sepal_length/petal_length) as length_ratio
expression should be generating one properly named and properly typed DerivedField
element, not two:
<DerivedField name="length_ratio" optype="continuous" dataType="string">
<Apply function="/">
<FieldRef field="sepal_length"/>
<FieldRef field="petal_length"/>
</Apply>
</DerivedField>
The integer((sepal_length#178 / petal_length#176))
is a "temporary" expression, that cannot be used or referenced by other parts of the SQL statement. Therefore, there really is no need for it to exist separately.
I thought it was a workaround for the issue with lack of dataType attribute for Apply object. Although this temporary field doesn't affect PMML evaluation, I agree that having a clean file without redundant entries would make it more readable and smaller.
Hi Villu,
It is often the case that for preprocessing purposes, variables might need to be derived using some other already derived columns. Nevertheless, it is impossible to reference an alias in SELECT clause e.g.:
Won't work
As an alternative, the following SQL will work and transform data frame giving three derived columns:
Desired PMML output would map to the following snippet:
Same could also be achieved with nested SELECT clause:
PMMLBuilder does not seem to accept nested WITH/nested SELECT clause queries. The ability to create DerivedField elements and reference these in other derived columns would be very helpful from the feature engineering point of view.
Please let me know if there is an alternative approach to this issue.
Kind regards