Closed fttt closed 5 years ago
The cast SQL function is not yet implemented.
In the meantime, the SQL-to-PMML translation component should throw a more meaningful exception here (e.g. "function XYZ is not yet implemented").
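The kind of error message suggested above could look roughly like the following. This is only an illustrative sketch in Python, not the JPMML-SparkML code (which is written in Java); the names SQL_TO_PMML_FUNCTIONS and translate_function, and the contents of the mapping, are hypothetical.

```python
# Hypothetical mapping from SQL function names to PMML built-in function names.
# The real translator's coverage is not known from this thread.
SQL_TO_PMML_FUNCTIONS = {
    "substring": "substring",
    "lower": "lowercase",
    "upper": "uppercase",
}


def translate_function(sql_name: str) -> str:
    """Return a placeholder PMML Apply element for a known SQL function."""
    if sql_name not in SQL_TO_PMML_FUNCTIONS:
        # Fail early with a message that names the offending function.
        raise NotImplementedError(f"function {sql_name} is not yet implemented")
    return f'<Apply function="{SQL_TO_PMML_FUNCTIONS[sql_name]}"/>'
```

With this shape, an unsupported call such as translate_function("cast") fails with a message that names the function instead of a generic translation error.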
Hi Villu,
First of all, thank you for such a useful suite of libraries!
I would like to ask whether support for the above-mentioned SQL cast will become available anytime soon. It would greatly expand the Spark SQL layer's data pre-processing capabilities.
Currently, any implicit or explicit cast yields the following error when invoking PMMLBuilder, even though the data frame is transformed correctly in PySpark:
PMMLBuilder(spark, df, pipelineModel).buildFile("export.pmml")
pyspark.sql.utils.IllegalArgumentException: Spark SQL function 'cast(substring(DATE#1153, 1, 4) as double)' (class org.apache.spark.sql.catalyst.expressions.Cast) is not supported
I believe the issue is closely related to the following: https://github.com/jpmml/jpmml-sparkml/issues/66 and https://github.com/jpmml/jpmml-sparkml/issues/62
Kind regards
@psxmc6 This issue has been closed with a commit (more than two years ago!), which means that the cast function is conceptually supported.
However, looking into the JPMML-SparkML library code, there's a small restriction: the cast function must be used in a context which supports setting the PMML dataType attribute. This is what's causing problems for you.
the cast function must be used in a context which supports setting the PMML dataType attribute.
Your expression cast(substring(DATE#1153, 1, 4) as double) is parsed so that the substring function becomes the following PMML Apply element:
<Apply function="substring">
<!-- omitted for brevity -->
</Apply>
Casting would mean setting Apply@dataType=double. However, the Apply element does not define this attribute, and the DMG.org (maintainer of the PMML specification) refuses to add it.
Maybe I'll go my own way, and implement Apply@(x-)dataType attribute as a vendor extension.
Hi Villu,
Thank you for the prompt reply.
Yes, I tried to navigate through the source code, and I think I have seen the part you are referring to: https://github.com/jpmml/jpmml-sparkml/blob/master/src/main/java/org/jpmml/sparkml/ExpressionTranslator.java#L281
Could you please provide a small example of how to use the CAST function within an SQL statement? I don't fully understand the dataType constraint.
My use case is that I have a date in the string format yyyymmdd, and I would like to extract a component from it and perform a mathematical operation (e.g. multiply by some number) on, let's say, the extracted year.
Would that be possible?
Many thanks
the cast function must be used in a context which supports setting the PMML dataType attribute.
Your expression cast(substring(DATE#1153, 1, 4) as double) is parsed so that the substring function becomes the following PMML Apply element:
<Apply function="substring">
<!-- omitted for brevity -->
</Apply>
Casting would mean setting Apply@dataType=double. However, the Apply element does not define this attribute, and the DMG.org (maintainer of the PMML specification) refuses to add it.
Maybe I'll go my own way, and implement Apply@(x-)dataType attribute as a vendor extension.
I understand, but what I am really aiming for is to end up with the below structure, where substring's output is implicitly converted to integer via DerivedField which has dataType attribute:
<DerivedField name="derived_DATE_FIELD_year" dataType="integer" optype="continuous">
<Apply function="substring">
<FieldRef field="DATE_FIELD"/>
<Constant dataType="double">1</Constant>
<Constant dataType="double">4</Constant>
</Apply>
</DerivedField>
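For illustration, a DerivedField of this shape can be assembled with Python's standard xml.etree.ElementTree module. This is a minimal sketch, not something JPMML-SparkML produces; the helper name derived_substring_field is hypothetical, and the constants are typed as integer here on the assumption that substring positions should be integers (the snippet above uses double).

```python
import xml.etree.ElementTree as ET


def derived_substring_field(name, source_field, start, length, data_type):
    """Build a DerivedField that wraps Apply@function="substring".

    The data type conversion is expressed on the DerivedField element,
    because the Apply element has no dataType attribute.
    """
    derived = ET.Element(
        "DerivedField",
        {"name": name, "dataType": data_type, "optype": "continuous"},
    )
    apply = ET.SubElement(derived, "Apply", {"function": "substring"})
    ET.SubElement(apply, "FieldRef", {"field": source_field})
    # Assumption: start position and length are emitted as integer constants.
    for value in (start, length):
        constant = ET.SubElement(apply, "Constant", {"dataType": "integer"})
        constant.text = str(value)
    return derived


field = derived_substring_field(
    "derived_DATE_FIELD_year", "DATE_FIELD", 1, 4, "integer"
)
print(ET.tostring(field, encoding="unicode"))
```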
what I am really aiming for is to end up with the below structure, where substring's output is implicitly converted to integer via DerivedField which has dataType attribute
Yes, wrapping the expression into a DerivedField element would be a viable workaround. Viable, but not elegant.
The technical limitation here is that the org.jpmml.sparkml.ExpressionTranslator#translate(org.apache.spark.sql.catalyst.expressions.Expression) method does not keep track of the PMML creation context (in the form of org.jpmml.sparkml.SparkMLEncoder reference), so it cannot define new derived fields.
Please correct me if I am wrong, but wouldn't it be sensible if the effect of applying CAST to an expression/variable would be applied to the first supported element?
What I mean is that, in the above case, Apply does not support the dataType attribute, but the outermost DerivedField does.
This was my intuition behind CAST. I thought that the following expression:
SELECT
CAST(SOME_NUMERIC_COLUMN AS STRING) AS NUM_AS_STRING
FROM
__THIS__
would yield the following PMML snippet:
<DerivedField name="NUM_AS_STRING" optype="categorical" dataType="string">
<FieldRef field="SOME_NUMERIC_COLUMN"/>
</DerivedField>
Is there any alternative way of allowing PMMLBuilder to convert such transformations?
Thanks for your insights
but wouldn't it be sensible if the effect of applying CAST to an expression/variable would be applied to the first supported element?
If the current PMML expression element ("child") does not support the dataType attribute, but this element is contained in another PMML expression element ("parent") that does, then it would be OK to define the data type change there.
However, in the current case, the topmost element is Apply@function="substring".
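The child/parent rule described here can be sketched in a few lines of Python. This is an illustration of the idea only, not the JPMML-SparkML implementation; the function apply_cast and the SUPPORTS_DATA_TYPE set are hypothetical, and the set is a deliberately partial list of PMML elements that define a dataType attribute.

```python
import xml.etree.ElementTree as ET

# Assumption: partial list of PMML elements that define a dataType attribute.
# Apply and FieldRef are absent because they do not define it.
SUPPORTS_DATA_TYPE = {"DerivedField", "Constant"}


def apply_cast(child, parent, data_type):
    """Record a cast on the child, or failing that on its direct parent.

    Returns the element that received the dataType attribute, or None
    when neither element can carry it.
    """
    if child.tag in SUPPORTS_DATA_TYPE:
        child.set("dataType", data_type)
        return child
    if parent is not None and parent.tag in SUPPORTS_DATA_TYPE:
        parent.set("dataType", data_type)
        return parent
    return None


# Case from this thread: Apply is the topmost element, so there is no parent.
topmost = ET.Element("Apply", {"function": "substring"})
unresolved = apply_cast(topmost, None, "double")

# Same Apply wrapped in a DerivedField: the parent can carry the data type.
derived = ET.Element("DerivedField", {"name": "year", "optype": "continuous"})
wrapped = ET.SubElement(derived, "Apply", {"function": "substring"})
resolved = apply_cast(wrapped, derived, "double")
```

In the first case the cast has nowhere to go, which matches the error reported above; in the second it lands on the enclosing DerivedField.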
I thought that with the following expression .. would yield the following PMML snippet
The FieldRef expression element does not support the dataType attribute.
It's kind of stupid to create a DerivedField element for (data-) type casting, when we could have:
<FieldRef field="SOME_NUMERIC_COLUMN" dataType="string"/>
@psxmc6 Anyway, you have full access to the JPMML-SparkML source code, so you can change it to do anything you want.
Version info: Spark 2.4.0, jpmml 1.5.0
I want to change a column type in a PipelineModel. It succeeds in the PipelineModel, but not in the PMML build. Can anyone help me? Thanks!
When I run the code:
it returns an error: