Goal: ensure that Spark SQL can operate in a strict ANSI SQL mode.
SQL dialects appear in two places:
one is the dialect of SQL that Spark SQL itself supports (the one we care most about)
the second is the dialect of SQL that the JDBC data source supports, which governs both DataFrames created from a JDBC connection to a database and saving back to that database (we are less interested in this dialect; a sketch of how it is customized follows this list)
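For concreteness, this second dialect is already user-pluggable via Spark's JdbcDialect API. A minimal sketch (the jdbc:mydb URL scheme and the dialect object are made up for illustration):

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical dialect for a made-up "mydb" database: quotes identifiers
// ANSI-style with double quotes (built-in dialects vary; MySQL's uses backticks).
object AnsiQuotingDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mydb")
  override def quoteIdentifier(colName: String): String = "\"" + colName + "\""
}

// Once registered, Spark consults this dialect when reading from or writing
// to any JDBC URL that canHandle accepts.
JdbcDialects.registerDialect(AnsiQuotingDialect)
```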
For the Spark SQL dialect support, there's some historical context to be aware of:
Pluggable dialect support was added to Spark in SPARK-5213 and shipped in Spark 1.4.0 (May 2015)
An alternative implementation was proposed during that process and abandoned; see SPARK-6200
Pluggable dialects remained in Spark from 1.4.0 through the end of the 1.6.x branch (the removed mechanism is sketched after this history)
For Spark 2.0.0 the pluggable dialect feature was removed in SPARK-12855, a subtask of the parser-unification umbrella SPARK-12362
That unification replaced the long-supported HiveQL grammar and a simple Scala parser-combinator grammar with a single reimplemented parser. It mostly aimed to eliminate the HiveQL dependency (which brought in a gnarly dependency tree) rather than to ensure strict ANSI SQL parsing
Before removing pluggability, Reynold checked in with the two known users of the pluggable parser feature: Cheng Hao from Intel, who contributed the initial support (SPARK-5213) and no longer needed it, and Fei Wang, who contributed the alternative implementation (SPARK-6200). In the resulting discussion on https://github.com/apache/spark/pull/5827#issuecomment-172147089, Fei's primary reason for wanting pluggability was to "support ANSI tpcds sql".
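For the record, the removed mechanism worked roughly as sketched below. This is from memory of the 1.4-era API (the ParserDialect class and the spark.sql.dialect conf key are as I recall them; StrictAnsiDialect and the ANSI parser behind it are hypothetical), so verify against the exact 1.x branch:

```scala
import org.apache.spark.sql.catalyst.ParserDialect
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical dialect: a real one would delegate to an ANSI-compliant parser.
class StrictAnsiDialect extends ParserDialect {
  override def parse(sqlText: String): LogicalPlan =
    throw new UnsupportedOperationException("plug an ANSI parser in here")
}

// SQLContext instantiated whatever fully-qualified class name the
// spark.sql.dialect conf named, e.g.:
// sqlContext.setConf("spark.sql.dialect", "com.example.StrictAnsiDialect")
```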
Now Srinath from Databricks has included parser pluggability as point 5 of his list of hooks and extension points for Spark in SPARK-18127 (a sketch of what such a hook might look like follows below)
I just filed SPARK-18499 as a subtask of the SPARK-18127 umbrella task and registered my interest in it on the ticket.
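If parser pluggability does come back through SPARK-18127, wiring in a strict ANSI parser could look something like the sketch below. The shape shown (a withExtensions hook on the session builder, with injectParser wrapping a delegate ParserInterface) is a guess at the extension-point design under discussion, not a committed API; the hook here is a no-op that returns the delegate unchanged, where a real strict mode would wrap it in a ParserInterface that rejects non-ANSI constructs:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions { extensions =>
    // No-op hook: a strict ANSI mode would return a wrapper around `delegate`
    // that validates or rewrites each statement before delegating to it.
    extensions.injectParser { (_, delegate) => delegate }
  }
  .getOrCreate()
```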