sql-formatter-org / sql-formatter

A whitespace formatter for different query languages
https://sql-formatter-org.github.io/sql-formatter/
MIT License
2.37k stars 405 forks source link

Error when formatting Spark SQL #746

Open liyuanhao6 opened 5 months ago

liyuanhao6 commented 5 months ago

First Bug

set livy.session.conf.spark.driver.maxResultSize=8g;
set livy.session.conf.spark.driver.memory=12g;
set livy.session.conf.spark.driver.memoryOverhead=8g;
set livy.session.conf.spark.kryoserializer.buffer.max=512m;

Unable to format SQL: Error: Parse error: Unexpected "8g; set li" at line 1 column 50

We usually use the Spark configurations in the SQL codes, and it reports the bug. We suppose those configurations are useful, and could you please help us fix this bug?

Second Bug

WHERE
    dayno = ${v_day}

Unable to format SQL: Error: Parse error: Unexpected "${v_day} )" at line 112 column 17

We usually use shell parameter in the SQL codes, and it reports the bug. We suppose those parameters are useful, and could you please help us fix this bug?

nene commented 5 months ago

Thanks for reporting.

amadeuspzs commented 4 months ago

@liyuanhao6 if you are using VSCode try:

  "SQL-Formatter-VSCode.paramTypes": {
    "custom": [
      {
        "regex": "\\$?\\{[a-zA-Z0-9_]+\\}"
      }
    ]
  },

for the second problem.

palinkasnorbert commented 2 weeks ago

@amadeuspzs Thank you for the fix!

Interesting though, that I could format my sparksql code without this fix in the settings. However after formatting our special param syntax, I'm getting an unnecessary whitespace after the last } as example below:

--- code before formatting:

create or replace view ${unity_catalog_name}.${environment_name}_sublayer.tablename 

--- code after formatting:

create or replace view ${unity_catalog_name}.${environment_name} _sublayer.tablename -- whitespace appears after closing '}'
nene commented 2 weeks ago

Yeah, this ${...} syntax is actually supported by default.

However the formatter assumes that ${...} constitutes some kind of identifier. If it works as just a plain string substitution, then all bets are off. In that case I guess you could also have other SQL syntax as part of these substitutions, in which case the formatter would have no idea how it should format it:

CREATE ${orReplace} ${entity} ${type};

But really I know very little about Spark. I'm not even sure if this ${...} syntax is part of Apache Spark itself or comes from some other tool.

amadeuspzs commented 1 week ago

We're using ${...} as a plain string substitution in SparkSQL cells in Azure Synapse Analytics, so yes, it acts as a substitution for any string.

The additional whitespace looks like a result of nene's explanation that the formatter assumes it's an identifier, rather than a general purpose string replacement.

This is why whitespace isn't introduced after a . e.g.

${one}.${two}

whereas

${one}_${two}

becomes

${one} _ ${two}

One workaround would be to use/derive a "whole word" parameter instead of partial one e.g.

spark.conf.set("environment_name_sublayer", f"{environment_name}_sublayer")

followed by

create or replace view ${unity_catalog_name}.${environment_name_sublayer}.tablename 

which yields

CREATE    OR REPLACE VIEW ${unity_catalog_name}.${environment_name_sublayer}.tablename