tconbeer / sqlfmt

sqlfmt formats your dbt SQL files so you don't have to
https://sqlfmt.com
Apache License 2.0
394 stars 16 forks source link

Respect case-sensitive json key path extractions #269

Open sgoley opened 1 year ago

sgoley commented 1 year ago

Describe the bug sqlfmt utility automatically formats json keys which can be case sensitive. It will make all keys fully lower case and return nulls as values if the .sql keys do not access the actual case sensitive json keys.

To Reproduce Use sqlfmt on any sql file that includes something like the following key path extractions:

src:salesperson.name <- all lowercase SRC:salesperson.name <- column ref lower case SRC:Salesperson.Name <- json keys have first letter capitals

Expected behavior sqlfmt should be sensitive to adapter specific json extraction notations.

Example: Snowflake "dot" or "bracket" semi-structured notation

Example: Postgres "colon" semi-structured notation

Exception: Bigquery at least appears to have only stringified semi-structured extracts so this adapter should be safe as-is.

tconbeer commented 1 year ago

Thanks for the report, @sgoley !

BigQuery is case-sensitive when accessing JSON properties using dot notation, also (mentioned in this discussion). I don't think I'll be solving the BQ issue, since parsing the names as field/property names instead of table/column names is out of scope as long as I'm the sole maintainer.

BUT if colons are rare enough to make parsing easy (e.g., they don't appear in other, similar contexts), I would consider respecting case sensitivity here. The hardest and most time-consuming part of this will just be researching the colon operator in all of the major dialects and determining how unique its usage as a JSON property accessor is. I know the colon can also be used to slice arrays in some dialects, but that should be easy to disambiguate from this operator (since it will be directly inside of square brackets). I don't know if there are other uses for the colon in SQL.

If you or anyone else could contribute this research, I'd be very grateful.

Until then, the workaround is to quote all json properties that are not snake_case:

src:salesperson.name
"SRC":salesperson.name
"SRC":"Salesperson"."Name"
gregbayer commented 1 year ago

I'm running into this issue in with DBT + Snowflake as well. Example: After formating, parse_json(event_properties):numFieldsFilledIn becomes parse_json(event_properties):numfieldsfilledin which returns null.

tconbeer commented 1 year ago

Thanks. Still looking for help researching the various uses for the colon in different dialects

ryaminal commented 1 year ago

Oh man, i just got burned by this. took me forever to find the problem. This was for snowflake specifically.

I know pyfmt is opinionated without options, but perhaps one of the opinions could be to force quotation marks around the keys e.g. c:"FIELD_ONE" instead of just to lowering them or perhaps force bracket notation e.g. c['FIELD_ONE'] ?

jlucas91 commented 6 months ago

Coming over from https://github.com/tconbeer/sqlfmt/issues/568.

Completely empathize with the challenges of sole maintenance and respect the call here. If I might suggest a documentation change to save others in my shoes similar pain - I would consider documenting the known cases where sqlfmt will break a project. We missed this change when implementing sqlfmt on a portion of our project, which ended up causing a significant outage for us in production (since case changes lead to silent failures). Ultimately the responsibility lies fully on us for not catching it. But given it appears to be a known concern - I might consider listing it in the Beta warning block (or similar) in the docs. We had (incorrectly in retrospect) assumed that given sqlfmt is now the default implementation in DBT cloud that it was fairly unlikely to break a dbt/snowflake project out of the box.

ryaminal commented 6 months ago

@tconbeer , i was looking at this to work on i). after a quick perusal of the code, it appears the casing happens here in node_manager.py.

would a modification to standardize_value in node_manager be where this work is done?

Is one solution for this to add a dialect-specific check for a SEMICOLON token(and others) that either:

to simplify this work, would it be possible/advised that we break this in to dialect-specific implementation? not meaning a different implementation but slowly adding this for snowflake first then BQ or whatever.

my guess is doing snowflake first would solve the problem for the largest amount of users and potentially simplify this implementation.

brianschillaci commented 4 months ago

I just ran into this issue as well. Something like properties:someRandomKey was formatted to properties:somerandomkey which broke the logic.