Bug: excessive alias parsing in presto dialect

evyasonov commented 1 year ago

Input

SELECT 1a

Current behaviour: parse a as alias Expected behaviour: should rise an error, the query is not valid

georgesittas commented 1 year ago

SQLGlot is very lenient when it comes to parsing errors by design. This is so that it can easily parse as much valid SQL as possible, so I don't think this qualifies as a bug.

@tobymao do you think this should be treated as a tokenizer error? Postgres would treat a as "junk after numeric value", hence it'd fail at the tokenizing phase. I'm not sure if we need this behavior for any weird syntax, though.

evyasonov commented 1 year ago

@GeorgeSittas I believe that this is bug, because SQLGlot in this scenario does not return equivalent code.

sqlglot.transpile("SELECT 1a", read="presto", write="presto")[0] returns SELECT 1 AS a which is not equivalent. This is if I run SELECT 1a and SELECT 1 AS a, presto will return different outputs. While it should return the same.

This is behaviour that a user of SQLGlot does not expect.

So this is the bug that is needed to be fixed.

georgesittas commented 1 year ago

Hey,

presto will return different outputs.

It's still invalid SQL in Presto though, correct? What I meant is that we don't generally prioritize being robust when it comes to detecting errors, because being lenient means that the codebase is more comprehensive.

Alternatively: SQLGlot's contract is that it will try its best to parse / transpile everything that's valid SQL, but doesn't promise anything about invalid inputs.

I may be wrong here, but fixing this will probably entail changing a core part of the tokenizer's logic, and I'm not sure if it's something worth doing at the moment. I'm happy to tackle this if Toby agrees, though.

evyasonov commented 1 year ago

Hey,

It's still invalid SQL in Presto though, correct?

Yes, it's correct. And currently SQLGlot changes invalid sql into valid, but different sql. Which is unexpected and might cause problems on users' side. Because users use such a great tool as SQLGlot in a different ways as it's written in the documentation:

SQLGlot is a no dependency Python SQL parser, transpiler, optimizer, and engine. It can be used to format SQL or translate between 18 different dialects like DuckDB, Presto, Spark, Snowflake, and BigQuery. It aims to read a wide variety of SQL inputs and output syntactically correct SQL in the targeted dialects.

I use it to format sql which might contain errors and in case of incorrect sql I want to show an error and not change user's input...

Regarding the contract -- it's one of the possible option, but in this case the contract should be written in the header of documentation formatted with capslock and red bold font, so users expect unexpected behaviour in case of invalid sql.

@tobymao , what do you think?

P.S. by the way, documentation says that

Syntax errors are highlighted and dialect incompatibilities can warn or raise depending on configurations.

but in case of SELECT 1a it's not true.

P.P.S. Yeap, it's written that

A syntax error will result in a parser error:

but in case of SELECT 1a it's not true.

georgesittas commented 1 year ago

Yes, I see what you mean. Maybe we should emphasize the point about errors more in the README. By the way, for what it's worth, doing SELECT 1s is valid in both DuckDB and SparkSQL.

In any case, thanks for taking the time to report this.

evyasonov commented 1 year ago

Should the issue be reopened? Because it's not completed and no actions were done yet. I don't want it to be lost

tobymao commented 1 year ago

no it won’t be fixed

evyasonov commented 1 year ago

@tobymao Okay... And what about telling that "sqlglot might change behaviour of your code in case sql is not valid" with red bold capslock in documentation?

tobymao / sqlglot

Bug: excessive alias parsing in presto dialect #912