tobymao / sqlglot

Python SQL Parser and Transpiler
https://sqlglot.com/
MIT License
6.1k stars 611 forks source link

sqlglot seems to remove adjacent comments #712

Closed SudarshanVS closed 1 year ago

SudarshanVS commented 1 year ago

The following line,

Select * from Table /*comment 1*/
/*comment 2*/

becomes,

Select * from Table /*comment 2*/

after we run the parser,

expression_tree = parse_one(query)

This is an unexpected behaviour.

georgesittas commented 1 year ago

Is that correct? I tried to reproduce this and got:

>>> sql = """
... Select * from Table /*comment 1*/
... /*comment 2*/
... """
>>> expr = sqlglot.parse_one(sql)
>>> expr
(SELECT expressions:
  (STAR ), from:
  (FROM expressions:
    (TABLE this:
      (IDENTIFIER this: Table, quoted: False), comment: comment 1)))
>>> expr.sql()
'SELECT * FROM Table /* comment 1 */'
>>> print(expr.sql(pretty=True))
SELECT
  *
FROM Table /* comment 1 */

SQLGlot currently tries to preserve comments in a best-effort basis, because It's not a trivial problem to do a completely lossless transpilation. Basically, comments need to somehow be attached to tokens and then to AST nodes according to a set of heuristics, so in this case the second comment gets discarded during the tokenizing phase because there's no nearby token to attach it to (that is, without complicating the logic further, like attaching several comments to a token etc).

SudarshanVS commented 1 year ago

Hi George,

Thank you for your explanation. :)

I am indeed getting the behaviour as mentioned in my original post.

For the below code

q = "Select * from Table /*comment 1*/ /*comment 2*/"
print(sqlglot.parse_one(q).sql(pretty=True))

I get the following result,

SELECT
  *
FROM Table /* comment 2 */

I am using sqlglot v10.0.0

It is very strange that you are getting a completely different but similar result!!

georgesittas commented 1 year ago

Hello SudarshanVS,

The logic is different if you place the comment on the same line as the previous token, as opposed to placing it on a new line. That's why we get different outputs. However, you have a point, since the tokenizer is expected to not overwrite the previous token's comment if any.

I'll post a fix so that the two cases are consistent and we always keep the first comment.

mpf82 commented 1 year ago

Just my 2 cents: what about token.comment as a list, so you could attach multiple comments in order? The default could still be None, so you don't create a bunch of unused empty lists by default.

georgesittas commented 1 year ago

@mpf82 I'll check if / how this can be implemented a little later, thanks for the suggestion!

georgesittas commented 1 year ago

Just my 2 cents: what about token.comment as a list, so you could attach multiple comments in order? The default could still be None, so you don't create a bunch of unused empty lists by default.

One pain point about this is that we need to figure out how we'd actually format a list of comments in both pretty and normal mode.. For example, let's look at the above query:

SELECT * FROM table /*comment 1*/ /*comment 2*/

If we stored both comments in a list as ["comment 1", "comment2"], then how would we want to print them out? Maybe we could add them side-by-side like shown above (as different comments), but what about the pretty mode? Would we still do that or maybe print each in its own line? Additionally, what would be the logic for attaching multiple comments to a single token? I'm trying to think of a reasonable approach for all of these but I'm not sure yet.

Also, I think some of these examples don't make much sense, so I'm not sure if there's any point in prioritizing fixing them right now. For instance, the user could very well merge the two comments in the original SQL as /* comment1 comment2 */, although I guess an argument could be made here that the list approach would help preserve comments like in the following query, but again I'd expect that multi-line comments would be used more frequently in such cases.

-- comment 1
-- comment 2
-- comment 3
SELECT * FROM foo

Let me know if you have any ideas, happy to discuss this further. We might also want to move this conversation to Discussions.

mpf82 commented 1 year ago

Again, this is just my preference, but I'd say yes, for "normal" mode, side-by-side should be used, so SELECT * FROM table /*comment 1*/ /*comment 2*/ would look the same after parsing and printing.

For pretty mode, it's probably easiest to also output the comments using / ... / notation, that way you do not have to worry about multi-line strings inside comments, the difference to normal mode being, that each comment has a line-break prepended.

Let's talk code.

Initial statement: SELECT * FROM tbl /*line1\nline2\nline3*/ /*another comment*/ where 1=1 -- comment at the end

Current output (10.0.8) ("another comment" being ignored/swallowed)

SELECT * FROM tbl /* line1
line2
line3 */ WHERE 1 = 1 /* comment at the end */

Suggestion normal mode:

SELECT * FROM tbl /* line1
line2
line3 */ /* another comment */ WHERE 1 = 1 /* comment at the end */

Suggestion pretty mode:

SELECT * FROM tbl
/* line1
line2
line3 */
/* another comment */
WHERE 1 = 1
/* comment at the end */

To extend on the idea, you could also remember if the original comment was a block or inline comment, however this would require either a Comment class where you store the inline/block attribute, or you could make an educated guess, based on whether or not the comment includes a \n.

I guess would be the best way, as (in pretty mode) it could print very much the comment as it originally was, but it's the most work.

For normal mode, using always block comments in order is probably best, so

-- comment 1
-- comment 2
-- comment 3
SELECT * FROM foo

would print as

/* comment 1 */ /* comment 2 */ /* comment 3 */ SELECT * FROM foo

with the above mentioned changes, in pretty mode, the output could be the same as the input, as it would remember the inline attribute on the comment object.


By the way, it seems currently comments are not supported "everywhere", e.g.

SELECT /* a */ * FROM /* b */ tbl WHERE /* c */ 1=2"

parses to

(SELECT expressions:
  (STAR ), from:
  (FROM expressions:
    (TABLE this:
      (IDENTIFIER this: tbl, quoted: False))), where:     
  (WHERE this:
    (EQ this:
      (LITERAL this: 1, is_string: False), expression:    
      (LITERAL this: 2, is_string: False))), comment:  a )

and prints as

/* a */ SELECT * FROM tbl WHERE 1 = 2

I didn't have time to look at the code to figure out if there's a reason for this behaviour, just wanted to mention it.