ivoa-std / ADQL

Astronomical Data Query Language Standard
https://wiki.ivoa.net/twiki/bin/view/IVOA/ADQL
Creative Commons Attribution Share Alike 4.0 International
7 stars 7 forks source link

Fix set operators: UNION, EXCEPT and INTERSECT #42

Closed gmantele closed 3 years ago

gmantele commented 4 years ago

A proposal to fix the grammar of the set operators and their description.

Resolves #20

gmantele commented 4 years ago

The grammar fix proposed here should allow the following ADQL queries:

I tested their SQL translation with PostgreSQL, MySQL and MS-SQLServer. Although some syntactic changes were required depending on the DBMS, all of them pass.

=> Tell me if I forgot some important cases.

gmantele commented 4 years ago

Some notes about the MySQL and MS-SQLServer:

INTERSECT and EXCEPT do not exist in MySQL, but it is possible to get the same result with joins.

Embedded CTE with MS-SQLServer does not work. The only hack I found is to put all CTEs at the root level. Let me know if this hack will not be enough in some cases.

msdemlei commented 4 years ago

On Fri, Oct 23, 2020 at 10:34:23AM -0700, Grégory Mantelet wrote:

The grammar fix proposed here should allow the following ADQL queries:

I think I agree with all of the queries except this one:

  • Embedded CTE:
      SELECT hip FROM hipparcos WHERE hip <= 10
    UNION
      (
          WITH tt AS (SELECT hip FROM hipparcos WHERE hip > 120406)
          SELECT hip FROM tt
      )
    ORDER BY 1 DESC

Is there an overriding reason to allow CTEs in operands of set operators? I cannot see how that would raise the expressiveness of the language (and I think your parallel comment about pulling CTEs to the top leven in SQLServer says about as much).

I see a slight advantage in that you can stick arbitrary queries into such arguments, which might be nice when composing queries out of pre-fabricated fragments, but then the cut-and-paste effort of taking the inner with_queries and sticking them to the top level doesn't seem to be prohibitive, and you get queries that are better understandable because there are fewer places you might have to check for table expressions.

The reason I'd rather not have these "embedded" CTEs is that table reolution (i.e., figuring out the table definition for a given table reference) is hard enough already. When CTEs are only allowed at the top level, it doesn't get much harder (I was actually positively surprised how little code was necessary for that). Allowing CTEs in extra positions will make that a good deal more complicated.

So: I'd say embedded CTEs are little gain for noticable pain, and hence I'd say we shouldn't have them (unless someone finds a stronger rationale than the cut-and-pastee case I'm making above).

In grammar terms, I think that would mean introducing a:

<select_expression> ::= 
       <query_set_expression>
        [ <order_by_clause> ]
        [ <offset_clause> ]

With that, we'd have:

<query_specification> ::=
        WITH <with_query> {, <with_query>} ...
        <select_expression>

<query_set_primary> ::=
        <select_query>
        | <left_paren> <select_expression> <right_parent>

<query_expression> ::=
        <select_expression>
        | <joined_table>

Incidentally, even if there's actually a case for CTEs in set operands, I'm pretty sure we wouldn't want the CTEs in a normal subquery (which is what query_expression is used for) -- or do we?

gmantele commented 4 years ago

I agree, I do not see the point of having CTEs inside a set's operand. I added this query example because there is no reason why it should not be possible considering that CTEs are currently allowed in "normal" subqueries : this way, it is consistent. Besides, it works like that in most DBMS.

However, I admit I have not yet started to implement the set operators in my ADQL parser (but I will this week...and before merging this PullRequest for sure), so I am now very sensible to the point @msdemlei raises about the table resolution : table resolution will be indeed more complicated than it is now (and it is already a bit challenging).

As @msdemlei , I agree that CTEs should be allowed only at the top level query and not any more in subqueries or set's operands. This will help parser development but also it will make ADQL queries more readable (all CTEs will always be declared at only one place: right before the main query).

Me too, I can not find any example which will make "embedded" CTEs different than having them declared at the top level. Does anybody else have any example?

gmantele commented 3 years ago

About the "embedded" CTEs, I do not have to change anything in my parser to make it work, even for the table resolution. So, either solution (allowing them, or not) is fine with me, on an implementation point of view.

Now, because it makes appararently no difference whether the CTEs are embedded or not, and because it would make ADQL queries more clear, I propose that in ADQL(-2.1) CTE are only accepted at the root level.

So, to sum up with examples:

    SELECT ra, dec FROM myTable
UNION
    (WITH cteTable AS (SELECT ra, dec FROM anotherTable)
     SELECT * FROM cteTable)

won't be accepted in ADQL, but the following will be:

WITH cteTable AS (SELECT ra, dec FROM anotherTable)
    SELECT ra, dec FROM myTable
UNION
    SELECT * FROM cteTable

This way, if later we discover there are use-cases where embedded CTEs are the only solution, we can easily extend the ADQL grammar without breaking anything.

I change the grammar right now. If any opposition, speak now (or before this PullRequest is accepted/merged).

gmantele commented 3 years ago

I moved the section about the WITH clause (or CTEs), just before the set operators so that speaking of CTEs in the set operators section makes sense.

I also added the following sentence in the WITH section:

For clarity reason, common table expressions can be defined only in the main query. They are not allowed in sub-queries.

gmantele commented 3 years ago

It's time for whoever is ready to review this Pull Request. According to me, this Pull Request and its corresponding issue can now be closed.