ivoa-std / ADQL

Astronomical Data Query Language Standard
https://wiki.ivoa.net/twiki/bin/view/IVOA/ADQL
Creative Commons Attribution Share Alike 4.0 International
7 stars 7 forks source link

Fix BNF of BOX, CIRCLE and POLYGON #29

Closed gmantele closed 4 years ago

gmantele commented 4 years ago

The grammar of BOX, CIRCLE and POLYGON allows only pair(s) of coordinates, though the description of these functions have been modified since 2.1, now allowing a second form with any expression able to return a POINT (i.e. a column, CENTROID, UDF).

Current grammar:

<box> ::=
    BOX <left_paren>
        [ <coord_sys> <comma> ]
        <coordinates>
        <comma> <numeric_value_expression>
        <comma> <numeric_value_expression>
    <right_paren>

<circle> ::=
    CIRCLE <left_paren>
        [ <coord_sys> <comma> ]
        <coordinates>
        <comma> <radius>
    <right_paren>

<polygon> ::=
    POLYGON <left_paren>
        [ <coord_sys> <comma> ]
        <coordinates>
        <comma> <coordinates>
        { <comma> <coordinates> } ?
    <right_paren>

Suggested grammar update:

<box> ::=
    BOX <left_paren>
        [ <coord_sys> <comma> ]
        <coordinates>
        <comma> <numeric_value_expression>
        <comma> <numeric_value_expression>
    <right_paren>
    |
    BOX <left_paren>
        [ <coord_sys> <comma> ]
        <coord_value>
        <comma> <numeric_value_expression>
        <comma> <numeric_value_expression>
    <right_paren>

<circle> ::=
    CIRCLE <left_paren>
        [ <coord_sys> <comma> ]
        <coordinates>
        <comma> <radius>
    <right_paren>
    |
    CIRCLE <left_paren>
        [ <coord_sys> <comma> ]
        <coord_value>
        <comma> <radius>
    <right_paren>

<polygon> ::=
    POLYGON <left_paren>
        [ <coord_sys> <comma> ]
        <coordinates>
        <comma> <coordinates>
        { <comma> <coordinates> } ?
    <right_paren>
    |
    POLYGON <left_paren>
        [ <coord_sys> <comma> ]
        <coord_value>
        <comma> <coord_value>
        { <comma> <coord_value> } ?
    <right_paren>

Thus, the grammar of POINT does not change:

<point> ::=
    POINT <left_paren>
        [ <coord_sys> <comma> ]
        <coordinates>
    <right_paren>
msdemlei commented 4 years ago

On Fri, Apr 17, 2020 at 07:07:16AM -0700, Grégory Mantelet wrote:

The grammar of BOX, CIRCLE and POLYGON allows only pair(s) of coordinates, though the description of these functions have been modified since 2.1, now allowing a second form with any expression able to return a POINT (i.e. a column, CENTROID, UDF).

This is reasonable, and we should do it, except:

> <box> ::=
>   BOX <left_paren>
>     [ <coord_sys> <comma> ]
>     <coordinates>
>     <comma> <numeric_value_expression>
>     <comma> <numeric_value_expression>
>   <right_paren>
>   |
>   BOX <left_paren>
>     [ <coord_sys> <comma> ]
>     <coord_value>
>     <comma> <numeric_value_expression>
>     <comma> <numeric_value_expression>
>   <right_paren>

Since we've deprecated BOX I don't think we should bother with it any more. There's nothing wrong with having it inconsistent with the others when we say it's on the way out anyway.

> <circle> ::=
>   CIRCLE <left_paren>
>     [ <coord_sys> <comma> ]
>     <coordinates>
>     <comma> <radius>
>   <right_paren>
>   |
>   CIRCLE <left_paren>
>     [ <coord_sys> <comma> ]
>     <coord_value>
>     <comma> <radius>
>   <right_paren>

To keep the rules short, I'd rewrite this as:

<circle_center> ::= 
  <coordinates>
  | <coord_value>

<circle> ::=
   CIRCLE <left_paren>
     [ <coord_sys> <comma> ]
     <circle_center>
     <comma> <radius>
   <right_paren>

On a similar consideration, I'd write

<polygon_vertices> ::=
     <coordinates>
    <comma> <coordinates>
    <comma> <coordinates>
    { <comma> <coordinates> } ...
   |
   <coord_value>
   <comma> <coord_value>
   <comma> <coord_value>
   { <comma> <coord_value> } ...

<polygon> ::=
   POLYGON <left_paren>
     [ <coord_sys> <comma> ]
     <polygon_vertices>
     <right_paren>

This also replaces the question marks (which aren't defined in our BNF) with our repetition syntax, and, since that's zero or more, adds a third mandatory point.

gmantele commented 4 years ago

I agree with this shortened version :-)

gmantele commented 4 years ago

The ADQL text says that BOX, CIRCLE and POLYGON now have an optional coordinate system argument (the 1st one), in addition of 2 forms: one with pair(s) of coordinates and another with POINT value(s). However, the BNF has not yet been updated in that sense.

While starting to fix it, I also started to implement it in VOLLT/ADQL-Lib, but it seems to raise some ambiguities due to the fact that every argument can be a column reference.

For BOX and CIRCLE, the problem can be solved based on the number of arguments, ONLY IF the coordinate system argument is allowed only with coordinates pair(s) (instead of POINT value(s)). Which gives the following BNF:

<box> ::=
        BOX <left_paren>
            [ <coord_sys> <comma> ]
            <coordinates>
            <comma> <numeric_value_expression>
            <comma> <numeric_value_expression>
        <right_paren>
      |
        BOX <left_paren>
            <coord_value>
            <comma> <numeric_value_expression>
            <comma> <numeric_value_expression>
        <right_paren>

<circle> ::= 
        CIRCLE <left_paren>
            [ <coord_sys> <comma> ]
            <coordinates>
            <comma> <radius>
        <right_paren>
      |
        CIRCLE <left_paren>
            <coord_value>
            <comma> <radius>
        <right_paren>

<coord_value> ::= <point> | <column_reference> | <user_defined_function>

Which allows the following ADQL statements with columns:

BOX(colCooSys, colRa, colDec, colWidth, colHeight)
BOX(colRa, colDec, colWidth, colHeight)
BOX(colCenterPoint, colWidth, colHeight)

CIRCLE(colCooSys, colRa, colDec, radius)
CIRCLE(colRa, colDec, radius)
CIRCLE(colCenterPoint, radius)

However, for POLYGON, because its number of arguments is variable, I can not find a way. For the moment I came up with the following BNF:

<polygon> ::=
        POLYGON <left_paren>
            [ <coord_sys> <comma> ]
            <coordinates>
            <comma> <coordinates>
            { <comma> <coordinates> } ?
        <right_paren>
      |
        POLYGON <left_paren>
            <coord_value>
            <comma> <coord_value>
            { <comma> <coord_value> } ?
        <right_paren>

Which makes all the following statements possible:

POLYGON(colCooSys, colRa1, colDec1, colRa2, colDec2, colRa3, colDec3)
POLYGON(colRa1, colDec1, colRa2, colDec2, colRa3, colDec3)
POLYGON(colPoint1, colPoint2, colPoint3)

-- but also:
POLYGON(colPoint1, colPoint2, colPoint3, colPoint4, colPoint5, colPoint6, colPoint7)
-- which is ambigus with the example with colCooSys !!!!

Am I wrong? If not, does anybody have an idea on how to solve this issue?

msdemlei commented 4 years ago

On Tue, Apr 21, 2020 at 02:46:13AM -0700, Grégory Mantelet wrote:

forms: one with pair(s) of coordinates and another with POINT value(s). However, the BNF has not yet been updated in that sense.

Ouch.

While starting to fix it, I also started to implement it in VOLLT/ADQL-Lib, but it seems to raise some ambiguities due to the fact that every argument can be a column reference.

can't be a column reference, which I think is saving us here. > However, for POLYGON, because its number of arguments is variable, I can not find a way. For the moment I came up with the following BNF: > > ``` > ::= > POLYGON > [ ] > > > { } ? > > | > POLYGON > > > { } ? > > ``` > > Which makes all the following statements possible: > > ```sql > POLYGON(colCooSys, colRa1, colDec1, colRa2, colDec2, colRa3, colDec3) > POLYGON(colRa1, colDec1, colRa2, colDec2, colRa3, colDec3) > POLYGON(colPoint1, colPoint2, colPoint3) > > -- but also: > POLYGON(colPoint1, colPoint2, colPoint3, colPoint4, colPoint5, colPoint6, colPoint7) > -- which is ambigus with the example with colCooSys !!!! > ``` > > Am I wrong? > If not, does anybody have an idea on how to solve this issue? Coosys cannot be confused with a coordinate, because neither point nor float can be a literal string. There might be a problem if we allow NULL, but since we'd like to do away with coosys ASAP, I think it's reasonable to (if we actually need this) just be clear it can't be NULL (this IIRC has been open in the old spec). Once you do that, the problem of multi-arg polygon is as it's always been; the parser can't decide between split-coordindate and point arguments. This needs to be done at a later stage when you have type information (DaCHS isn't terribly good at that at the moment, now that I check what I do. Ahem). I'd be open to retracting point arguments to polygon until we've thought about it a bit more. Who's the champion for them?
gmantele commented 4 years ago
can't be a column reference, which I think is saving us here.

Well, according to the BNF, it is possible anyway (since ADQL-2.0):

<coord_sys> ::= <string_value_expression>

<string_value_expression> ::= <character_value_expression>

<character_value_expression> ::= <concatenation> | <character_factor>

<character_factor> ::= <character_primary>

<character_primary> ::= <value_expression_primary> | <string_value_function>

<value_expression_primary> ::=
          <unsigned_value_specification>
        | <column_reference>
        | <set_function_specification>
        | <left_paren> <value_expression> <right_paren>

Considering the change it would imply, I am not sure it is something safe to fix in a minor revision of ADQL, or is it?

gmantele commented 4 years ago

Coosys cannot be confused with a coordinate, because neither point nor float can be a literal string. There might be a problem if we allow NULL, but since we'd like to do away with coosys ASAP, I think it's reasonable to (if we actually need this) just be clear it can't be NULL (this IIRC has been open in the old spec).

Once you do that, the problem of multi-arg polygon is as it's always been; the parser can't decide between split-coordindate and point arguments. This needs to be done at a later stage when you have type information (DaCHS isn't terribly good at that at the moment, now that I check what I do. Ahem).

VOLLT/ADQL-Lib is able to check type information after the building of the AST, but still, identifying such difference between the 2 syntaxes afterwards is not easy...and it would be the first time I check such situation. I mean I could probably spend some time on that in VOLLT and it would work, but I think it would be better if the grammar is more explicit and prevent this ambiguity....which I forsee as also a possible issue with PEG in a future version of ADQL.

msdemlei commented 4 years ago

On Wed, Apr 22, 2020 at 02:13:11AM -0700, Grégory Mantelet wrote:

syntaxes afterwards is not easy...and it would be the first time I check such situation. I mean I could probably spend some time on

Well, the INTERSECTS -> CONTAINS decay is a point where we've already needed it (and that was a reason to drop that decay, if it's still being dropped).

that in VOLLT and it would work, but I think it would be better if the grammar is more explicit and prevent this ambiguity....which I

+1

forsee as also a possible issue with PEG in a future version of ADQL.

No, PEG can cope just fine. But it's still a pain.

msdemlei commented 4 years ago

On Wed, Apr 22, 2020 at 02:01:09AM -0700, Grégory Mantelet wrote:

can't be a column reference, which I think is saving us here.

Well, according to the BNF, it is possible anyway (since ADQL-2.0):

<coord_sys> ::= <string_value_expression>

<string_value_expression> ::= <character_value_expression>

<character_value_expression> ::= <concatenation> | <character_factor>

<character_factor> ::= <character_primary>

<character_primary> ::= <value_expression_primary> | <string_value_function>

<value_expression_primary> ::=
          <unsigned_value_specification>
        | <column_reference>
        | <set_function_specification>
        | <left_paren> <value_expression> <right_paren>

Considering the change it would imply, I am not sure it is something safe to fix in a minor revision of ADQL, or is it?

Oops. I somehow must have repressed the memory that that change hasn't been made to the standard. I'm positive that we want

::= (where I have a hunch that implementations will fail if someone were to write 'IC' -- this is actually legal 'RS'; I'm making no promises for DaCHS). I am almost completely sure that passing in a column reference into coord_sys works on no existing system (and it would be hell in implementation). Hence, I'd say if we're worried about a formal standards process it we can just issue an erratum on 2.0. I'll do it if you ask me to.
gmantele commented 4 years ago

Before proposing such erratum should not we be sure it won't break any existing service? I know that there is very few chance that a such thing is currently supported, but it's possible. I mean, the coord. sys. parsing and then conversion itself can be both done on database side. Not trivial, but not impossible (with Postgres it can be done by adding some C code).

Personally, I did not do it in none of the TAP services I have set up. VOLLT can not either do that.

From what @msdemlei sais, it seems that DACHS is not capable of that either.

Is there any other TAP implementation which is able to accept anything else than string literals as coord. sys.? If yes, would it be a problem to change that, as @msdemlei suggests, with an erratum of ADQL-2.1?

@pdowler , what about your TAP service(s)?

gmantele commented 4 years ago

The ADQL-2.0 - Erratum 3 has been accepted. It fixes the syntax of the coordinate system argument ; it MUST be a string literal.

This done, it now fixes the syntax ambiguity with the extended POLYGON function (i.e. optional coord. sys. and a variable number of POINT values or pairs of coordinates).

gmantele commented 4 years ago
 <box> ::=
   BOX <left_paren>
     [ <coord_sys> <comma> ]
     <coordinates>
     <comma> <numeric_value_expression>
     <comma> <numeric_value_expression>
   <right_paren>
   |
   BOX <left_paren>
     [ <coord_sys> <comma> ]
     <coord_value>
     <comma> <numeric_value_expression>
     <comma> <numeric_value_expression>
   <right_paren>

Since we've deprecated BOX I don't think we should bother with it any more. There's nothing wrong with having it inconsistent with the others when we say it's on the way out anyway.

I am not especially against this logic. But in that case we should not have deprecated it and in the same time added the same text as the other functions (i.e. optional coord. sys. and possibility to use a point value instead of a pair of coordinates). Now this is done, it bothers me to make the BNF inconsistent just for BOX.

So, two solutions to this "problem":

  1. adapt the BNF in the same way as CIRCLE and POLYGON,
  2. remove the part of the description of BOX allowing the coord. sys to be optional, and the part allowing the center position to be given as a point value.

I have no favorite solution. (1) seems just much faster to adopt (because we do not have to play with the document history to get back the 2.0 version).

What do you think?

molinaro-m commented 4 years ago
  1. is more homogeneous
  2. more "correct" w.r.t. deprecation

But if the changes in BOX do not alter the 2.0 usage (if any) I'd prefer 1 for doc management simplicity. It's anyway a personal opinion, not really an objective reasoning.

msdemlei commented 4 years ago

On Wed, Jun 17, 2020 at 10:27:18AM -0700, Grégory Mantelet wrote:

Since we've deprecated BOX I don't think we should bother with it any more. There's nothing wrong with having it inconsistent with the others when we say it's on the way out anyway.

I am not especially against this logic. But in that case we should not have deprecated it and in the same time added the same text as the other functions (i.e. optional coord. sys. and possibility to use a point value instead of a pair of coordinates). Now this is done, it bothers me to make the BNF inconsistent just for BOX.

So, two solutions to this "problem":

  1. adapt the BNF in the same way as CIRCLE and POLYGON,
  2. remove the part of the description of BOX allowing the coord. sys to be optional, and the part allowing the center position to be given as a point value.

I have no favorite solution. (1) seems just much faster to adopt (because we do not have to play with the document history to get back the 2.0 version).

What do you think?

Of course, there's the third choice of not touching BOX BNF vs. 2.0 and reverting all text changes regarding BOX, except for deprecating it, of course. That would be my preferred way out, and with a bit of encouragement I'll do it, to, based on this PR.

However, for this PR, if (1) is faster and simpler, let's just do it.

gmantele commented 4 years ago

Actually, your 3rd option is my 2nd one. Well...sorry for my explanations which weren't clear enough :(

Ok. The option (1) is what I did in the PR. So, let's just keep the PR as such. Is everything in this PR fine for you?