Simplify the cql2-text grammar (future version improvements?)

jerstlouis commented 2 years ago

This is feedback from trying to implement cql2-text. Implementers (or at least us) face struggles with the current grammar.

I think it comes down mainly to these two things:

Some of the capabilities from extension conformance classes are defined as separate rules. I think it would be much easier to simply define new possible values for operators, or new pre-defined function identifiers (using the same grammar rule as function calls) for operators that use a function call syntax (i.e., array/spatial/temporal operators and predicates). This would cut down the number of rules dramatically, and I think it would also make the requirements in each conformance class clearer.
Some rules seem to exist only to restrict the data types (e.g., numericExpression, characterExpression, temporalExpression...). However, this is purely a runtime concept, since the data type that a given expression (e.g., a property) evaluates to depends on the queryables. Therefore I would not have used grammar rules (which are about the syntax) to make this distinction. Instead, I think what is needed is requirements and/or permissions that specify the interpretation when an unexpected data type is used in such a context.

I think simplifying these two aspects of the grammar would directly result in simpler parser implementations, greater ease of implementation and greater interoperability.

cportele commented 2 years ago

Meeting 2022-06-20: It would be good to understand why this would result in an easier implementation. We need to discuss this in a meeting when @jerstlouis is present.

jerstlouis commented 2 years ago

Thanks @cportele . I should be attending the next meeting in a couple weeks.

As a summary, from a syntactic point of view, I think the two things I suggested above would result in fewer grammar rules (a simpler grammar), and parser node classes would be a more direct / natural match to the rules. We would implement the function/operator name validation and data type checking separately from the parsing, since some of it is only known at runtime (e.g., available functions, queryable data types). For example, in our implementation we have a CQL2CallExp node class which we plan to use to handle the array / spatial / temporal operators which syntactically look like function calls. We are hand-writing a recursive descent parser, borrowing heavily from our ECCSS/CMSS parser.
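
A minimal Python sketch of that separation (not our actual eC code): the parser would only produce a generic call node, and a later pass would check the name and argument count against whatever is known at runtime. The names CallExp, KNOWN_FUNCTIONS and validate_call, and the arities in the table, are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CallExp:
    """Generic AST node for anything that syntactically looks like name(arg, ...)."""
    name: str
    args: list = field(default_factory=list)


# Hypothetical registry (assumption): a real server would build this at runtime
# from its supported conformance classes and advertised custom functions.
KNOWN_FUNCTIONS = {
    "S_INTERSECTS": 2,
    "T_BEFORE": 2,
    "CASEI": 1,
}


def validate_call(node, known=KNOWN_FUNCTIONS):
    """Return a list of problems; an empty list means the call is acceptable."""
    problems = []
    key = node.name.upper()  # function names are matched case-insensitively here
    if key not in known:
        problems.append(f"unknown function or operator: {node.name}")
    elif len(node.args) != known[key]:
        problems.append(
            f"{node.name} expects {known[key]} argument(s), got {len(node.args)}"
        )
    return problems
```

With this split, adding a new operator means adding an entry to the registry rather than a new grammar rule or node class.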

jerstlouis commented 2 years ago

The following excerpt from our internal CQL2 design document mapping CQL2 conformance classes and providing a concise summary of the CQL2 syntax might be insightful. A simpler grammar could potentially closely match those CQL2* AST node classes to rules. We could eventually prototype such a simpler grammar together with railroad diagrams demonstrating the idea.

Basic CQL2

Defines predicate expressions evaluating to a boolean value, which we parse as the following eC AST node classes:
- CQL2Identifier for identifiers, which are sequences of UTF-8 characters. Identifiers can also be double-quoted to include any arbitrary characters. As in ECCSS, true, false and null will be treated as identifiers in our implementation (with the drawback that they cannot be used as identifiers, even when double-quoted).
  - Valid identifier starting characters: ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
  - Additional valid identifier continuing characters: "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
- CQL2Expression for a generic expression class from which other CQL2Exp* are derived:
  - Sub-expressions enclosed in parentheses ( ) to override default operator priorities
  - CQL2ExpIdentifier for expressions consisting of an identifier (CQL2Identifier).
  - CQL2ExpConstant for decimal numeric literals, integer or fractional using ., no suffixes used, including support for scientific notation (E separating power of 10 exponent)
  - CQL2ExpString for UTF-8 character string literals enclosed in single quote ('); single quote characters within a string literal are represented by two consecutive single-quote characters ('')
    - Defines the concept of a date and date-time as string literals (CQL2ExpString) following RFC 3339 (profile of ISO 8601).
  - CQL2ExpCall with support for:
    - DATE as a well-known function taking a string literal (CQL2ExpString) defining a date instant
    - TIMESTAMP as a well-known function taking a string literal (CQL2ExpString) defining a datetime instant
  - CQL2ExpOperation, with support for the following operators (note that all CQL2 keywords are case-insensitive):
    - Unary operator NOT followed by an operand
    - Binary logical operators AND and OR
    - Binary relational operators =, <, >, <=, >=, <>
    - Binary relational operators IS and IS NOT (ignore extra spaces between IS and NOT) for checking against the NULL identifier.
    - For relational operators with Basic CQL2, CQL2ExpIdentifier (other than true, false and null) are only supported for the left operand, while true, false, null and literals are supported in second operands only.
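
The identifier character rules listed above can be checked with a simple range table. This is a hedged Python sketch, not our eC implementation; START_RANGES, CONTINUE_EXTRA and is_unquoted_identifier are hypothetical names, and the ranges are copied from the two lists above.

```python
# Code point ranges allowed as the first character of an unquoted identifier.
START_RANGES = [
    (0x3A, 0x3A),  # ":"
    (0x41, 0x5A),  # A-Z
    (0x5F, 0x5F),  # "_"
    (0x61, 0x7A),  # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Additional ranges allowed only after the first character.
CONTINUE_EXTRA = [
    (0x2D, 0x2D),  # "-"
    (0x2E, 0x2E),  # "."
    (0x30, 0x39),  # 0-9
    (0xB7, 0xB7),  # middle dot
    (0x0300, 0x036F), (0x203F, 0x2040),
]


def _in_ranges(ch, ranges):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_unquoted_identifier(text):
    """True if text could be written as an identifier without double quotes."""
    if not text or not _in_ranges(text[0], START_RANGES):
        return False
    return all(
        _in_ranges(c, START_RANGES) or _in_ranges(c, CONTINUE_EXTRA)
        for c in text[1:]
    )
```

Note that this only checks the character classes; whether a name such as true or null is usable as an identifier is a separate decision, as discussed above.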

Property-Property

Removes the limitation on which operands of relational operators can be identifiers or literals.

Arithmetic Expressions

Adds support for the following binary operators in CQL2ExpOperation: +, -, *, / (fractional division, see also Features#711), ^ (exponent)
Adds support for the - unary operator? (Features#709)

Advanced Comparison Operators

Adds LIKE and NOT LIKE relational operators (ignore extra spaces between NOT and LIKE) that accept a pattern where % matches 0..n arbitrary characters and _ matches exactly one arbitrary character (both characters can be escaped using a \ character); expects text expressions only, and string literals in the right operand.
- NOTE: Equivalent but different functionality in ECCSS is provided by the ^ (starts with), $ (ends with), ~ (contains) text operators and their negated counterparts.
Adds BETWEEN and NOT BETWEEN ternary relational operators (e.g., depth BETWEEN 100.0 AND 150.0); expects numeric expressions only.
Adds IN and NOT IN relational operators taking a comma-separated list of expressions (CQL2ExpList) within parentheses as second operand; items in the list are expected to be of same type as value being tested.
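
As an illustration of the LIKE pattern rules above, here is a minimal Python sketch that translates a pattern into a regular expression. The function names like_to_regex and like are hypothetical, and two behaviours are assumptions of this sketch rather than something stated above: matching is case-sensitive, and a trailing lone \ is treated as a literal backslash.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE pattern (% and _ wildcards, \\ escape) to a compiled regex."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Escaped character: take the next character literally.
            out.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            out.append(".*")   # 0..n arbitrary characters
        elif ch == "_":
            out.append(".")    # exactly one arbitrary character
        else:
            out.append(re.escape(ch))
        i += 1
    # DOTALL so that wildcards also match newline characters.
    return re.compile("".join(out), re.DOTALL)


def like(value, pattern):
    """True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None
```

For example, like("Toronto", "Tor%") holds, while like("1000", "100\\%") does not, because the escaped % only matches a literal percent sign.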

Functions

Adds CQL2ExpCall with support for implementation-defined custom functions, taking a list of expressions within parentheses ( ) as arguments following an identifier (CQL2Identifier) for the function to call
Implies use of CQL2ExpList for function arguments separated by commas
Although the CQL2 specification and grammar do not currently define them as such, syntactically all of the following extended conformance classes could have been defined using the function call grammar rule, and our parser implements them as such using a CQL2ExpCall AST node. This demonstrates that functions are a mechanism by which CQL2 could be extended independently from the specification.
- Except for WKT, only the array literals would require the addition of a new grammar rule, since they use [...] rather than e.g., ARRAY(...). My suggestion in Features#718 is to use (1,2,3) for array literals instead. To support WKT, support for space-separated tuples is also required, e.g., 10 30 in POLYGON((10 30, 40 20, 50 80, 10 30)).
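
To make the point concrete, here is a rough Python sketch (not our eC parser) of a tiny recursive descent parser in which one call rule, plus parenthesised lists and space-separated numeric tuples, is enough to handle both an operator call and WKT. Call, Parser and parse are hypothetical names, and the token set is deliberately reduced to names, numbers, parentheses and commas. As a simplification, a lone number also comes back as a 1-tuple.

```python
import re
from dataclasses import dataclass


@dataclass
class Call:
    name: str
    args: list


TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_]*|-?\d+(?:\.\d+)?|[(),])")


def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def _is_number(tok):
    return tok is not None and (tok[0].isdigit() or tok[0] == "-")


class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise ValueError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def parse_list(self):
        # '(' item (',' item)* ')'
        self.take("(")
        items = [self.parse_item()]
        while self.peek() == ",":
            self.take(",")
            items.append(self.parse_item())
        self.take(")")
        return items

    def parse_item(self):
        tok = self.peek()
        if tok == "(":
            return self.parse_list()
        if _is_number(tok):
            # Space-separated tuple such as "10 30" (needed for WKT).
            values = []
            while _is_number(self.peek()):
                values.append(float(self.take()))
            return tuple(values)
        if tok is None or not (tok[0].isalpha() or tok[0] == "_"):
            raise ValueError(f"unexpected token {tok!r}")
        name = self.take()
        # An identifier is a call only when "(" follows; otherwise a plain name.
        return Call(name, self.parse_list()) if self.peek() == "(" else name


def parse(text):
    parser = Parser(tokenize(text))
    node = parser.parse_item()
    if parser.peek() is not None:
        raise ValueError("trailing input")
    return node
```

Here parse("S_INTERSECTS(geom, POINT(1 2))") and parse("POLYGON((10 30, 40 20, 50 80, 10 30))") both go through the same Call rule, with no operator-specific or geometry-specific productions.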

Case-insensitive Comparison

Adds CQL2ExpCall with support for the CASEI well-known function returning a case-desensitised version of a string.

Accent-insensitive Comparison

Adds CQL2ExpCall with support for the ACCENTI well-known function returning an accent-desensitised version of a string.

Basic Spatial Operators

Adds CQL2ExpCall with support for the POINT, LINESTRING, POLYGON, MULTIPOINT, MULTILINESTRING, MULTIPOLYGON, GEOMETRYCOLLECTION and ENVELOPE well-known functions defining vector geometry objects following the simple features model (WKT encoding).
- Also implies support for space-separated tuples and array literals using ( ) to support the WKT notation as arguments to those function calls
Adds the S_INTERSECTS well-known function for spatial intersection operator
Implies use of CQL2ExpList for function arguments separated by commas.

Spatial Operators

Implies Basic Spatial operator support, and adds the following well-known functions for additional spatial operators:
- S_CONTAINS, S_CROSSES, S_DISJOINT, S_EQUALS, S_OVERLAPS, S_TOUCHES, S_WITHIN

Temporal Operators

Adds CQL2ExpCall with support for:
- INTERVAL as a well-known function taking two instant string literals (CQL2ExpString) defining a temporal interval object
- the following operators taking both instants and intervals as arguments: T_AFTER, T_BEFORE, T_DISJOINT, T_EQUALS, T_INTERSECTS
- the following operators taking only intervals as arguments: T_CONTAINS, T_DURING, T_FINISHEDBY, T_FINISHES, T_MEETS, T_METBY, T_OVERLAPPEDBY, T_OVERLAPS, T_STARTEDBY, T_STARTS
Implies use of CQL2ExpList for function arguments separated by commas

Array Operators

Adds CQL2ExpArray (array literals as a list of expressions (CQL2ExpList) within [ ])
Adds CQL2ExpCall with support for the A_CONTAINEDBY, A_CONTAINS, A_EQUALS and A_OVERLAPS array operators as well-known functions
Implies use of CQL2ExpList for expressions array and for function arguments separated by commas

@pvretano

jerstlouis commented 2 years ago

See first draft of proposed simpler grammar rules in https://github.com/opengeospatial/ogcapi-features/issues/723#issuecomment-1172603159.

jerstlouis commented 2 years ago

Note that in the approach I suggest for defining the grammar production rules, operators / functions are not really keywords, but regular identifiers used in function call expressions (or in spatial/temporal/array literal definitions using the same syntax as function calls). For example, this means that a date or s_intersects queryable would not need to be double-quoted (as in the current abstract tests), since date would only take on its meaning of a temporal literal when it is followed by an opening parenthesis (, and therefore there really is no ambiguity in date<>DATE('2022-04-16').

In my opinion this makes it much easier to extend the language with additional functions / operators, since those additions would not introduce additional keywords that break implementations not previously requiring queryables with the same name to be double-quoted. The list of keywords in 8.2 (which would need to be double-quoted, if allowed at all) would be reduced to:

AND
BETWEEN
DIV
FALSE
IN
IS
LIKE
NOT
NULL
OR
TRUE

All of the other ones would get tokenized by the lexer as an identifier which can be used as operators/function calls, or to define literals and only get resolved in the contexts where they apply. This is the approach taken in C-like languages where standard functions and data types/structs (or classes in C++) are not classified as keywords.
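
A small Python sketch of that lexing approach, under the assumption of a much reduced token set (names, numbers, single-quoted strings and a few operators): only the eleven words listed above are reserved, and any other word is an identifier that is resolved as a call only when an opening parenthesis follows. RESERVED, tokenize and classify_identifiers are hypothetical names for this illustration.

```python
import re

# The reduced keyword list proposed above; everything else is an identifier.
RESERVED = {"AND", "BETWEEN", "DIV", "FALSE", "IN", "IS",
            "LIKE", "NOT", "NULL", "OR", "TRUE"}

TOKEN_RE = re.compile(r"""
    \s*(?:
        (?P<string>'(?:[^']|'')*')
      | (?P<number>\d+(?:\.\d+)?)
      | (?P<word>[A-Za-z_][A-Za-z0-9_]*)
      | (?P<op><>|<=|>=|[=<>(),])
    )""", re.VERBOSE)


def tokenize(text):
    """Split text into (kind, value) pairs; keywords are matched case-insensitively."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        kind = m.lastgroup
        value = m.group(kind)
        if kind == "word":
            kind = "keyword" if value.upper() in RESERVED else "identifier"
        tokens.append((kind, value))
        pos = m.end()
    return tokens


def classify_identifiers(tokens):
    """Resolve each identifier by context: 'call' if '(' follows, else 'property'."""
    out = []
    for i, (kind, value) in enumerate(tokens):
        if kind == "identifier":
            following = tokens[i + 1] if i + 1 < len(tokens) else None
            kind = "call" if following == ("op", "(") else "property"
        out.append((kind, value))
    return out
```

Running classify_identifiers(tokenize("date<>DATE('2022-04-16')")) tags the first date as a property and the second as a call, without any quoting.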

Also note that SQL keywords (or "reserved" words) do not seem to include any function-like keywords either. Things like UPPER() changing case are described as functions instead.

jerstlouis commented 5 months ago

See the CartoSym-CSS BNF lexer / grammar for ANTLR4 which should (in theory) be a true superset of CQL2:

https://github.com/opengeospatial/styles-and-symbology/blob/main/core/schemas/CartoSym-CSS-Lexer.g4

https://github.com/opengeospatial/styles-and-symbology/blob/main/core/schemas/CartoSym-CSS-Grammar.g4

The starting rule for CQL2 is expression (e.g., you can paste the Lexer and Grammar at http://lab.antlr.org/ and test any CQL2 expression with expression as the start rule).

When I have a chance I will extract only the CQL2 relevant part.
