Open jerstlouis opened 2 years ago
Meeting 2022-06-20: It would be good to understand why this would result in an easier implementation. We need to discuss this in a meeting when @jerstlouis is present.
Thanks @cportele. I should be attending the next meeting in a couple of weeks.
As a summary, from a syntactic point of view, I think the two things I suggested above would result in fewer grammar rules (a simpler grammar), and parser node classes would be a more direct / natural match to the rules. We would implement the function/operator name validation and data type checking separately from the parsing, since some of it is only known at runtime (e.g., available functions, queryable data types). For example, in our implementation we have a CQL2CallExp node class which we plan to use to handle the array / spatial / temporal operators which syntactically look like function calls. We are hand-writing a recursive descent parser, borrowing heavily from our ECCSS/CMSS parser.
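To make the idea concrete, here is a minimal sketch of a recursive descent parser in which anything of the form `name(...)` becomes a single call node, with name validation done in a separate pass. It is not our actual implementation: the node classes, `parse` and `unknown_calls` are hypothetical names, only a tiny subset of the syntax is covered (comparisons, identifiers, plain numbers, strings and comma-separated arguments), and treating function names as case-insensitive is an assumption.

```python
import re
from dataclasses import dataclass

# Hypothetical node classes; only the idea of a single call node comes from the thread.
@dataclass
class Identifier:
    name: str

@dataclass
class Literal:
    value: object

@dataclass
class Call:  # would cover DATE(...), S_INTERSECTS(...), A_CONTAINS(...), ...
    name: str
    args: list

@dataclass
class Binary:
    op: str
    left: object
    right: object

TOKEN = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|'((?:[^']|'')*)'|([A-Za-z_]\w*)|(<=|>=|<>|[=<>(),]))")
COMPARISONS = ("=", "<", ">", "<=", ">=", "<>")

def tokenize(text):
    text, pos, tokens = text.rstrip(), 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        num, string, ident, punct = m.groups()
        if num is not None:
            tokens.append(("num", float(num)))
        elif string is not None:
            tokens.append(("str", string.replace("''", "'")))  # '' is an escaped quote
        elif ident is not None:
            tokens.append(("id", ident))
        else:
            tokens.append(("op", punct))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, text):
        self.tokens, self.pos = tokenize(text), 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("eof", None)

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def comparison(self):
        left = self.primary()
        kind, value = self.peek()
        if kind == "op" and value in COMPARISONS:
            self.take()
            return Binary(value, left, self.primary())
        return left

    def primary(self):
        kind, value = self.take()
        if kind in ("num", "str"):
            return Literal(value)
        if kind != "id":
            raise SyntaxError(f"unexpected token {value!r}")
        if self.peek() != ("op", "("):
            return Identifier(value)      # plain property reference
        self.take()                        # consume "("
        args = []
        while self.peek() != ("op", ")"):
            if args and self.take() != ("op", ","):
                raise SyntaxError("expected ',' between arguments")
            args.append(self.comparison())
        self.take()                        # consume ")"
        return Call(value.upper(), args)   # assumption: names are case-insensitive

def parse(text):
    parser = Parser(text)
    node = parser.comparison()
    if parser.peek()[0] != "eof":
        raise SyntaxError("trailing input")
    return node

def unknown_calls(node, known):
    """Separate validation pass: the grammar itself knows no function names."""
    if isinstance(node, Call):
        found = [] if node.name in known else [node.name]
        for arg in node.args:
            found += unknown_calls(arg, known)
        return found
    if isinstance(node, Binary):
        return unknown_calls(node.left, known) + unknown_calls(node.right, known)
    return []
```

With this split, `parse("foo(depth) = 1")` succeeds syntactically, and `unknown_calls(..., {"DATE"})` reports `FOO` afterwards, once the set of available functions is known.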
The following excerpt from our internal CQL2 design document, which maps CQL2 conformance classes and provides a concise summary of the CQL2 syntax, might be insightful. A simpler grammar could potentially match those CQL2* AST node classes closely to rules. We could eventually prototype such a simpler grammar together with railroad diagrams demonstrating the idea.
- `true`, `false` and `null` will be treated as identifiers in our implementation (with the drawback that they cannot be used for identifiers even double-quoted).
- Identifier start characters: `":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]`
- Additional characters allowed after the first: `"-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]`
- `(` `)` to override default operator priorities.
- Numeric literals, no suffixes used, including support for scientific notation (`E` separating the power-of-10 exponent)
- String literals in single quotes (`'`); single quote characters within a string literal are represented by two consecutive single-quote characters (`''`)
- `DATE` as a well-known function taking a string literal (CQL2ExpString) defining a date instant
- `TIMESTAMP` as a well-known function taking a string literal (CQL2ExpString) defining a datetime instant
- `NOT` followed by an operand
- `AND` and `OR`
- `=`, `<`, `>`, `<=`, `>=`, `<>`
- `IS` and `IS NOT` (ignore extra spaces between `IS` and `NOT`) for checking against the `NULL` identifier.
- Identifiers (other than `true`, `false` and `null`) are only supported for the left operand, while `true`, `false`, `null` and literals are supported in second operands only.
- `+`, `-`, `*`, `/` (fractional division, see also Features#711), `^` (exponent)
- `-` unary operator? (Features#709)
- `LIKE` and `NOT LIKE` relational operators (ignore extra spaces between `NOT` and `LIKE`) that accept a pattern where `%` matches 0..n arbitrary characters and `_` matches exactly one arbitrary character (those characters can be escaped by using a `\` character); expects text expressions only, and string literals in the right operand.
- `^` (starts with), `$` (ends with), `~` (contains) text operators and their negated counterparts.
- `BETWEEN` and `NOT BETWEEN` ternary relational operators (e.g., `depth BETWEEN 100.0 AND 150.0`); expects numeric expressions only.
- `IN` and `NOT IN` relational operators taking a comma-separated list of expressions (CQL2ExpList) within parentheses as second operand; items in the list are expected to be of the same type as the value being tested.
- `(` `)` enclosing the arguments following an identifier (CQL2Identifier) for the function to call
- `[...]` rather than e.g., `ARRAY(...)`. My suggestion in Features#718 is to use `(1,2,3)` for array literals instead. To support WKT, space-separated tuples are also required, e.g., `10 30` in `POLYGON((10 30, 40 20, 50 80, 10 30))`.
- `CASEI` well-known function returning a case-desensitised version of a string.
- `ACCENTI` well-known function returning an accent-desensitised version of a string.
- `POINT`, `LINESTRING`, `POLYGON`, `MULTIPOINT`, `MULTILINESTRING`, `MULTIPOLYGON`, `GEOMETRYCOLLECTION` and `ENVELOPE` well-known functions defining vector geometry objects following the simple features model (WKT encoding).
- `(` `)` to support the WKT notation as arguments to those function calls
- `S_INTERSECTS` well-known function for the spatial intersection operator
- `S_CONTAINS`, `S_CROSSES`, `S_DISJOINT`, `S_EQUALS`, `S_OVERLAPS`, `S_TOUCHES`, `S_WITHIN`
- `INTERVAL` as a well-known function taking two instant string literals (CQL2ExpString) defining a temporal interval object
- `T_AFTER`, `T_BEFORE`, `T_DISJOINT`, `T_EQUALS`, `T_INTERSECTS`
- `T_CONTAINS`, `T_DURING`, `T_FINISHEDBY`, `T_FINISHES`, `T_MEETS`, `T_METBY`, `T_OVERLAPPEDBY`, `T_OVERLAPS`, `T_STARTEDBY`, `T_STARTS`
- `[` `]`
- `A_CONTAINEDBY`, `A_CONTAINS`, `A_EQUALS` and `A_OVERLAPS` array operators as well-known functions

@pvretano See first draft of proposed simpler grammar rules in https://github.com/opengeospatial/ogcapi-features/issues/723#issuecomment-1172603159.
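As an illustration of one rule from the syntax summary above, the LIKE pattern semantics (`%` for 0..n characters, `_` for exactly one, `\` as escape) could be sketched as a translation to a regular expression. The names `like_to_regex` and `like` are hypothetical, and case-sensitive matching and treating a trailing lone `\` as a literal backslash are assumptions, not something the summary specifies.

```python
import re

def like_to_regex(pattern):
    """Translate a LIKE pattern into a compiled regular expression."""
    parts, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # An escaped character always matches itself literally.
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            parts.append(".*")   # 0..n arbitrary characters
        elif ch == "_":
            parts.append(".")    # exactly one arbitrary character
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.DOTALL)

def like(text, pattern):
    """Return True when the whole text matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(text) is not None
```

For example, `like("harbour", "har%")` and `like("100%", r"100\%")` both hold, while `like("1000", r"100\%")` does not.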
Note that in the approach I suggest for defining the grammar production rules, operators / functions are not really keywords, but regular identifiers used in function call expressions (or in spatial/temporal/array literal definitions using the same syntax as function calls). For example, this means that a `date` or `s_intersects` queryable would not need to be double-quoted (as in the current abstract tests), since `date` would only take its meaning of a temporal literal when it is followed by an opening parenthesis `(`, and therefore there really is no ambiguity in `date<>DATE('2022-04-16')`.
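A minimal sketch of that disambiguation, assuming a lexer that emits every name as a plain identifier and a single token of lookahead to tell calls from queryables. The function name `classify` and the token pattern are hypothetical, numbers are simply skipped, and the remaining true keywords (`AND`, `OR`, `NOT`, ...) are not handled here.

```python
import re

# String literals, identifiers, comparison operators and punctuation only.
TOKEN = re.compile(r"\s*(?:'(?:[^']|'')*'|[A-Za-z_]\w*|<=|>=|<>|[=<>(),])")
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")

def classify(expression):
    """Label each identifier as a call or a property using one token of lookahead."""
    tokens = [token.strip() for token in TOKEN.findall(expression)]
    labelled = []
    for i, token in enumerate(tokens):
        if IDENTIFIER.fullmatch(token):
            # Only an identifier directly followed by "(" is a function call.
            is_call = i + 1 < len(tokens) and tokens[i + 1] == "("
            labelled.append((token, "call" if is_call else "property"))
    return labelled

print(classify("date<>DATE('2022-04-16')"))
# [('date', 'property'), ('DATE', 'call')]
```

The same lookahead lets `S_INTERSECTS(geometry, s_intersects)` use `s_intersects` as an unquoted queryable name.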
In my opinion this makes it much easier to extend the language with additional functions / operators, since those additions would not introduce additional keywords that break implementations not previously requiring queryables with the same name to be double-quoted. The list of keywords in 8.2 (which would need to be double-quoted, if allowed at all) would be reduced to:
All of the other ones would get tokenized by the lexer as an identifier which can be used as operators/function calls, or to define literals and only get resolved in the contexts where they apply. This is the approach taken in C-like languages where standard functions and data types/structs (or classes in C++) are not classified as keywords.
Also note that SQL keywords (or "reserved" words) do not seem to include any function-like keywords either. Things like `UPPER()` changing case are described as functions instead.
See the CartoSym-CSS BNF lexer / grammar for ANTLR4 which should (in theory) be a true superset of CQL2:
https://github.com/opengeospatial/styles-and-symbology/blob/main/core/schemas/CartoSym-CSS-Lexer.g4
The starting rule for CQL2 is `expression` (e.g., you can paste the Lexer and Grammar at http://lab.antlr.org/ and test any CQL2 expression with `expression` as the start rule).
When I have a chance I will extract only the CQL2 relevant part.
This is feedback from trying to implement cql2-text. Implementers (or at least we) struggle with the current grammar.
I think it comes down mainly to these two things:
I think simplifying these two aspects of the grammar would directly result in simpler parser implementations, greater ease of implementation and greater interoperability.