Open ChristianGruen opened 1 month ago
I guess a start would be to distinguish keywords that are only used in the query prolog (such as option
) from those that can be used in expressions. We could also try to distinguish keywords that are only used as the second word in a keyword pair, for example validate strict/lax/type
, or for key/value
Note also: we haven't looked at the effect on XQuery Update.
Note also: we haven't looked at the effect on XQuery Update.
…and XQuery Full Text:
FTWeight ::= "weight" "{" Expr "}"
FTWindow ::= "window" AdditiveExpr FTUnit
...
//p contains text "term" weight { 0.5 }
//p contains text "term" window 1 words
...
@johnlumley Did you already have a chance to parse XQUF and XQFT?
Did you already have a chance to parse XQUF and XQFT?
No - I haven't - (actually I've never used XQ or any of its derivatives;-). But I should be able to - Actually I have the appropriate mechanisms and I could fairly easily check all the literal tokens for ambiguity in the constructor, though it might be a day or two.
I think it's in part how much lookahead you are prepared for. I should explain that my use of an Earley parser with my iXML versions of the grammars allows effective propagation of all possibilities, so examines whether a given expression is ambiguous using the whole grammar and input, rather than being ambiguous (unresolved?) in a local context.
So an initial simple pass of:
element <name> {}
(where <name>
is one of the 143 words in literal terminals), shows only 17 ambiguities, e.g. to
, union
, and
etc. - those that can be infix binary operators, where the parse is either the constructor or the appropriate binary expression, regardless of whether it is meaningful in static type analysis.
Others may require a larger context, which I'll investigate. But to illustrate the point:
let $f := element return {}
parses as a simple child::element
axis step bound to $f
, but in
let $f := element return {} return {}
it binds unambiguously to a CompElemConstructor
for $f
.
So the question is how deep and complex does the expression have to be to exhibit this form of ambiguity and does lookahead distance come into the equation for practical reasons? I suspect my lack of XQuery experience means I'm not seeing some of the possible forms.
Thanks, John. Michael’s comment in another issue may be helpful in this regard: https://github.com/qt4cg/qtspecs/issues/1345#issuecomment-2260290593
I certainly think we must avoid any need for unbounded lookahead to make the grammar unambiguous.
Another way of looking at this: when we find that the first token in an expression is element
, and the next two tokens are N
and {
, where N
is any NCName, then we need to decide whether N
the start of something that follows the expression (which would then be element
as a NameTest
), or an element name. What are the circumstances in which an NCName can immediately follow an expression (or more specifically, a NameTest
)? The obvious cases are (a) where N
is an infix operator, or the first part of a compound infix operator such as cast as
, (b) where N
introduces a FLWOR clause (for, let, group, where, while, order, return), and (c) else
. (Note that as the first token in a FLWOR clause, for
and let
cannot (currently) be followed by {
- but this may change if we introduce destructuring assignment).
There are other cases that can be resolved by lookahead: consider for example order by element ascending ...
or case array(*) return element case ...
but these are all intrinsically error-prone.
In the Qt4 tests, the main change I had to make was to attribute type {...}
in some of the XQUpdate tests. There were also a couple of uses of element element {}
, but I think that was deliberately testing edge cases. Note also, if you want the code to run in both 3.1 and 4.0, it's best to change it to attribute {"type"} {...}
rather than `attribute "type" {...}.
I'm tempted to reopen issue #747 proposing QName literals, so that rather than using a string in quotes here, we use an explicit QName literal.
https://github.com/search?q=%22element+div%22+language%3AXQuery+&type=code returns 298 occurrences. I haven't checked other keywords.
I'm tempted to reopen issue #747 proposing QName literals, so that rather than using a string in quotes here, we use an explicit QName literal.
As it’s already possible to use the Q
within the quotes…
element { 'Q{URI}div' } { }
…we should probably allow the same syntax without curly brackets: element 'Q{URI}div' {}
. I could imagine that an additional QName literal element Q'{URI}div' {}
could get confusing in practice.
Did you already have a chance to parse XQUF and XQFT?
I've managed to generate the current state XQUF grammar and test it against the 814 test expressions in the Update section of the current test suite (it involves some hacking to build on top of my XQuery support in my testdriver). As far as I can tell, there are no fundamental ambiguities in the grammar at present.
I'll have a look at the XQFT set next week (I must admit I'd never seen it before) but am slightly puzzled as to where I can get a reasonable corpus of expressions to test against. The only ones I can find are very very old (like 2007 vintage, pointed to from the 2015 spec) and not in QT3 format.
I'll have a look at the XQFT set next week (I must admit I'd never seen it before) but am slightly puzzled as to where I can get a reasonable corpus of expressions to test against. The only ones I can find are very very old (like 2007 vintage, pointed to from the 2015 spec) and not in QT3 format.
Yes, unfortunately there’s no newer version of this test suite. Back then, we have written a separate driver for this format (and we are still using it to run the old XQuery Update format).
@johnlumley You may run into https://www.w3.org/Bugs/Public/show_bug.cgi?id=30015. Since then I've modified that to:
CastExpr ::= TransformWithExpr ( "cast" "as" SingleType )?
TransformWithExpr ::= ArrowExpr ("transform" "with" "{" Expr? "}")?
IIRC, that's because that's the way BaseX specifies the precedence, allowing you to have all 3 (arrow, transform, cast) expressions together.
IIRC, that's because that's the way BaseX specifies the precedence, allowing you to have all 3 (arrow, transform, cast) expressions together.
The current grammar has the same ordering but with 40 has changed the SingleType
to a more general type (mix of EBNF and iXML):
CastExpr ::= TransformWithExpr ( "cast" "as" CastTarget "?"? )?
TransformWithExpr ::= ArrowExpr ("transform" "with" "{" Expr? "}")?
CastTarget ::= TypeName | ChoiceItemType | EnumerationType
ChoiceItemType:-"(", s, ItemType, (-"|", s, ItemType)*, -")", s.
I'll have a look at the XQFT set next week (I must admit I'd never seen it before) but am slightly puzzled as to where I can get a reasonable corpus of expressions to test against. The only ones I can find are very very old (like 2007 vintage, pointed to from the 2015 spec) and not in QT3 format.
I've managed to both generate an iXML grammar for XQFT and a minimal QT4 test set from the 2007 tests (just only checking for parsing the expression and checking for anticipated static XP parsing errors). In this case the 685 test cases pass and most welcome, none of them show ambiguity!
I should caution that I've never encountered XQuery FullText before so while the expressions parse and the trees look reasonable, I have no deeper understanding of checking all the details. Next week I'll publish all the current iXML grammars, for XPath, XSLT Patterns, XQuery, XQuery Update and XQuery FullText
@ChristianGruen Is there any need to look at the scripting extension for XQ? I can't find any spec references, nor any test cases for it?
I think the scripting extension can be regarded as dead. The people who were keen on it went off and did JSONiq, breaking compatibility with XQuery.
In the Qt4 tests, the main change I had to make was to
attribute type {...}
in some of the XQUpdate tests. There were also a couple of uses ofelement element {}
, but I think that was deliberately testing edge cases.
One more constructor that may need to be fixed uses namespace in
. As a side note, I believe K2-ComputeConElem-20
(element div {}
) should return XPDY0002
instead of XPST0003
.
I already had the chance to experience that the current change would compromise code that is used in production. Cleaning up such code is a hassle, and it is difficult to memorize the long list of keywords – and it may expand in the future – so I am more and more wondering if we should go a step further, drop the old EQName syntax, and restrict literal names to string literals and (for elements & attributes) qualified names:
CompNodeName ::= StringLiteral | URIQualifiedName | ("{" Expr "}")
CompNodeNCName ::= StringLiteral | ("{" Expr "}")
It is safe to argue this would be a massive breaking change. On the other hand, we have already opened the dam. It is generally worthwhile and recommended practice to clean up affected code consistently rather than selectively and randomly. It is ugly to read partially sanitized code snippets like element "div" { element p { ... } }
; maybe we should not motivate that by being tentative.
If we do the step, it would definitely give us more freedom to introduce shorter language constructs as well as shorten existing ones. Examples:
XQUF: We could allow delete //x
and rename $x as 'p'
instead of delete nodes //x
and rename node $x as 'p'
.
XQFT: "A" contains text "A"
could be shortened to "A" contains "A"
.
The XQuery files that I have analyzed in a previous comment contain appr. 13,000 named node constructors in total. The most frequent cases are as follows (I have excluded results that may reveal customer-specific code):
attribute type {
element td {
attribute id {
element hit {
attribute xml:space {
element report {
attribute name {
element csv {
element option {
element msg {
attribute class {
element record {
attribute selected {
element tr {
attribute count {
element b {
attribute value {
element file {
element result {
attribute msg {
attribute cnt {
attribute pos {
element stdout {
element info {
element id {
The XQuery code (relying on some changes in the File Module that have not been published as version 2.0 yet):
let $regex := "(element|attribute|processing-instruction|namespace)\s+(allowing|ancestor|ancestor-or-self|and|array|as|at|attribute|base-uri|boundary-space|by|case|cast|castable|catch|child|collation|comment|construction|context|copy-namespaces|count|decimal-format|decimal-separator|declare|default|descendant|descendant-or-self|digit|div|document|document-node|element|else|empty|empty-sequence|encoding|end|enum|eq|every|except|exponent-separator|false|fixed|fn|following|following-sibling|for|function|ge|group|grouping-separator|gt|idiv|if|import|in|infinity|inherit|instance|intersect|is|item|items|key|keys|lax|le|let|lt|map|member|minus-sign|mod|module|namespace|namespace-node|NaN|ne|next|no-inherit|no-preserve|node|of|only|option|or|order|ordered|ordering|otherwise|pairs|parent|pattern-separator|per-mille|percent|preceding|preceding-sibling|preserve|previous|processing-instruction|record|return|satisfies|schema|schema-attribute|schema-element|self|sliding|some|stable|start|strict|strip|switch|text|then|to|treat|true|try|tumbling|type|typeswitch|union|unordered|validate|value|values|variable|version|when|where|while|window|xquery|zero-digit|ascending|descending|external|greatest|least)\s+\{"
let $regex := "(element|attribute|processing-instruction|namespace)\s+([0-9a-zA-Z_:]+)\s+\{"
let $matches := (
for $file in file:descendants('PATH-TO-PROJECTS/')
where matches($file, '\.xq(m|l|y|uery)?$')
let $text := file:read-text($file, fallback := true())
where matches($text, $regex)
let $result := analyze-string($text, $regex)
return $result/*:match ! ('`' || string-join(*:group, ' ') || ' {`')
)
for $group in $matches
group by $match := $group
let $count := count($group)
order by $count descending
return `* {$count}x {$match}`
I'm attempting to construct a list of keywords that can appear within an Expr by analysing the grammar. First attempt attached. find-keywords-0.txt
This produces the list
NaN allowing ancestor ancestor-or-self and array as ascending at attribute base-uri boundary-space by case cast castable catch child collation comment construction context copy-namespaces count decimal-format decimal-separator declare default descendant descendant-or-self descending digit div document document-node element else empty empty-sequence encoding end enum eq every except exponent-separator external false fn following following-sibling for function ge greatest group grouping-separator gt idiv if import in infinity inherit instance intersect is item key lax le least let lt map member minus-sign mod module namespace namespace-node ne next no-inherit no-preserve node of only option or order ordered ordering parent pattern-separator per-mille percent preceding preceding-sibling preserve previous processing-instruction record return satisfies schema schema-attribute schema-element self sliding some stable start strict strip switch text then to treat true try tumbling type typeswitch union unordered validate value variable version when where while window xquery zero-digit
Two immediate problems here: (a) some operators such as otherwise
are missing; (b) I don't understand why decimal format properties such as exponent-separator
have leaked into the list. Further work needed. It would also be nice to eliminate words such as "type" and "as" which can only appear as the second keyword of a keyword pair.
As axis names appear in the list: Can we say what would be an ambiguity for e.g. parent
?
We could certainly argue that axis names are always distinguishable by the following "::" token and therefore do not need to be in the list.
We could also eliminate words that are always followed by "(", such as item
and node
.
My next iteration eliminates axis and decimal-format names and sorts out the productions listed under OperatorExpr
, producing the list:
allowing and array as ascending at by case catch collation comment count default descending div document-node element else empty empty-sequence end enum eq every except false fn for function ge greatest group gt idiv if in intersect is item key le least let lt map member mod namespace-node ne next node only or order otherwise previous processing-instruction record return satisfies schema-attribute schema-element sliding some stable start switch text then true try tumbling typeswitch union value when where while window
Eliminating strings that are always followed by '(' reduces it to:
allowing and as ascending at by case catch collation count default descending div else empty end eq every except fn for function ge greatest group gt idiv in intersect is key le least let lt member mod ne next only or order otherwise previous return satisfies sliding some stable start switch then try tumbling union value when where while window
Eliminating strings that are always followed by some specific string other than '{' reduces it to:
and as ascending by case catch collation default descending else empty end every except fn for ge greatest in least let mod only or otherwise return satisfies stable start switch then try when where while
One could do some further tweaking of this list, for example let
and for
can't be followed by {
, but my analysis of the grammar would need to go a bit deeper to detect that.
Two immediate problems here: (a) some operators such as
otherwise
are missing; (b) I don't understand why decimal format properties such asexponent-separator
have leaked into the list. Further work needed. It would also be nice to eliminate words such as "type" and "as" which can only appear as the second keyword of a keyword pair.
I'm trying a parallel search on my XQ iXML grammar. My first simple run finds 146 keywords, cf 137 for Michael's initial. The additional include some from the terminal symbols table (the iXML grammar has no lexer and therefore includes terminal grammars). But it also finds otherwise
and pairs
, keys
, values
and items
from Modifier
. Notably it doesn't find NaN
I'll keep working on refining the search, along similar lines to Michael's on the basis that an independent run might be helpful. Once it's done I can easily repeat on XQUF and XQFT.
Umm no, that's clearly wrong because it has dropped div
and idiv
(while retaining mod
). That's because div
and idiv
are followed by another string within a choice, rather than within a sequence. Fixing that gives
and as ascending by case catch collation default descending div else empty end eq every except fn for function ge greatest gt idiv in intersect is le least let lt mod ne only or otherwise return satisfies some stable start switch then try union when where while
Really, to do this properly we need to build the finite state automaton for the grammar and run queries on that. Does anyone have a tool to do that? I thought about converting the grammar into an XSD schema (every production becomes a model group, every terminal becomes an element particle), then export this as an SCM file, which contains an XML representation of the determinised FSA. But XSD doesn't allow any lookahead in the grammar so that's probably a non-starter.
I've taken a slightly different 'bottom-up' approach with the following steps:
{
, i.e. where any alternative has {
only proceeded by repeat0
or option
terms.{
, This accounts for 52 productions in the current XQNCName
literals which are followed (possibly from within a choice, option or repetition) directly by reference to any of these productions, or {
directly. => CaseA
followed by:
NCName
literal, e.g. MultiplicativeOps
, which has div
, idiv
, and mod
{
-starting production references or {
directly, and take their NCName
literals. => CaseB
The set of CaseA+CaseB
is currently producing 40 candidates:
and array attribute by case comment div document element else eq except ge gt idiv in intersect is le lt map mod namespace ne or ordered otherwise processing-instruction return satisfies text then to try type union unordered when where while
which on first inspection seems plausible, certainly collecting 'all the usual suspects', but as Mike points out, this is not an exhaustive examination.
Observations/conclusions from the exchange on Slack regarding computed node constructors:
count
orvalue
)A quick evaluation over appr. 8,000 XQuery files resulted in the following list of occurrences of possible incompatibilities (spread across appr. 400 files):
attribute type {
element option {
element record {
attribute count {
attribute value {
element value {
element type {
attribute text {
attribute values {
attribute namespace {
attribute default {
element text {
attribute key {
element key {
element group {
element count {
element to {
element item {
element div {
attribute collation {
attribute encoding {
attribute comment {
attribute context {
attribute case {
element map {
element empty {
element items {
attribute to {
attribute empty {
attribute item {
element element {
attribute where {
element values {
attribute if {
element in {
element comment {
attribute start {
attribute end {
element document {