qt4cg / qtspecs

QT4 specifications
https://qt4cg.org/
Other
29 stars 15 forks source link

Computed node constructors: observations #1528

Open ChristianGruen opened 1 month ago

ChristianGruen commented 1 month ago

Observations/conclusions from the exchange on Slack regarding computed node constructors:

  1. We should highlight the breaking change in the appendix: J.1 Incompatibilities relative to XQuery 3.1.
  2. We should present a list of keywords which is ambiguous and exclude other ones (like e.g. count or value)
  3. Maybe we should generally discourage users from using the legacy NCName syntax, and remove corresponding examples in the spec, as further versions of the language may lead to new incompatibilities.
  4. Syntax errors in quoted element names were should already be detected at parse time, for example by using pseudo quotes:
# currently
CompNodeName    ::=  StringLiteral | UnreservedName | ("{" Expr "}")
CompNodeNCName  ::=  StringLiteral | UnreservedNCName | ("{" Expr "}")
# proposed
CompNodeName    ::=  UnreservedName | ('"' UnreservedName '"') | ("'" UnreservedName "'") | ("{" Expr "}")
CompNodeNCName  ::=  UnreservedNCName | ('"' UnreservedNCName '"') | ("'" UnreservedNCName "'") | ("{" Expr "}")

A quick evaluation over appr. 8,000 XQuery files resulted in the following list of occurrences of possible incompatibilities (spread across appr. 400 files):

michaelhkay commented 1 month ago

I guess a start would be to distinguish keywords that are only used in the query prolog (such as option) from those that can be used in expressions. We could also try to distinguish keywords that are only used as the second word in a keyword pair, for example validate strict/lax/type, or for key/value

Note also: we haven't looked at the effect on XQuery Update.

ChristianGruen commented 1 month ago

Note also: we haven't looked at the effect on XQuery Update.

…and XQuery Full Text:

FTWeight ::= "weight" "{" Expr "}"
FTWindow ::= "window" AdditiveExpr FTUnit
...

//p contains text "term" weight { 0.5 }
//p contains text "term" window 1 words
...

@johnlumley Did you already have a chance to parse XQUF and XQFT?

johnlumley commented 1 month ago

Did you already have a chance to parse XQUF and XQFT?

No - I haven't - (actually I've never used XQ or any of its derivatives;-). But I should be able to - Actually I have the appropriate mechanisms and I could fairly easily check all the literal tokens for ambiguity in the constructor, though it might be a day or two.

johnlumley commented 1 month ago

I think it's in part how much lookahead you are prepared for. I should explain that my use of an Earley parser with my iXML versions of the grammars allows effective propagation of all possibilities, so examines whether a given expression is ambiguous using the whole grammar and input, rather than being ambiguous (unresolved?) in a local context.

So an initial simple pass of:

element <name> {}

(where <name> is one of the 143 words in literal terminals), shows only 17 ambiguities, e.g. to, union, and etc. - those that can be infix binary operators, where the parse is either the constructor or the appropriate binary expression, regardless of whether it is meaningful in static type analysis.

Others may require a larger context, which I'll investigate. But to illustrate the point:

let $f := element return {}

parses as a simple child::element axis step bound to $f, but in

let $f := element return {} return {}

it binds unambiguously to a CompElemConstructor for $f.

So the question is how deep and complex does the expression have to be to exhibit this form of ambiguity and does lookahead distance come into the equation for practical reasons? I suspect my lack of XQuery experience means I'm not seeing some of the possible forms.

ChristianGruen commented 1 month ago

Thanks, John. Michael’s comment in another issue may be helpful in this regard: https://github.com/qt4cg/qtspecs/issues/1345#issuecomment-2260290593

michaelhkay commented 1 month ago

I certainly think we must avoid any need for unbounded lookahead to make the grammar unambiguous.

michaelhkay commented 1 month ago

Another way of looking at this: when we find that the first token in an expression is element, and the next two tokens are N and {, where N is any NCName, then we need to decide whether N the start of something that follows the expression (which would then be element as a NameTest), or an element name. What are the circumstances in which an NCName can immediately follow an expression (or more specifically, a NameTest)? The obvious cases are (a) where N is an infix operator, or the first part of a compound infix operator such as cast as, (b) where N introduces a FLWOR clause (for, let, group, where, while, order, return), and (c) else. (Note that as the first token in a FLWOR clause, for and let cannot (currently) be followed by { - but this may change if we introduce destructuring assignment).

There are other cases that can be resolved by lookahead: consider for example order by element ascending ... or case array(*) return element case ... but these are all intrinsically error-prone.

michaelhkay commented 1 month ago

In the Qt4 tests, the main change I had to make was to attribute type {...} in some of the XQUpdate tests. There were also a couple of uses of element element {}, but I think that was deliberately testing edge cases. Note also, if you want the code to run in both 3.1 and 4.0, it's best to change it to attribute {"type"} {...} rather than `attribute "type" {...}.

I'm tempted to reopen issue #747 proposing QName literals, so that rather than using a string in quotes here, we use an explicit QName literal.

rhdunn commented 1 month ago

https://github.com/search?q=%22element+div%22+language%3AXQuery+&type=code returns 298 occurrences. I haven't checked other keywords.

ChristianGruen commented 1 month ago

I'm tempted to reopen issue #747 proposing QName literals, so that rather than using a string in quotes here, we use an explicit QName literal.

As it’s already possible to use the Q within the quotes…

element { 'Q{URI}div' } { }

…we should probably allow the same syntax without curly brackets: element 'Q{URI}div' {}. I could imagine that an additional QName literal element Q'{URI}div' {} could get confusing in practice.

johnlumley commented 1 month ago

Did you already have a chance to parse XQUF and XQFT?

I've managed to generate the current state XQUF grammar and test it against the 814 test expressions in the Update section of the current test suite (it involves some hacking to build on top of my XQuery support in my testdriver). As far as I can tell, there are no fundamental ambiguities in the grammar at present.

I'll have a look at the XQFT set next week (I must admit I'd never seen it before) but am slightly puzzled as to where I can get a reasonable corpus of expressions to test against. The only ones I can find are very very old (like 2007 vintage, pointed to from the 2015 spec) and not in QT3 format.

ChristianGruen commented 1 month ago

I'll have a look at the XQFT set next week (I must admit I'd never seen it before) but am slightly puzzled as to where I can get a reasonable corpus of expressions to test against. The only ones I can find are very very old (like 2007 vintage, pointed to from the 2015 spec) and not in QT3 format.

Yes, unfortunately there’s no newer version of this test suite. Back then, we have written a separate driver for this format (and we are still using it to run the old XQuery Update format).

rhdunn commented 1 month ago

@johnlumley You may run into https://www.w3.org/Bugs/Public/show_bug.cgi?id=30015. Since then I've modified that to:

CastExpr ::= TransformWithExpr ( "cast" "as" SingleType )?
TransformWithExpr ::= ArrowExpr ("transform" "with" "{" Expr? "}")?

IIRC, that's because that's the way BaseX specifies the precedence, allowing you to have all 3 (arrow, transform, cast) expressions together.

johnlumley commented 1 month ago

IIRC, that's because that's the way BaseX specifies the precedence, allowing you to have all 3 (arrow, transform, cast) expressions together.

The current grammar has the same ordering but with 40 has changed the SingleType to a more general type (mix of EBNF and iXML):

CastExpr    ::= TransformWithExpr ( "cast" "as" CastTarget "?"? )?
TransformWithExpr ::= ArrowExpr ("transform" "with" "{" Expr? "}")?
CastTarget  ::= TypeName | ChoiceItemType | EnumerationType

ChoiceItemType:-"(", s, ItemType, (-"|", s, ItemType)*, -")", s.
johnlumley commented 1 month ago

I'll have a look at the XQFT set next week (I must admit I'd never seen it before) but am slightly puzzled as to where I can get a reasonable corpus of expressions to test against. The only ones I can find are very very old (like 2007 vintage, pointed to from the 2015 spec) and not in QT3 format.

I've managed to both generate an iXML grammar for XQFT and a minimal QT4 test set from the 2007 tests (just only checking for parsing the expression and checking for anticipated static XP parsing errors). In this case the 685 test cases pass and most welcome, none of them show ambiguity!

I should caution that I've never encountered XQuery FullText before so while the expressions parse and the trees look reasonable, I have no deeper understanding of checking all the details. Next week I'll publish all the current iXML grammars, for XPath, XSLT Patterns, XQuery, XQuery Update and XQuery FullText

johnlumley commented 1 month ago

@ChristianGruen Is there any need to look at the scripting extension for XQ? I can't find any spec references, nor any test cases for it?

michaelhkay commented 1 month ago

I think the scripting extension can be regarded as dead. The people who were keen on it went off and did JSONiq, breaking compatibility with XQuery.

ChristianGruen commented 1 month ago

In the Qt4 tests, the main change I had to make was to attribute type {...} in some of the XQUpdate tests. There were also a couple of uses of element element {}, but I think that was deliberately testing edge cases.

One more constructor that may need to be fixed uses namespace in. As a side note, I believe K2-ComputeConElem-20 (element div {}) should return XPDY0002 instead of XPST0003.

I already had the chance to experience that the current change would compromise code that is used in production. Cleaning up such code is a hassle, and it is difficult to memorize the long list of keywords – and it may expand in the future – so I am more and more wondering if we should go a step further, drop the old EQName syntax, and restrict literal names to string literals and (for elements & attributes) qualified names:

CompNodeName    ::=  StringLiteral | URIQualifiedName | ("{" Expr "}")
CompNodeNCName  ::=  StringLiteral | ("{" Expr "}")

It is safe to argue this would be a massive breaking change. On the other hand, we have already opened the dam. It is generally worthwhile and recommended practice to clean up affected code consistently rather than selectively and randomly. It is ugly to read partially sanitized code snippets like element "div" { element p { ... } }; maybe we should not motivate that by being tentative.

If we do the step, it would definitely give us more freedom to introduce shorter language constructs as well as shorten existing ones. Examples:

XQUF: We could allow delete //x and rename $x as 'p' instead of delete nodes //x and rename node $x as 'p'. XQFT: "A" contains text "A" could be shortened to "A" contains "A".

ChristianGruen commented 1 month ago

The XQuery files that I have analyzed in a previous comment contain appr. 13,000 named node constructors in total. The most frequent cases are as follows (I have excluded results that may reveal customer-specific code):

The XQuery code (relying on some changes in the File Module that have not been published as version 2.0 yet):

let $regex := "(element|attribute|processing-instruction|namespace)\s+(allowing|ancestor|ancestor-or-self|and|array|as|at|attribute|base-uri|boundary-space|by|case|cast|castable|catch|child|collation|comment|construction|context|copy-namespaces|count|decimal-format|decimal-separator|declare|default|descendant|descendant-or-self|digit|div|document|document-node|element|else|empty|empty-sequence|encoding|end|enum|eq|every|except|exponent-separator|false|fixed|fn|following|following-sibling|for|function|ge|group|grouping-separator|gt|idiv|if|import|in|infinity|inherit|instance|intersect|is|item|items|key|keys|lax|le|let|lt|map|member|minus-sign|mod|module|namespace|namespace-node|NaN|ne|next|no-inherit|no-preserve|node|of|only|option|or|order|ordered|ordering|otherwise|pairs|parent|pattern-separator|per-mille|percent|preceding|preceding-sibling|preserve|previous|processing-instruction|record|return|satisfies|schema|schema-attribute|schema-element|self|sliding|some|stable|start|strict|strip|switch|text|then|to|treat|true|try|tumbling|type|typeswitch|union|unordered|validate|value|values|variable|version|when|where|while|window|xquery|zero-digit|ascending|descending|external|greatest|least)\s+\{"
let $regex := "(element|attribute|processing-instruction|namespace)\s+([0-9a-zA-Z_:]+)\s+\{"

let $matches := (
  for $file in file:descendants('PATH-TO-PROJECTS/')
  where matches($file, '\.xq(m|l|y|uery)?$')
  let $text := file:read-text($file, fallback := true())
  where matches($text, $regex)
  let $result := analyze-string($text, $regex) 
  return $result/*:match ! ('`' || string-join(*:group, ' ') || ' {`')
)
for $group in $matches
group by $match := $group
let $count := count($group)
order by $count descending
return `* {$count}x {$match}`
michaelhkay commented 1 month ago

I'm attempting to construct a list of keywords that can appear within an Expr by analysing the grammar. First attempt attached. find-keywords-0.txt

This produces the list

NaN allowing ancestor ancestor-or-self and array as ascending at attribute base-uri boundary-space by case cast castable catch child collation comment construction context copy-namespaces count decimal-format decimal-separator declare default descendant descendant-or-self descending digit div document document-node element else empty empty-sequence encoding end enum eq every except exponent-separator external false fn following following-sibling for function ge greatest group grouping-separator gt idiv if import in infinity inherit instance intersect is item key lax le least let lt map member minus-sign mod module namespace namespace-node ne next no-inherit no-preserve node of only option or order ordered ordering parent pattern-separator per-mille percent preceding preceding-sibling preserve previous processing-instruction record return satisfies schema schema-attribute schema-element self sliding some stable start strict strip switch text then to treat true try tumbling type typeswitch union unordered validate value variable version when where while window xquery zero-digit

Two immediate problems here: (a) some operators such as otherwise are missing; (b) I don't understand why decimal format properties such as exponent-separator have leaked into the list. Further work needed. It would also be nice to eliminate words such as "type" and "as" which can only appear as the second keyword of a keyword pair.

ChristianGruen commented 1 month ago

As axis names appear in the list: Can we say what would be an ambiguity for e.g. parent?

michaelhkay commented 1 month ago

We could certainly argue that axis names are always distinguishable by the following "::" token and therefore do not need to be in the list.

We could also eliminate words that are always followed by "(", such as item and node.

michaelhkay commented 1 month ago

My next iteration eliminates axis and decimal-format names and sorts out the productions listed under OperatorExpr, producing the list:

allowing and array as ascending at by case catch collation comment count default descending div document-node element else empty empty-sequence end enum eq every except false fn for function ge greatest group gt idiv if in intersect is item key le least let lt map member mod namespace-node ne next node only or order otherwise previous processing-instruction record return satisfies schema-attribute schema-element sliding some stable start switch text then true try tumbling typeswitch union value when where while window

find-keywords-1.txt

michaelhkay commented 1 month ago

Eliminating strings that are always followed by '(' reduces it to:

allowing and as ascending at by case catch collation count default descending div else empty end eq every except fn for function ge greatest group gt idiv in intersect is key le least let lt member mod ne next only or order otherwise previous return satisfies sliding some stable start switch then try tumbling union value when where while window

Eliminating strings that are always followed by some specific string other than '{' reduces it to:

and as ascending by case catch collation default descending else empty end every except fn for ge greatest in least let mod only or otherwise return satisfies stable start switch then try when where while

One could do some further tweaking of this list, for example let and for can't be followed by {, but my analysis of the grammar would need to go a bit deeper to detect that.

find-keywords-2.txt

johnlumley commented 1 month ago

Two immediate problems here: (a) some operators such as otherwise are missing; (b) I don't understand why decimal format properties such as exponent-separator have leaked into the list. Further work needed. It would also be nice to eliminate words such as "type" and "as" which can only appear as the second keyword of a keyword pair.

I'm trying a parallel search on my XQ iXML grammar. My first simple run finds 146 keywords, cf 137 for Michael's initial. The additional include some from the terminal symbols table (the iXML grammar has no lexer and therefore includes terminal grammars). But it also finds otherwise and pairs, keys, values and items from Modifier. Notably it doesn't find NaN

I'll keep working on refining the search, along similar lines to Michael's on the basis that an independent run might be helpful. Once it's done I can easily repeat on XQUF and XQFT.

michaelhkay commented 1 month ago

Umm no, that's clearly wrong because it has dropped div and idiv (while retaining mod). That's because div and idiv are followed by another string within a choice, rather than within a sequence. Fixing that gives

and as ascending by case catch collation default descending div else empty end eq every except fn for function ge greatest gt idiv in intersect is le least let lt mod ne only or otherwise return satisfies some stable start switch then try union when where while

Really, to do this properly we need to build the finite state automaton for the grammar and run queries on that. Does anyone have a tool to do that? I thought about converting the grammar into an XSD schema (every production becomes a model group, every terminal becomes an element particle), then export this as an SCM file, which contains an XML representation of the determinised FSA. But XSD doesn't allow any lookahead in the grammar so that's probably a non-starter.

johnlumley commented 4 weeks ago

I've taken a slightly different 'bottom-up' approach with the following steps:

followed by:

The set of CaseA+CaseB is currently producing 40 candidates:

and array attribute by case comment div document element else eq except ge gt idiv in intersect is le lt map mod namespace ne or ordered otherwise processing-instruction return satisfies text then to try type union unordered when where while

which on first inspection seems plausible, certainly collecting 'all the usual suspects', but as Mike points out, this is not an exhaustive examination.