Tokenization and element constructors

michaelhkay commented 3 months ago

The new rules in Appendix A.3 on tokenization are, I believe, a great improvement on what went before. But I think there is one thing missing: they claim that the rules allow you to identify boundaries between tokens unambiguously, independently of the syntactic context, but in the case of a token starting with <, this isn't true: to distinguish whether < represents a less-than operator (or <= operator) or the start of an element constructor, you need some context information.

Saxon's tokenization is still based on the principles outlined in the XPath 1.0 spec where tokens are disambiguated based on the immediately preceding and following tokens; this is becoming increasingly unviable. Most cases can be handled instead by moving the disambiguation into the parser rather than the tokenizer, but this relies on being able to find token boundaries without knowledge of context (as described in the 4.0 spec), which appears to be possible in all cases except <.

Essentially we need to add an exception to the rule: "If the current position is not the end of the input, then return the longest [literal terminal] or [variable terminal] that can be matched starting at the current position..."

I think the exception might be formulated as follows:

In XQuery, when the next character is < and this is immediately followed by an NCNameStart character (for example X), the next token could be either a less-than operator or the start of a DirElemConstructor. The "longest terminal" rule cannot reliably distinguish these cases. Instead, the decision must take into account the syntactic context. A DirElemConstructor can only appear where the parser is expecting to read an expression, while the less-than operator can never appear where the parser is expecting an expression. This aspect of the syntactic context therefore needs to be communicated from the parser to the tokenizer.

Alternatively, the two cases might be distinguished by backtracking. The tokenizer could attempt to interpret the text following the < character as a DirElemConstructor, and revert to the alternative interpretation if this fails.
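
To make the first option concrete, here is a minimal Python sketch of a tokenizer that receives one bit of context from the parser. The function name classify_lt, the expecting_expression flag, and the ASCII-only name pattern are all assumptions made for illustration; they are not taken from the spec or from any implementation, and the sketch ignores PI and comment constructors.

```python
import re

# ASCII-only approximation of an NCName; the real XML definition is much wider.
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def classify_lt(text: str, pos: int, expecting_expression: bool) -> str:
    """Classify the token that starts at text[pos], which must be '<'.

    expecting_expression is a hypothetical flag passed down by the parser:
    True where an expression may start, False where an operator is expected.
    """
    if text[pos] != "<":
        raise ValueError("classify_lt must be called at a '<' character")
    if not expecting_expression:
        # A constructor cannot appear here, so this must be an operator.
        for op in ("<=", "<<"):
            if text.startswith(op, pos):
                return op
        return "<"
    # An operator cannot appear here; only element constructors are handled.
    if NAME.match(text, pos + 1):
        return "DirElemConstructor"
    raise SyntaxError(f"unexpected '<' at offset {pos}")


print(classify_lt("$a <b", 3, expecting_expression=False))
print(classify_lt("<b/>", 0, expecting_expression=True))
```

The same characters "<b" are classified differently depending only on the flag, which is exactly the dependency on syntactic context described above.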

Note: this was not explained clearly in 3.1. Perhaps it was covered by the quixotic phrase "the longest token consistent with the EBNF".

michaelhkay commented 3 months ago

Perhaps it's simplest to stick to the "longest token" formulation rather than making it dependent on syntactic context. I suspect that if the string starting at < matches one of:

^<\i\c*\s*> (for example <elem>)
^<\i\c*\s*/> (for example <elem/>)
^<\i\c*\s+\i\c*\s*= (for example <elem att=)

then it can only be the start of a DirElemConstructor, and if it doesn't match one of these, then it can only be a < operator (having ruled out things like <= and <?pi?> that are easily eliminated).

That is to say, if the string at the current position starts with a valid DirElemConstructor, or even if it starts with one of the above substrings, then we can assume it's got to be a DirElemConstructor.
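
A rough Python sketch of this context-free lookahead follows. Python's re module has no \i and \c escapes, so the sketch substitutes ASCII-only character classes for them; that substitution, and the name starts_element_constructor, are assumptions for illustration only.

```python
import re

# ASCII-only stand-ins for the XML Schema regex escapes \i (name start
# character) and \c (name character). Real XML names allow far more.
I = r"[A-Za-z_:]"
C = r"[A-Za-z0-9_:.\-]"

ELEMENT_START = re.compile(
    rf"<{I}{C}*\s*>"               # <elem>
    rf"|<{I}{C}*\s*/>"             # <elem/>
    rf"|<{I}{C}*\s+{I}{C}*\s*="    # <elem att=
)


def starts_element_constructor(text: str, pos: int) -> bool:
    """Return True if the text at pos matches one of the three patterns."""
    return ELEMENT_START.match(text, pos) is not None


for sample in ("<elem>", "<elem/>", "<elem att='1'>", "<y and $z"):
    print(sample, starts_element_constructor(sample, 0))
```

Anything that fails all three patterns would be tokenized as the < operator, once <=, <? and similar multi-character tokens have been ruled out separately.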

Not easy to prove this though...

michaelhkay commented 3 months ago

I think there might also be difficulties involving processing instruction constructors.

For example 1 <?x cast as xs:boolean?> 0

The contents of a PI constructor are less constrained than an element constructor, so I think this might not be tokenisable without knowing the syntactic context.

michaelhkay commented 3 months ago

I have added this test case Constr-pi-content-9:

map{'a':<z>4</z>, 'b':<z>6</z>} ! (?a <?b and ?a treat as node()?>>?a)

This parses correctly (and evaluates to false) if < is treated as a less-than operator, but fails to parse if <? ... ?> is tokenized as a PI constructor: that is, the "longest token" rule, if applied regardless of syntactic context, leads to a parse failure. Note this applies equally to 3.1.

michaelhkay commented 3 months ago

I was looking for discussion of tokens starting with < in the spec, and all I found (in 3.1) was under the "leading-lone-slash" section in A.1.2, which mentions "and the < token could be either an operator or the start of a [DirectConstructor]".

The significance is that after a leading "/", you don't know whether the next token will be an operator, or the start of an expression, so the syntactic context doesn't help you decipher something like / <?x and ?x instance of node()?>

The leading-lone-slash constraint says "if the token immediately following a slash can form the start of a [RelativePathExpr], then ..." which rather assumes that you know what the token is without reference to the syntactic context.

So I'm inclined to go for a rule that recognises direct element and PI constructors independently of the syntactic context.

rhdunn commented 3 months ago

When implementing the tokenizer for my XQuery plugin, I ended up adding a "maybe open XML element tag" pseudo-token which matches:

MaybeOpenXmlElementTag ::= "<" QName ( "/>" | ">" | NameStartChar )

If I don't see the QName then I'm treating it as the "<" token. If it matches the MaybeOpenXmlElementTag token I know it is an XML start element, so will retokenize it as such. [1]

For "</", "<?", etc. I'm applying the maximal match rule, so will identify those as single tokens accordingly.

NOTE: My plugin also keeps this token for incomplete variants in order to properly handle syntax highlighting within IntelliJ.

[1] I found that the ws:explicit annotation on DirElemConstructor is not enough. I interpreted it as meaning that "1 < a" is not an element start but "1 <a" is. However, implementors are working on the basis that if it parses as an element, it is an element; hence my use of the MaybeOpenXmlElementTag pseudo-token.

michaelhkay commented 3 months ago

MaybeOpenXmlElementTag ::= "<" QName ( "/>" | ">" | NameStartChar )

That is indeed very close to what I'm proposing, except I take it a bit further and require

MaybeOpenXmlElementTag ::= "<" QName ( "/>" | ">" | ( QName "=" ) )

Otherwise you could get a false match on

$x <for member ...

or other similar expressions that start with two names.

Also note whitespace can appear before the > or /> but not immediately after the <.
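
The difference between the two formulations can be checked with a small Python sketch. The regular expressions are ASCII-only approximations of QName and NameStartChar, and the placement of optional or mandatory whitespace (none after <, optional before > and />, required between the two names) is my reading of the comments above rather than anything normative; the names LOOSE, STRICT and matches are hypothetical.

```python
import re

# ASCII-only approximations; real XML names allow many more characters.
NCNAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"
QNAME = rf"{NCNAME}(?::{NCNAME})?"
NAME_START = r"[A-Za-z_:]"

# "<" QName ( "/>" | ">" | NameStartChar )
LOOSE = re.compile(rf"<{QNAME}(?:\s*/>|\s*>|\s+{NAME_START})")

# "<" QName ( "/>" | ">" | ( QName "=" ) )
STRICT = re.compile(rf"<{QNAME}(?:\s*/>|\s*>|\s+{QNAME}\s*=)")


def matches(rule: re.Pattern, text: str, pos: int) -> bool:
    """Return True if rule matches text starting exactly at pos."""
    return rule.match(text, pos) is not None


query = "$x <for member $m in $y"
lt = query.index("<")
print("loose: ", matches(LOOSE, query, lt))
print("strict:", matches(STRICT, query, lt))
```

In this sketch the looser rule accepts "<for member" as an element start, while the stricter rule rejects it because no = follows the second name.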

qt4cg / qtspecs

Tokenization and element constructors #1311