Open faassen opened 6 months ago
I agree that the appendix on tokenization in hthe 3.1 spec is very inadequate, which is why I put forward (and the WG accepted) a complete rewrite for 4.0. Please let me know if you think that it is an improvement. In particular, the phrase "that is consistent with the EBNF" has gone; I have complained about it for years. The rewrite wasn't intended to represent a change in the sense of requiring any changes to user queries or parser implementations, just to define the rules more precisely and avoid some of the confusion.
As for map{$m?a:true()}
the test description explicitly mentions that it is testing out edge cases in tokenization, and I suspect I wrote the test in order to force the question to be answered. I think under the 4.0 rules this is clearly a syntax error.
The Saxon parser accepts this one because it has a special rule in the tokenizer that a name following '?' cannot contain a colon. I guess that rule is there specifically to pass this test, and in a desperate attempt to make sense of the "consistent with the EBNF" clause. There's a similar special case for the name that follows "*:" - but that's OK because Wildcard is a terminal and probably the tokenizer should handle it as a lexical unit.
What you've revealed here is that the rewrite of the rules in 4.0 does actually change the interpretation of this edge case.
I have changed this test to throw XPST0003 in the qt4tests branch, as this is now clearly an error in 4.0. The situation for 3.1 is far less clear. But I'm going to change the Saxon parser to disallow it regardless of version, I think.
Also relevant: https://github.com/qt4cg/qtspecs/issues/327
Reading all this is a relief, thank you! I will just ignore that failling tests for now.
I was wondering about the new spec though:
for example,
BracedURILiteral does not quality because it appears only inURIQualifiedName>
quality should be qualify, right? And that true in XPath 4? In Xpath 3.1BracedURILiteral can also appear inside of wildcards, right?
This is going to be pretty long. I was fairly dismayed when I learned about this so please bear with me. I am going to talk about a test case, but I'm going to argue the XPath spec is ambiguous and unclear for this case, so perhaps this is more about improving the specification rather than interpreting it.
MapConstructor-025
Let's consider the test
MapConstructor-025
.The interesting bit here is:
map{$m?a:true()}
The test parses this as:
map { $m ? a : true() }
Now we turn to a piece of text in section A.2, which the description of the test references
The test interprets this as follows:
a:true
is the longest token, but it's only grammatical when the token isa
, so the longest possible token isa
.But that's not really independent tokenizing like a lexer does. Tokenizing is not aware of the grammar; tokenizing creates tokens that can be used by the grammar.
So another interpretation of this same text could be:
a:true
is the longest possible token allowed by the grammar, soa:true
is the token. It's true that such a token leads to the above code not to be parsable, but that's okay: all kinds of sequences of correct tokens are not parsable.What is a token?
But what is a token in the first place? We go back to A.2, which is titled "Lexical structure" and as we have seen, talks about tokenizing. It describes terminal symbols, at first sight a decent candidate for tokens.
What kind of terminal symbol is
a:true
? It would be a prefixed qname, as described by theQName
production.QName
is described as a terminal symbol.QName
refers to the grammar in the XML spec itself, where aQName
isPrefixedName | UnprefixedName
Both are composed fromNCName
. AnUnprefixedName
is basically aNCName
.But wait. The XPath spec also describes an
NCName
by itself as a terminal symbol. So when we encounter anNCName
in the grammar, is it to be interpreted as aQName
or not? That cannot actually be determined without using the grammar.So it appears that the terminal symbols are not really describing tokens at all.
It's unclear what is meant by "tokenizing".
What is a terminal symbol?
As a digression: A.2.1 says "Terminal symbols" but that evidently isn't the complete list of terminal symbols, as
A.2.2
introduces a lot more delimiting and non-delimiting terminal symbols. Which I've interpreted as tokens, but you can't quite do that. Almost but not quite.For instance, there is also an ambiguity with
NCName
for reserved function names such asif
.Is
if
anNCName
or not? If it's used as a functionif()
, it's not, but in other contexts it is a valid ncname, and in another context it's a keyword in the language.My adventure
In my XPath engine I have a separate lexer (tokenizing) and parsing phase. This is pretty common in programming language construction. The lexer produces tokens and the grammar then defines how those tokens combine into an AST. Whitespace is handled in the lexing phase and then disregarded during parsing.
This works pretty well in most cases, but I noticed it wasn't working well for some cases where
ws: explicit
is in use. It also didn't work well for QNames:foo : bar
, with whitespace, could incorrectly be interpreted as a valid QName.So I thought, let's handle these cases in the lexer. So now the lexer produces various new tokens, like
URIQualifiedName
and even wildcards (likefoo:*
) are interpreted as tokens as the "ws: explicit" rule is in place for them too.But what to do about
QName
? As we discussed before, without the grammar it is ambiguous whether a givenNCName
is aQName
or not. But, I thought, it's not ambiguous whether something is aPrefixedName
.So I introduced a token for that. And previous cases where the test were failing are now passing. But to my dismay,
MapConstructor-025
started failing.I am unclear how to resolve this: I really don't want to make the whole grammar whitespace aware for the few exceptional cases; I want to keep handling this in the lexer. But handling
PrefixedName
as a token leads to new trouble.MapConstructor-025
takes one interpretation of the specification, but I really think the specification is ambiguous as it doesn't define what a token is. I thinkPrefixedName
is a perfectly reasonable interpretation of "longest possible match according to the EBNF" .What is a terminal?
Does this really say much at all beyond "if we put the rule in the grammar for terminals, it's a terminal, but if we put it in the main grammar it's not"?
Concrete steps
I'd love for the code in
MapConstructor-025
to be considered invalid grammatically, but I don't know whether it can be, and the specification is unclear anyway.I think a concrete step is to make the specification more clear. What's a token? (or avoid that word altogether). Is a terminal symbol a token?
If the interpretation in
MapConstructor-025
is correct, make explicit thatQName
orPrefixedName
cannot be a token (and safely tucked away in the XML specification), and thatNCName
is ambiguous on the token level and can only be resolved grammatically.I'm going to have to think hard for a while to find a doable way to make
MapConstructor-025
work in my implementation.