MapConstructor-025. What is a token? What is a terminal symbol?

faassen commented 6 months ago

This is going to be pretty long. I was fairly dismayed when I learned about this so please bear with me. I am going to talk about a test case, but I'm going to argue the XPath spec is ambiguous and unclear for this case, so perhaps this is more about improving the specification rather than interpreting it.

MapConstructor-025

Let's consider the test MapConstructor-025.

let $m := map{'a':1} return map:size(map{$m?a:true()})

The interesting bit here is:

map{$m?a:true()}

The test parses this as:

map { $m ? a : true() }

Now we turn to a piece of text in section A.2, which the description of the test references

When tokenizing, the longest possible match that is consistent with the EBNF is used.

The test interprets this as follows: a:true is the longest token, but it's only grammatical when the token is a, so the longest possible token is a.

But that's not really independent tokenizing like a lexer does. Tokenizing is not aware of the grammar; tokenizing creates tokens that can be used by the grammar.

So another interpretation of this same text could be:

a:true is the longest possible token allowed by the grammar, so a:true is the token. It's true that such a token leads to the above code not to be parsable, but that's okay: all kinds of sequences of correct tokens are not parsable.

What is a token?

But what is a token in the first place? We go back to A.2, which is titled "Lexical structure" and as we have seen, talks about tokenizing. It describes terminal symbols, at first sight a decent candidate for tokens.

What kind of terminal symbol is a:true? It would be a prefixed qname, as described by the QName production. QName is described as a terminal symbol.

QName refers to the grammar in the XML spec itself, where a QName is PrefixedName | UnprefixedName Both are composed from NCName. An UnprefixedName is basically a NCName.

But wait. The XPath spec also describes an NCName by itself as a terminal symbol. So when we encounter an NCName in the grammar, is it to be interpreted as a QName or not? That cannot actually be determined without using the grammar.

So it appears that the terminal symbols are not really describing tokens at all.

It's unclear what is meant by "tokenizing".

What is a terminal symbol?

As a digression: A.2.1 says "Terminal symbols" but that evidently isn't the complete list of terminal symbols, as A.2.2 introduces a lot more delimiting and non-delimiting terminal symbols. Which I've interpreted as tokens, but you can't quite do that. Almost but not quite.

For instance, there is also an ambiguity with NCName for reserved function names such as if.

Is if an NCName or not? If it's used as a function if(), it's not, but in other contexts it is a valid ncname, and in another context it's a keyword in the language.

My adventure

In my XPath engine I have a separate lexer (tokenizing) and parsing phase. This is pretty common in programming language construction. The lexer produces tokens and the grammar then defines how those tokens combine into an AST. Whitespace is handled in the lexing phase and then disregarded during parsing.

This works pretty well in most cases, but I noticed it wasn't working well for some cases where ws: explicit is in use. It also didn't work well for QNames: foo : bar, with whitespace, could incorrectly be interpreted as a valid QName.

So I thought, let's handle these cases in the lexer. So now the lexer produces various new tokens, like URIQualifiedName and even wildcards (like foo:*) are interpreted as tokens as the "ws: explicit" rule is in place for them too.

But what to do about QName? As we discussed before, without the grammar it is ambiguous whether a given NCName is a QName or not. But, I thought, it's not ambiguous whether something is a PrefixedName.

So I introduced a token for that. And previous cases where the test were failing are now passing. But to my dismay, MapConstructor-025 started failing.

I am unclear how to resolve this: I really don't want to make the whole grammar whitespace aware for the few exceptional cases; I want to keep handling this in the lexer. But handling PrefixedName as a token leads to new trouble.

MapConstructor-025 takes one interpretation of the specification, but I really think the specification is ambiguous as it doesn't define what a token is. I think PrefixedName is a perfectly reasonable interpretation of "longest possible match according to the EBNF" .

What is a terminal?

[Definition: A terminal is a symbol or string or pattern that can appear in the right-hand side of a rule, but never appears on the left-hand side in the main grammar, although it may appear on the left-hand side of a rule in the grammar for terminals.]

Does this really say much at all beyond "if we put the rule in the grammar for terminals, it's a terminal, but if we put it in the main grammar it's not"?

Concrete steps

I'd love for the code in MapConstructor-025 to be considered invalid grammatically, but I don't know whether it can be, and the specification is unclear anyway.

I think a concrete step is to make the specification more clear. What's a token? (or avoid that word altogether). Is a terminal symbol a token?

If the interpretation in MapConstructor-025 is correct, make explicit that QName or PrefixedName cannot be a token (and safely tucked away in the XML specification), and that NCName is ambiguous on the token level and can only be resolved grammatically.

I'm going to have to think hard for a while to find a doable way to make MapConstructor-025 work in my implementation.

michaelhkay commented 6 months ago

I agree that the appendix on tokenization in hthe 3.1 spec is very inadequate, which is why I put forward (and the WG accepted) a complete rewrite for 4.0. Please let me know if you think that it is an improvement. In particular, the phrase "that is consistent with the EBNF" has gone; I have complained about it for years. The rewrite wasn't intended to represent a change in the sense of requiring any changes to user queries or parser implementations, just to define the rules more precisely and avoid some of the confusion.

As for map{$m?a:true()} the test description explicitly mentions that it is testing out edge cases in tokenization, and I suspect I wrote the test in order to force the question to be answered. I think under the 4.0 rules this is clearly a syntax error.

The Saxon parser accepts this one because it has a special rule in the tokenizer that a name following '?' cannot contain a colon. I guess that rule is there specifically to pass this test, and in a desperate attempt to make sense of the "consistent with the EBNF" clause. There's a similar special case for the name that follows "*:" - but that's OK because Wildcard is a terminal and probably the tokenizer should handle it as a lexical unit.

What you've revealed here is that the rewrite of the rules in 4.0 does actually change the interpretation of this edge case.

michaelhkay commented 6 months ago

I have changed this test to throw XPST0003 in the qt4tests branch, as this is now clearly an error in 4.0. The situation for 3.1 is far less clear. But I'm going to change the Saxon parser to disallow it regardless of version, I think.

michaelhkay commented 6 months ago

Also relevant: https://github.com/qt4cg/qtspecs/issues/327

faassen commented 5 months ago

Reading all this is a relief, thank you! I will just ignore that failling tests for now.

I was wondering about the new spec though:

for example, BracedURILiteral does not quality because it appears only in URIQualifiedName>

quality should be qualify, right? And that true in XPath 4? In Xpath 3.1BracedURILiteral can also appear inside of wildcards, right?

w3c / qt3tests