Open johnlumley opened 3 months ago
I agree that this would be useful: every time I work on an ixml grammar for a pre-existing language, it seems I trip over a rule that says an identifier is anything that matches a given definition except for a reserved word, or the equivalent.
Some questions do arise for me. For concreteness assume we write the operator with the keyword except.
First, I wonder about the effect of such an operator on the expressive power of ixml.
If A and B are both regular, then A except B is regular. In this case the operator does not extend the expressive power of the language, although as JL points out it does make some things much terser and much less error-prone to write.
If A is context-free and B is regular, then L(A except B) = L(A) \ L(B) = L(A) ∩ ¬L(B). Since ¬L(B) (the complement of B) is also regular, we have (according to Bar-Hillel, Perles, and Shamir 1961) the result that L(A except B) is context-free, and again does not extend the expressive power of ixml.
If B is context-free, then its complement (¬L(B)) is not guaranteed context-free (I'm not sure it's even guaranteed context-sensitive), and its intersection with A won't be guaranteed context-free whether A is regular or context-free.
If except expressions are part of ixml, then neither A nor B is guaranteed context-free.
Some bright people have said that as a general rule it's best to use the weakest applicable tool, which I seem to have internalized in the stronger form of being nervous about anything that increases the expressive power of a notation. Whether that's a good principle or not, I think adding an operator that will allow an ixml grammar to recognize a context-sensitive language is a big step. We should think long and hard before agreeing to it.
Operationally we know that the cost of parsing input against A except B is the cost of parsing the input against A and then parsing it again against B, so, for an Earley parser, the same cost as parsing the input against A; B. That seems to suggest that for context-free A and B the cost is worst-case cubic.
The second question in my mind is: if we add a set-difference operator, should we also add an intersection operator and a negation / complementation operator?
Hmm. I was going to say that we need to think about this. But on further reflection, I think there is very little thinking to do. If we add a subtraction operator, we have also added the ability to express complementation and intersection.
The universal language (the set of all strings) can be written as U = ~[]*.
Given the except operator, the complement of any nonterminal N is ~[]* except N. Adding an explicit operator just makes it easier to write.
The except operator also allows us to express intersection: the intersection of L(A) and L(B) can be expressed as (A | B) except ((A except B) | (B except A)). So the question is not whether to make intersection and complementation expressible, but whether to provide convenient syntax for them.
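The two identities (complement as ~[]* except N, and intersection built from except and alternation) can be checked on a small finite model. What follows is a minimal sketch in Python, not part of any ixml processor: languages are modelled as plain sets of strings over a two-letter alphabet with a length bound, and the names ALPHABET, MAX_LEN, except_, A and B are all illustrative assumptions.

```python
from itertools import product

ALPHABET = "ab"   # assumed toy alphabet
MAX_LEN = 3       # length bound so that the "universal language" is finite

# U: every string over ALPHABET up to MAX_LEN (a bounded stand-in for ~[]*)
U = {"".join(p) for n in range(MAX_LEN + 1) for p in product(ALPHABET, repeat=n)}


def except_(a, b):
    """Set-difference reading of 'a except b': strings in a that are not in b."""
    return a - b


# Two arbitrary example languages inside U
A = {s for s in U if s.startswith("a")}
B = {s for s in U if s.endswith("b")}

# Complement of A, written only with except: U except A
complement_A = except_(U, A)

# Intersection, written only with except and union (alternation):
# (A | B) except ((A except B) | (B except A))
intersection = except_(A | B, except_(A, B) | except_(B, A))
```

On this model, intersection equals the built-in A & B, which is only a sanity check of the set algebra; it says nothing about parsing cost.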
It would be good to have implementation and user experience. If only there were a way to mark a grammar feature as a non-standard extension in an ixml grammar! Then we could gather practical experience with the operator without having to work with non-conforming ixml grammars or processors.
What @johnlumley describes reminded me of negative look-ahead in regular expressions. When doing a negative look-ahead, the parser first tries to parse the negative look-ahead pattern, and if that fails continues processing from the current character position. If parsing the negative look-ahead pattern succeeds, the expression in which this occurs fails to parse, and the character position is not advanced. The way a regular expression parser does this seems to be efficient: first try to parse the negative look-ahead, and only if that fails parse the rest of the pattern.
As @cmsmcq says, it is possible to specify an intersection with this.
Using De Morgan's law "A and B = not(not(A) or not(B))", this does indeed work: in JavaScript, "abcdcba".match(/(?!(?![ac])|(?![bc]))c/g) returns ['c', 'c'].
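The De Morgan construction can be tried with any regular expression engine that supports nested negative look-ahead. The following is a small sketch in Python rather than JavaScript. Whitespace inside a pattern is literal in Python's default mode (as it is in JavaScript), so the pattern is written without spaces, and the final atom is widened to [a-d] (an assumption made for illustration) so that the look-ahead alone decides which characters survive.

```python
import re

# not( not [ac]  or  not [bc] ), then consume one character from a-d.
# The outer negative look-ahead succeeds only where BOTH inner classes match,
# i.e. on the intersection of [ac] and [bc], which is just "c".
both = re.compile(r"(?!(?![ac])|(?![bc]))[a-d]")

result = both.findall("abcdcba")
```

Here result holds only the characters that lie in both classes.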
There is a difference between subtraction and negative look-ahead.
The pattern /(?![0-9])[0-9A-Za-z]+/ can be used to match a sequence of digits and letters that starts with a letter.
The equivalent grammar rule would be:
id : ["a"-"z"; "A"-"Z"; "0"-"9"]+ except ( ["0"-"9"] , ["a"-"z"; "A"-"Z"; "0"-"9"]* ) .
(Of course there is a much better way to write this, but that is not the point I am trying to make.) I wonder what this means for efficiency. The subtraction operator subtracts the second sub-language from the first. If the string to parse starts with a digit, we cannot stop parsing after the digit, because the rest of the expression still has to be matched. It might be more efficient to use
id : ["a"-"z"; "A"-"Z"; "0"-"9"]+ except ( ["0"-"9"] , ~[]* ) .
In most cases, this will be irrelevant, and the right-hand side of except will be a simple expression.
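To compare the look-ahead pattern with the two except formulations of id, here is a rough Python sketch that treats "A except B" as "the whole string matches A and does not match B". The helper matches_except and the constant names are hypothetical, and whole-string matching is a simplification of what an ixml processor would do inside a larger grammar.

```python
import re

# Left-hand side of except: one or more letters or digits
ALNUM_PLUS = re.compile(r"[A-Za-z0-9]+")
# First right-hand side: a digit followed by letters or digits
DIGIT_THEN_ALNUM = re.compile(r"[0-9][A-Za-z0-9]*")
# Second right-hand side: a digit followed by anything (the ~[]* variant)
DIGIT_THEN_ANY = re.compile(r"[0-9].*", re.DOTALL)
# The look-ahead pattern from the discussion
LOOKAHEAD = re.compile(r"(?![0-9])[0-9A-Za-z]+")


def matches_except(a, b, text):
    """True if the whole of text matches pattern a and does not match pattern b."""
    return a.fullmatch(text) is not None and b.fullmatch(text) is None


SAMPLES = ["abc", "a1", "1a", "9", "", "a-b"]

# For each sample: (first except form, second except form, look-ahead form)
verdicts = {
    s: (
        matches_except(ALNUM_PLUS, DIGIT_THEN_ALNUM, s),
        matches_except(ALNUM_PLUS, DIGIT_THEN_ANY, s),
        LOOKAHEAD.fullmatch(s) is not None,
    )
    for s in SAMPLES
}
```

On these samples the three forms agree. The sketch only shows that they accept the same strings; whether the ~[]* variant is actually cheaper depends on the parser.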
As an example, this afternoon I had to find the equivalent for
PITarget ::= Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))
That would be really easy with a subtraction operator.
In my use case of developing an iXML grammar for XPath expressions, I've encountered a case where the name of a FunctionCall can be any QName except a small number of reserved 'tokens', to avoid ambiguity against certain keywords of the language. For example if(fred) is not permitted as a function call, as of course it can be the start of an if-then-else clause, but i(fred) and iff(fred) can be. Similarly element(foo) is not permitted as a function call, as it is ambiguous against an item type element(foo), but elementType(foo) would be fine.
In the current iXML there is no means of describing 'match A unless B', which means trying to express this use case either involves exhaustive lists of rules:
or postprocessing the XML to choose the correct parse in the case of an ambiguity.
To overcome this issue and make iXML more flexible for such cases I suggest adding an extra operator, which might be termed either a 'subtraction' or a 'set-difference' operator:
where the subtraction production completes when the first term completes unless the second term completes at the same character position. In this way we can express our use case as something like:
So for example iff(foo) would be fine since the QName will complete on character 3, but the second term will complete for if on character 2. And for if(foo) both terms complete on character 2, so the FunctionName production doesn't complete and hence this doesn't match as a function call.
In the case of my Earley parser implementation, on encountering a subtraction factor, both left and right branches are predicted and then followed. When the left branch completes, propagation of its consequence to its nonterminal caller is delayed until either the right branch completes or all productions for that character position have been processed. If the right branch has completed at that character position, no further propagation of effect occurs; if the right branch hasn't completed, the consequence propagates as usual. Note that during the processing of productions for a given character position either the left or the right branch might complete first, so some simple bookkeeping of the 'completion' state is needed, but this hasn't proved to be an issue, at least in my implementation.
This operator has really only been tested on my XPath use case, but it seems to work fine.
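The "completes at the same character position" rule can be illustrated outside an Earley parser. The following is a minimal Python sketch, not the implementation described above: it computes the set of positions at which a simplified name could complete, subtracts the positions at which a reserved word completes, and then checks for a following "(". The regular expression NAME (an ASCII stand-in for QName), the two-word RESERVED list and all function names are assumptions made for illustration.

```python
import re

# ASCII-only stand-in for QName (assumption; real QNames are richer)
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")
# Illustrative subset of reserved words
RESERVED = ("if", "element")


def name_ends(text, start=0):
    """Every position at which the simplified name could complete from start."""
    m = NAME.match(text, start)
    if m is None:
        return set()
    # Every non-empty prefix of the longest match is itself a valid name.
    return set(range(start + 1, m.end() + 1))


def reserved_ends(text, start=0):
    """Every position at which some reserved word completes from start."""
    return {start + len(w) for w in RESERVED if text.startswith(w, start)}


def function_name_ends(text, start=0):
    """Name completions that survive subtraction of the reserved words."""
    return name_ends(text, start) - reserved_ends(text, start)


def starts_function_call(text):
    """True if some surviving name completion is directly followed by '('."""
    return any(text.startswith("(", end) for end in function_name_ends(text))
```

For iff(foo) the surviving completions include position 3, where the "(" follows, so the call is accepted. For if(foo) the only completion followed by "(" is position 2, and that one is removed by the reserved word.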
Brickbats, suggestions, reactions or other constructive criticisms welcome.