invisibleXML / ixml

Invisible XML

A 'Subtraction' operator #249

Open johnlumley opened 3 months ago

johnlumley commented 3 months ago

In my use case of developing an iXML grammar for XPath expressions, I've encountered a case where the name of a FunctionCall can be any QName except a small number of reserved 'tokens', to avoid ambiguity against certain keywords of the language. For example if(fred) is not permitted as a function call, as of course it can be the start of an if-then-else clause, but i(fred) and iff(fred) can be. Similarly element(foo) is not permitted as a function call, as it is ambiguous against an item type element(foo), but elementType(foo) would be fine.

In the current iXML there is no means of describing 'match A unless B', which means trying to express this use case either involves exhaustive lists of rules:

functionNameI: "i"; "i",~["f"]; "if",[L]+.

or postprocessing the XML to choose the correct parse in the case of an ambiguity.

To overcome this issue and make iXML more flexible for such cases I suggest adding an extra operator, which might be termed either a 'subtraction' or a 'set-difference' operator:

-factor: terminal;
         nonterminal;
         insertion;
         subtraction;
         -"(", s, alts, -")", s.

subtraction: term, -"¬", term.

where the subtraction production completes when the first term completes unless the second term completes at the same character position. In this way we can express our use case as something like:

FunctionName: QName ¬ ("if";"element"....).

So for example iff(foo) would be fine since the QName will complete on character 3, but the second term will complete for if on character 2. And for if(foo) both terms complete on character 2, so the FunctionName production doesn't complete and hence this doesn't match as a function call.
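The "both terms complete at the same position" rule can be illustrated with a minimal Python sketch. It is not part of the proposal and not an iXML implementation: `QNAME`, `RESERVED` and `is_function_name` are hypothetical names, the QName pattern is a deliberately simplified approximation, and only the two reserved names mentioned above are listed.

```python
import re

# Simplified stand-in for QName (an assumption for illustration only;
# the real QName production allows many more characters).
QNAME = re.compile(r"(?:[A-Za-z_][A-Za-z0-9_.-]*:)?[A-Za-z_][A-Za-z0-9_.-]*")

# Only the two reserved names given in the text; the full list is elided there.
RESERVED = re.compile(r"if|element")


def is_function_name(text: str, start: int, end: int) -> bool:
    """True if QName matches text[start:end] and no reserved name matches that same span."""
    span = text[start:end]
    left = QNAME.fullmatch(span) is not None      # first term completes at `end`
    right = RESERVED.fullmatch(span) is not None  # second term completes at `end`
    return left and not right


print(is_function_name("iff(foo)", 0, 3))  # QName ends at 3, "if" ended at 2
print(is_function_name("if(foo)", 0, 2))   # both terms end at 2
```

The sketch checks one candidate span at a time; a real parser would consider every position at which the first term can complete.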

In the case of my Earley parser implementation, on encountering a subtraction factor, both left and right branches are predicted and then followed. When the left branch completes, propagation of its consequence to its nonterminal caller is delayed until either the right branch completes or all productions for that character position have been processed. If the right branch has completed at that character position, no further propagation of effect occurs; if the right branch hasn't completed, the consequence propagates as usual. Note that it's possible that during the processing of productions for a given character position, either the left or right branches might complete processing first, so some simple logical processing of the 'completion' state needs to be handled, but this hasn't proved to be an issue, at least in my implementation.
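The delayed propagation described above might be sketched roughly as follows. This is an assumption-laden illustration, not the actual implementation: `SubtractionGate` and its methods are hypothetical names, and for simplicity the sketch always waits until every item at a position has been processed before deciding, rather than resolving as soon as the right branch completes.

```python
from dataclasses import dataclass, field


@dataclass
class SubtractionGate:
    """Records, per (origin, position), which branches of `A ¬ B` have completed."""
    left_done: set = field(default_factory=set)
    right_done: set = field(default_factory=set)

    def complete_left(self, origin: int, pos: int) -> None:
        # The left branch completed; hold the result instead of propagating it.
        self.left_done.add((origin, pos))

    def complete_right(self, origin: int, pos: int) -> None:
        # The right branch completed; this vetoes a left completion on the same span.
        self.right_done.add((origin, pos))

    def resolve(self, pos: int) -> list:
        """Call after all items at `pos` are processed; returns the origins to propagate."""
        return sorted(
            origin
            for (origin, p) in self.left_done
            if p == pos and (origin, pos) not in self.right_done
        )


# Parsing "iff(foo)" from position 0: "if" completes both branches at 2,
# while only the left branch (QName) completes at 3.
gate = SubtractionGate()
gate.complete_right(0, 2)  # arrival order within a position does not matter
gate.complete_left(0, 2)
gate.complete_left(0, 3)
print(gate.resolve(2))  # []
print(gate.resolve(3))  # [0]
```

Because the decision is taken only in `resolve`, it does not matter whether the left or the right branch happens to complete first within a position.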

This operator has really only been tested on my XPath use case, but it seems to work fine.

Brickbats, suggestions, reactions or other constructive criticisms welcome.

cmsmcq commented 3 months ago

I agree that this would be useful: every time I work on an ixml grammar for a pre-existing language, it seems I trip over a rule that says an identifier is anything that matches a given definition except for a reserved word, or the equivalent.

Some questions do arise for me. For concreteness assume we write the operator with the keyword except.

First, I wonder about the effect of such an operator on the expressive power of ixml.

Some bright people have said that as a general rule it's best to use the weakest applicable tool, which I seem to have internalized in the stronger form of being nervous about anything that increases the expressive power of a notation. Whether that's a good principle or not, I think adding an operator that will allow an ixml grammar to recognize a context-sensitive language is a big step. We should think about it long and hard before agreeing to it.

Operationally we know that the cost of parsing input against A except B is the cost of parsing the input against A and then parsing it again against B -- so, for an Earley parser, the same cost as parsing the input against A; B. That seems to suggest that for context-free A and B the cost is worst-case cubic.

The second question in my mind is: if we add a set-difference operator, should we also add an intersection operator and a negation / complementation operator?

Hmm. I was going to say that we need to think about this. But on further reflection, I think there is very little thinking to do. If we add a subtraction operator, we have also added the ability to express complementation and intersection.
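One standard way to see this is through the set identities A ∩ B = A − (A − B) and complement(A) = Σ* − A. The following Python sketch checks them on a small finite universe of strings; the names `UNIVERSE`, `subtract`, `intersect` and `complement` are illustrative only, and a finite set of strings stands in for a language.

```python
from itertools import product

ALPHABET = "ab"
MAX_LEN = 3

# All strings over ALPHABET up to MAX_LEN: a finite stand-in for Sigma*.
UNIVERSE = {"".join(p) for n in range(MAX_LEN + 1) for p in product(ALPHABET, repeat=n)}


def subtract(a: set, b: set) -> set:
    """The only primitive: everything in `a` that is not in `b`."""
    return {w for w in a if w not in b}


def complement(a: set) -> set:
    """Complement expressed as 'everything except a'."""
    return subtract(UNIVERSE, a)


def intersect(a: set, b: set) -> set:
    """Intersection expressed with two subtractions: a - (a - b)."""
    return subtract(a, subtract(a, b))


A = {w for w in UNIVERSE if w.startswith("a")}
B = {w for w in UNIVERSE if w.endswith("b")}
print(sorted(intersect(A, B)))
```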

So the question is not whether to make intersection and complementation expressible, but whether to provide convenient syntax for them.

It would be good to have implementation and user experience. If only there were a way to mark a grammar feature as a non-standard extension in an ixml grammar! Then we could gather practical experience with the operator without having to work with non-conforming ixml grammars or processors.

nverwer commented 4 weeks ago

What @johnlumley describes reminded me of negative look-ahead in regular expressions. When doing a negative look-ahead, the parser first tries to parse the negative look-ahead pattern, and if that fails, continues processing from the current character position. If parsing the negative look-ahead pattern succeeds, the expression in which this occurs fails to parse, and the character position is not advanced. The way a regular expression parser does this seems to be efficient: first try to parse the negative look-ahead, and only if that fails parse the rest of the pattern.

As @cmsmcq says, it is possible to specify an intersection with this, using De Morgan's law "A and B = not(not(A) or not(B))". Indeed, in JavaScript, "abcdcba".match(/(?!(?![ac])|(?![bc]))c/g) returns ['c', 'c'].
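The same De Morgan construction can be tried in Python's `re` module, which also supports nested negative look-ahead. In this sketch the final `c` of the pattern is replaced by `.` (a change from the example above), so that the look-ahead alone decides which characters match; `INTERSECTION` is just an illustrative name.

```python
import re

# not( not([ac]) or not([bc]) ) followed by any single character:
# only characters that are in both [ac] and [bc] can match.
INTERSECTION = re.compile(r"(?!(?![ac])|(?![bc])).")

print(INTERSECTION.findall("abcdcba"))  # ['c', 'c']
```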

There is a difference between subtraction and negative look-ahead. The pattern /(?![0-9])[0-9A-Za-z]+/ can be used to match a sequence of digits and letters that starts with a letter. The equivalent grammar rule would be:

id : ["a"-"z"; "A"-"Z"; "0"-"9"]+ except ( ["0"-"9"] , ["a"-"z"; "A"-"Z"; "0"-"9"]* ) .

(Of course there is a much better way to write this, but that is not the point I am trying to make.) I wonder what this means for efficiency. The subtraction operator subtracts the second sub-language from the first. If the string to parse starts with a digit, we cannot stop parsing after the digit, because the rest of the right-hand expression still has to be matched. It might be more efficient to use

id : ["a"-"z"; "A"-"Z"; "0"-"9"]+ except ( ["0"-"9"] , ~[]* ) .
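That the two right-hand sides accept the same strings can be checked with a small Python sketch. Regular expressions stand in for the iXML terms here, which is an assumption for illustration only; `LEFT`, `RIGHT_FULL`, `RIGHT_SHORT` and `accepts` are hypothetical names, and nothing in the sketch measures parsing cost.

```python
import re

ALNUM = r"[a-zA-Z0-9]"

LEFT = re.compile(rf"{ALNUM}+")                # letters and digits, one or more
RIGHT_FULL = re.compile(rf"[0-9]{ALNUM}*")     # digit, then letters and digits
RIGHT_SHORT = re.compile(r"[0-9].*", re.DOTALL)  # digit, then anything (like ~[]*)


def accepts(span: str, right: re.Pattern) -> bool:
    """`LEFT except right`: LEFT matches the whole span and `right` does not."""
    return LEFT.fullmatch(span) is not None and right.fullmatch(span) is None


for candidate in ["abc1", "1abc", "a", "9", "a-b"]:
    print(candidate, accepts(candidate, RIGHT_FULL), accepts(candidate, RIGHT_SHORT))
```

The looser right-hand side gives the same result because the subtraction only matters on spans that the left-hand side already accepts, and those contain nothing but letters and digits.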

In most cases, this will be irrelevant, and the right-hand side of except will be a simple expression. As an example, this afternoon I had to find the equivalent for:

PITarget  ::=  Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))

That would be really easy with a subtraction operator.