invisibleXML / ixml

Invisible XML
GNU General Public License v3.0
52 stars 8 forks source link

A 'Subtraction' operator #249

Open johnlumley opened 6 months ago

johnlumley commented 6 months ago

In my use case of developing an iXML grammar for XPath expressions, I've encountered a case where the name of a FunctionCall can be any QName except a small number of reserved 'tokens', to avoid ambiguity against certain keywords of the language. For example if(fred) is not permitted as a function call, as of course it can be the start of an if-then-else clause, but i(fred) and iff(fred) can be. Similarly element(foo) is not permitted as a function call, as it is ambiguous against an item type element(foo), but elementType(foo) would be fine.

In the current iXML there is no means of describing 'match A unless B', which means trying to express this use case either involves exhaustive lists of rules:

functionNameI: "i"; "i",~["f"]; "if",[L]+.

or postprocessing the XML to choose the appropriate correct parse in the case of an ambiguity.

To overcome this issue and make iXML more flexible for such cases I suggest adding an extra operator, which might be termed either a 'subtraction' or a 'set-difference' operator:

-factor: terminal;
         nonterminal;
         insertion;
         subtraction;
         -"(", s, alts, -")", s.

subtraction: term, -"¬", term.

where the subtraction production completes when the first term completes unless the second term completes at the same character position. In this way we can express our use case as something like:

FunctionName: QName ¬ ("if";"element"....).

So for example iff(foo) would be fine since the QName will complete on character 3, but the second term will complete for if on character 2. And for if(foo) both terms complete on character 2, so the FunctionName production doesn't complete and hence this doesn't match as a function call.

In the case of my Earley parser implementation, on encountering a subtraction factor, both left and right branches are predicted and then followed. When the left branch completes, propagation of its consequence to its nonterminal caller is delayed until either the right branch completes or all productions for that character position have been processed. If the right branch has completed at that character position, no further propagation of effect occurs; if the right branch hasn't completed, the consequence propagates as usual. Note that it's possible that during the processing of productions for a given character position, either the left or right branches might complete processing first, so some simple logical processing of the 'completion' state needs to be handled, but this hasn't proved to be an issue, at least in my implementation.

This operator has really only been tested on my XPath use case, but it seems to work fine.

Brickbats, suggestions, reactions or other constructive criticisms welcome.

cmsmcq commented 6 months ago

I agree that this would be useful: every time I work on an ixml grammar for a pre-existing language, it seems I trip over a rule that says and identifier is anything that matches a given definition except for a reserved word, or the equivalent.

Some questions do arise for me. For concreteness assume we write the operator with the keyword except.

First, I wonder about the effect of such an operator on the expressive power of ixml.

Some bright people have said that as a general rule it's best to use the weakest applicable tool, which I seem to have internalized in the stronger form of being nervous about anything that increases the expressive power of a notation. Whether that's a good principle or not, I think adding an operator that will allow an ixml grammar to recognize a context-sensitive language is a big step. We should think about long and hard before agreeing to it.

Operationally we know that the cost of parsing input against A except B is the cost of parsing the input against A and then parsing it again against B -- so, for an Earley parser, the same cost as parsing the input against A; B. That seems to suggest that for context-free A and B the cost is worst-case cubic.

The second question in my mind is: if we add a set-difference operator, should we also add an intersection operator and a negation / complementation operator?

Hmm. I was going to say that we need to think about this. But on further reflection, I think there is very little thinking to do. If we add a subtraction operator, we have also added the ability to express complementation and intersection.

So the question is not whether to make intersection and complementation expressible, but whether to provide convenient syntax for them.

It would be good to have implementation and user experience. If only there were a way to mark a grammar feature as a non-standard extension in an ixml grammar! Then we could gather practical experience with the operator without having to work with non-conforming ixml grammars or processors.

nverwer commented 3 months ago

What @johnlumley describes reminded me of negative look-ahead in regular expressions. When doing a negative look-ahead, the parser first tries to parse the negative look-ahead pattern, and if that fails continues processing from the current character position. If parsing the negative look-ahead pattern succeeds, the expression in which this occurs fails to parse, and the character position is not advanced. The way a regular expression parser does this seems to be efficient; First try to parse the negative look-ahead, and only if that fails parse the rest of the pattern.

As @cmsmcq says, it is possible to specify an intersection with this. Using De Morgan's law "A and B = not(not(A) or not(B))", indeed, in javascript, "abcdcba".match(/(?! (?! [ac]) | (?! [bc]) )c/g) returns ['c', 'c'].

There is a difference between subtraction and negative look-ahead. The pattern /(?![0-9])[0-9A-Za-z]+/ can be used to match a sequence of digits and letters that starts with a letter. The equivalent grammar rule would be:

id : ["a"-"z"; "A"-Z";"0"-"9"]+ except ( ["0"-"9"] , ["a"-"z"; "A"-Z";"0"-"9"]* ) .

(Of course there is a much better way to write this, but that is not the point I am trying to make.) I wonder what this means for efficiency. The subtraction operator subtracts the second sub-language from the first. If the string to parse starts with a digit, we cannot stop parsing after the digit, because rest of the expression. It might be more efficient to use

id : ["a"-"z"; "A"-Z";"0"-"9"]+ except ( ["0"-"9"] , ~[]* ) .

In most cases, this will be irrelevant, and the right-hand-side of except will be a simple expression. As an example, this afternoon I had to find the equivalent for *

PITarget  ::=  Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))

That would be really easy with a subtraction operator.

johnlumley commented 2 weeks ago

Some more thoughts

Meaning

Earlier on I defined the meaning of the construct (with a little more detail) as effectively

the subtraction production A ¬ B completes starting from an input character position x when the first term A completes at character position y (y ≥ x), unless the second term B completes over the same input character range x → y.

I don't think this meaning needs altering, (see all the discussion by others above) but some of its 'edge cases' need a little exploration. See the section below.

The grammar definition

The definition given earlier needs a minor tweak to accommodate optional trailing whitespace, in keeping with the convention in the rest of the grammar:

-factor: terminal;
         nonterminal;
         insertion;
         subtraction;
         -"(", s, alts, -")", s.

subtraction: term, -"¬", s, term.

As it is defined here, however, there is an inherent associativity ambiguity in that:

S: A ¬ B ¬ C.

parses two ways as:

<ixml xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
   <rule name="S">
      <alt>
         <subtraction>
            <nonterminal name="A"/>
            <subtraction>
               <nonterminal name="B"/>
               <nonterminal name="C"/>
            </subtraction>
         </subtraction>
      </alt>
   </rule>
</ixml>
<ixml xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
   <rule name="S">
      <alt>
         <subtraction>
            <subtraction>
               <nonterminal name="A"/>
               <nonterminal name="B"/>
            </subtraction>
            <nonterminal name="C"/>
         </subtraction>
      </alt>
   </rule>
</ixml>

which of course can be suppressed with suitable bracketing, but needs to be resolved. Assuming the first ambiguous parse above would be considered S: A ¬ (B ¬ C) (which is equivalent to S:(A | C) ¬ B) we need to implement left-association.

If we do so, so the expression parses as S: (A ¬ B) ¬ C which is of course equivalent to S:A ¬ (B | C). (Such possible reductions and optimisations can be considered part of the iXML grammar compilation step.)

To do this we need to alter the grammar for subtraction to something equivalent to:

subtraction: term, -"¬", s, (terminal; nonterminal; insertion; -"(", s, alts, -")", s).

which has the disadvantage of repeating part of the construct of factor but has the advantage of not adding another nonterminal to the grammar. (Whether insertion makes sense is considered in the Edge Cases section below.)

The operator symbol

I chose the ¬ symbol because i) it didn't conflict with any other symbol, ii) it's a single keystroke on my keyboard ;-) and iii) it wasn't a word, such as except, as so far all operators have been non-letter character sequences and it will be worth keeping it so if we can.

Of course it isn't correct in its conventional mathematical meaning, where it denotes logical negation, and perhaps we should consider \ (backslash) for set difference as the alternative - is of course used as the serialisation suppression mark. Currently we don't use \ anywhere in the iXML grammar (even for escapes in strings - in fact we don't have direct escape mechanism within a string) so currently there wouldn't be any conflict, though it could be misread as a forward slash, which also we don't use.

Edge cases

The 'normal' use cases for this operator would be typically of the form GeneralCase ¬ SpecificExclusions where the subterms might be expressed as nonterminals or alternate lists of terminals. But what happens when (degenerate?) edge case terms (which @cmsmcq revelled in) are involved? Here I examine a few samples running them on my Earley parser based implementation:

  1. S: ~[]* ¬ "a"., that is anything except a single a. This seems to work fine: abc parses, the empty string parses and a fails.
  2. S: ~[]* ¬ ~[]*., that is anything except anything. This always fails for any input including the empty string. This is what I would expect.
  3. S: "a" ¬ ~[]*., that is "a" except anything. a fails as expected, but there are problems in my implementation such that the empty string parses and ab does too - the 'except' term carries on matching rather than being killed on failure of the left-hand term.
  4. S:"a" ¬ +"b"., that is "a" except an insertion. a parses with no insertion of b. Empty string also parses (to b) incorrectly (again an issue with my Earley parser implementation not killing the except branch which is then treated as a positive option)

There is of course more exploration to be done, but my implementation needs some correction on killing the 'except' terms when the left-hand term fails.

GuntherRademacher commented 2 weeks ago

I strongly agree that an "exclusion" operator would be useful (rather than calling it "subtraction"). The grammar notation used in the XML and XQuery recommendations includes an exclusion operator, and REx implements it at the tokenization level.

In fact, all examples of possible applications that I have seen so far involve entities that would traditionally be tokens in parsing, typically represented by regular descriptions. Since regular languages are closed under difference, it is theoretically possible to transform a language description that uses an exclusion operator into the description of an equivalent language without it. Here, I am focusing on recognition and setting aside issues related to serialization of the result. If such a transformation were available, introducing exclusion would not impact the use of any subsequent algorithms.

In a "scannerless" approach, however, the algorithmic fit of exclusion is far from obvious. Implementing exclusion within current ixml processors would require substantial adjustments, and proving correctness would likely be challenging.

I understand that ixml deliberately chose the scannerless approach. However, introducing a separate tokenization step could, in my opinion, make it easier to implement exclusion and yield additional benefits.

Therefore, I propose a different approach: define an optional tokenization step in ixml, restrict it to non-recursive EBNF without serialization marks, and place the exclusion operator there. This would ensure that exclusion applies only at the low-level recognition stage. Additionally, this approach could offer further advantages, such as implicit whitespace handling, additional methods for disambiguation, and potentially improved performance.

johnlumley commented 2 weeks ago

Thanks @GuntherRademacher. There does seem to be a growing consensus in our discussions that, at least for programming language use cases, such an exclusion/except construct is needed, if only to avoid ambiguity.

In a "scannerless" approach, however, the algorithmic fit of exclusion is far from obvious. Implementing exclusion within current ixml processors would require substantial adjustments, and proving correctness would likely be challenging.

I understand that ixml deliberately chose the scannerless approach. However, introducing a separate tokenization step could, in my opinion, make it easier to implement exclusion and yield additional benefits.

Therefore, I propose a different approach: define an optional tokenization step in ixml, restrict it to non-recursive EBNF without serialization marks, and place the exclusion operator there.

I'm starting to be inclined to agree - the simple cases I've worked with so far, such as

FunctionName: QName ¬ ("if";"element"....).

pragmatically work fine in my (home-built) Earley parser.

But the issues of some of those edge cases outlined above (e.g. S: "a" ¬ ~[]*.) begin to require very much more complex state-propagation killing mechanisms to supress prediction/completion propagation of the right-hand sides on failure of the left. I'll work on some more to see if I'm missing something obvious, but it's unsurprisingly in some of the most delicate ('do not touch unless stone-cold sober') parts of my implementation code.

There are also significant questions about how such semantics could be implemented in other parsing algorithms. Here I defer to others with significant experience in the field of parsers.

I deliberately didn't use exclusion as it is already the term used for the ~[....] construct in iXML, which is effectively any character except ... as the negation of inclusion.

ndw commented 2 weeks ago

Adding a tokenizer would definitely have benefits, potentially large ones in terms of what can be expressed practically and perhaps in terms of performance.

Unfortunately, it makes everything a lot more complicated for the user. They need to have two sets of rules, they need to understand the interaction between them, and they need to come to grips with a problem that's suddenly a lot more complicated than matching some characters in the input.

nverwer commented 2 weeks ago

About a scanner / lexical analyzer:

Adding a tokenizer would definitely have benefits, potentially large ones in terms of what can be expressed practically and perhaps in terms of performance.

Unfortunately, it makes everything a lot more complicated for the user. They need to have two sets of rules, they need to understand the interaction between them, and they need to come to grips with a problem that's suddenly /a lot/ more complicated than matching some characters in the input.

In my talk about parsing R at Declarative Amsterdam, I showed how a tokenizer avoids ambiguity. I solved it by adding a pre-processor step. In the same talk, I also used /transparent/ ixml, to allow and ignore XML elements embedded in the text that I parse. The video will soon be published.

I think the tokenizer step could be implemented separately from ixml. In this way, tokenization could be done in different ways. The tokenization and parsing could then be pipelined in XQuery or XProc, like $document => tokenize() => parse().

In my presentation at XML Prague 2024, I hinted at how I am planning to do this in a project that I am currently working on. In this project, I have many (a few hundred thousand) /named entities/ that are recognized with a trie-based [https://en.wikipedia.org/wiki/Trie] parser. The recognized named entities become XML elements in the document, and I want these to be recognized as "pre-parsed non-terminals". Implementing this will be one of the next steps in my project, but it is rather hard. In this particular case, some named entities will become the content of

elements. (It is an application recognizing references in legal documents.) A element and its content is considered a pre-parsed non-terminal in the ixml grammar. In my XML Prague paper, an ixml rule that recognizes is: reference : regulation-elements, spaces, elements-regulation-connection, spaces, ** . The parser would recognize the element, and report it as the 'regulation' non-terminal. I think that this should be possible using the SMAX representation of the XML document, which I also use for /transparent/ ixml (see my XML Prague paper). I am not sure if the angle-bracket notation would conflict with renaming. Best regards, Nico