invisibleXML / ixml

Invisible XML
GNU General Public License v3.0

Dynamic naming / name from the input data #168

Open cmsmcq opened 1 year ago

cmsmcq commented 1 year ago

Issue 13 suggests allowing different nonterminals to be serialized with the same element or attribute name, in a way that allows the expected name to be determined by inspection of the grammar.

Experience with ixml grammars for parsing XML suggests it may be helpful to contemplate allowing elements and attributes to carry names given in (or more generally derived from) the input stream.

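Other use cases:

- parsing LaTeX and turning \begin{x} ... \end{x} into an element named x. Or more generally, parsing the input language for any similar document formatter (Script, runoff, roff, ...)
- parsing comma-delimited data streams with header lines

More use cases would be helpful.

Two observations may be in order:

- This can in fact be handled fairly easily by a downstream transformation (as can the functionality of issue 13).
- It's easy to imagine specifying the dynamic name using some expression language; that expression language could easily turn into a very steep slippery slope. It may be possible to avoid sliding down that slope by restricting ourselves to expressions with the meaning "the string value of [some specific instance of some specific nonterminal]"; the complexity is then restricted to the problem of pointing to specific instances of specific nonterminals.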

LdBeth commented 1 year ago

I have started trying out IXML and encountered similar problem of embedding some markup in the input that can be converted to XML tags in the final form.

I suppose XML with an unconstrained set of tags cannot be described by a CFG. Actually, if the input is structured (i.e. can be handled by the grammar below), it is not necessary for the closing tag to carry a tag name (it can be suppressed by specifying -closetag).

line: (text; node)+ .
node: @opentag, text, @closetag.

If one wants to verify that the input is well-formed XML, or “proper” LaTeX markup, that would likely be done in an XSLT post pass.
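As a rough illustration of such a post pass (in Python rather than XSLT), here is a minimal sketch. It assumes the ixml output has the shape the grammar above suggests, with each node element carrying opentag and closetag attributes that hold bare names. The function name rename_nodes and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET


def rename_nodes(root):
    """Rename each <node> after its opentag attribute.

    Raises ValueError when opentag and closetag disagree, which is the
    well-formedness check the grammar itself cannot express.
    """
    for node in root.iter("node"):
        open_name = node.attrib.pop("opentag")
        close_name = node.attrib.pop("closetag")
        if open_name != close_name:
            raise ValueError(f"mismatched tags: {open_name} / {close_name}")
        node.tag = open_name
    return root


# Hypothetical ixml output for an input line containing one marked-up span.
sample = '<line>Hello <node opentag="em" closetag="em">world</node>!</line>'
result = ET.tostring(rename_nodes(ET.fromstring(sample)), encoding="unicode")
print(result)
```

The same check could equally be written as an XSLT template; the point is only that both the renaming and the matching of open and close names happen after parsing.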

I think that leads to the demand for an implementation-defined extension to IXML that adds source locations to the resulting XML, so that for any serious markup development the result can be used as a concrete syntax tree.

Now, in certain markup designs, the closing tag is optional or implicit. One example is GML, which influenced HTML to allow the closing </p> to be omitted. In such cases IXML is more like a lexer, and further parsing requires the power of XSLT.

spemberton commented 1 year ago

I think this sort of mail is better sent to the working group rather than as a response to a github issue.

Best wishes,

Steven

On Wednesday 11 January 2023 06:42:24 (+01:00), LdBeth wrote:

I have started trying out IXML and encountered similar problem of embedding some markup in the input that can be converted to XML tags in the final form.

I suppose XML with an unconstrained set of tag is not covered by CFG. Actually, if the input is structured (i.e. can be handled by grammar below) it is not necessary to have the closing tag carrying a tag name (by specify -closetag).

line: (text; node)+ .
node: @opentag, text, @closetag

If one wants to verify input is well formed XML, or a “proper” LaTeX markup, it would likely be done in a XSLT post pass.

I think that leads to the demand for an implementation definable extension to IXML that adds the source location to the resulting XML, that for any serious markup development this can be used as concrete syntax tree.

Now, for certain markup design, the closing tag is optional or implicit. One example is GML which influenced HTML to allow the closing </p> become optional. In such cases IXML is more like a lexer and further parsing requires the power of XSLT.


LdBeth commented 1 year ago

Thanks, forwarded to the mailing list.

ldb

Steven Pemberton wrote:

I think this sort of mail is better sent to the working group rather than as a response to a github issue.

Best wishes,

Steven

spemberton commented 1 year ago

On Tuesday 13 December 2022 17:39:02 (+01:00), C. M. Sperberg-McQueen wrote:

Issue 13 suggests allowing different nonterminals to be serialized with the same element or attribute name, in a way that allows the expected name to be determined by inspection of the grammar.

Yes, issue 13 addresses the problem of naming in the serialisation being bound to the input syntax.

date: day, s, textmonth, s, year;
      day, -"/", month, -"/", year.
day: d, d.
month: d, d.
year: d, d, d, d.
textmonth: -"January", +"01";
           -"February", +"02";
           ...
           -"December", +"12".
-s: -" "+.

Issue 13 is about recognising that the input syntax is different, but the output serialisation is the same. It is only about static renaming.
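To make "static renaming" concrete, here is a minimal Python sketch of the same effect done as a downstream step. It assumes that the grammar above, given the input "12 January 2023", serializes the month as a textmonth element containing "01"; the mapping RENAMES and the function name are hypothetical. The key property is that the mapping is fixed in advance and never consults the input.

```python
import xml.etree.ElementTree as ET

# Static renaming: the table is known from the grammar alone.
RENAMES = {"textmonth": "month"}


def apply_static_renames(root, renames=RENAMES):
    """Rename elements according to a fixed table, leaving others alone."""
    for element in root.iter():
        element.tag = renames.get(element.tag, element.tag)
    return root


# Assumed ixml output for the textual date syntax.
textual = "<date><day>12</day><textmonth>01</textmonth><year>2023</year></date>"
result = ET.tostring(apply_static_renames(ET.fromstring(textual)), encoding="unicode")
print(result)
```

After the rename, both input syntaxes yield a date with day, month, and year children.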

Experience with ixml grammars for parsing XML suggests it may be helpful to contemplate allowing elements and attributes to carry names given in (or more generally derived from) the input stream.

Dynamic renaming is a whole other kettle of fish, and once you add variables, you open a whole can of worms. Just suggesting it puts us on the slippery slope already, and should be approached with care. The end of the slope is Turing-completeness, and is reached very quickly. But for people interested, take a look at Affix Grammars, which address the issue.

https://en.wikipedia.org/wiki/Affix_grammar in particular the section headed Types.

Steven

Other use cases:

- parsing LaTeX and turning \begin{x} ... \end{x} into an element named x. Or more generally, parsing the input language for any similar document formatter (Script, runoff, roff, ...)
- parsing comma-delimited data streams with header lines

More use cases would be helpful.

Two observations may be in order:

- This can in fact be handled fairly easily by a downstream transformation (as can the functionality of issue 13).
- It's easy to imagine specifying the dynamic name using some expression language; that expression language could easily turn into a very steep slippery slope. It may be possible to avoid sliding down that slope by restricting ourselves to expressions with the meaning "the string value of [some specific instance of some specific nonterminal]"; the complexity is then restricted to the problem of pointing to specific instances of specific nonterminals.
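For the comma-delimited use case, the downstream transformation might look like the following Python sketch. The XML shape (a csv element with a header of name elements and rows of field elements) is an assumption about what an ixml grammar for such data could produce, and name_fields_from_header is a hypothetical name. Each field takes its name from "the string value of" the header name in the same position.

```python
import xml.etree.ElementTree as ET


def name_fields_from_header(root):
    """Rename each <field> after the header <name> in the same column.

    Assumes header values are already valid XML names.
    """
    names = [name.text for name in root.findall("./header/name")]
    for row in root.findall("./row"):
        for field, name in zip(row.findall("field"), names):
            field.tag = name
    return root


# Hypothetical ixml output for a two-column stream with one data row.
sample = (
    "<csv><header><name>id</name><name>city</name></header>"
    "<row><field>1</field><field>Oslo</field></row></csv>"
)
result = ET.tostring(name_fields_from_header(ET.fromstring(sample)), encoding="unicode")
print(result)
```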


cmsmcq commented 1 year ago

I'm finding it a little hard to follow Steven's comments here, given that either his mail user agent seems not to distinguish in any visible way between quoted material and new material, or some process in the middle (maybe Github?) is stripping the distinction out. Github's refusal to allow post-hoc editing of email comments on an issue also doesn't help. In case Github is the culprit, I am sending this mail to public-ixml as well as to the reply address on Steven's mail.

Steven Pemberton writes:

On Tuesday 13 December 2022 17:39:02 (+01:00), C. M. Sperberg-McQueen wrote:

Issue 13 suggests allowing different nonterminals to be serialized with the same element or attribute name, in a way that allows the expected name to be determined by inspection of the grammar.

Yes, issue 13 addresses the problem of naming in the serialisation being bound to the input syntax.

I'm not quite sure what "bound to the input syntax" means. Or rather, I guess it must mean that the names used in the serialization can be found by inspecting the ixml grammar without reference to the input (although my internal parser for English doesn't quite see how that meaning emerges from those words). So, so far we seem to be in agreement on the subject of issue #13.

Issue 13 is about recognising that the input syntax is different, but the output serialisation is the same. It is only about static renaming.

Experience with ixml grammars for parsing XML suggests it may be helpful to contemplate allowing elements and attributes to carry names given in (or more generally derived from) the input stream.

Dynamic renaming is a whole other kettle of fish,

Good. We are in agreement again: issue #168 and issue #13 are usefully distinct.

and once you add variables, you open a whole can of worms.

I'm not sure anyone has suggested variables, but I agree that adding them has great potential for worminess.

Just suggesting it puts us on the slippery slope already, and should be approached with care. The end of the slope is Turing-completeness, and is reached very quickly.

I'm glad to see you agree.

But for people interested, take a look at Affix Grammars, which address the issue.

https://en.wikipedia.org/wiki/Affix_grammar in particular the section headed Types.

Thank you for the pointer.

Since it may not be obvious to all readers how VW grammars would be used to implement dynamic naming, perhaps it would be helpful to have a worked example showing how two-level grammars could make this work. I append one to this mail.

Steven is right, I think, to suggest that VW grammars show very convincingly that Turing completeness may be achieved with great economy of mechanism (which in turn means that any mechanism we invent might end us at the bottom of that fabled slippery slope). But I do not suggest VW grammars as a solution to the use cases described here; I think these use cases can be supported by mechanisms which are weaker and easier to work with than VW grammars (easier to work with both for grammar writers and for processor developers).

Example

Consider an ixml grammar for a simple approximation of XML, similar in spirit to, but simpler than, the one given in the paper on pragmas by Hillman et al. in the proceedings of last year's Balisage [1]. Unlike that one, the grammar I have in mind omits attributes, comments, and processing instructions. It would recognize input like the following:

<haiku>
  <author>Basho</author>
  <date>1686</date>

  <l>When the old pond</l>
  <l>gets a new frog</l>
  <l>it's a new pond.</l>
</haiku>

But also

<haiku>
  <author>Basho</author>
  <date>1686</date>

  <line>When the old pond</line>
  <line>gets a new frog</line>
  <line>it's a new pond.</line>
</uhuru>

And of course it does not generate XML that looks like its input.

[1] https://balisage.net/Proceedings/vol27/html/Sperberg-McQueen01/BalisageVol27-Sperberg-McQueen01.html#d9306e986

If we imagine a parser for VW grammars which works like an ixml processor in serializing a parse tree (specifically the parse tree against the first-level context-free grammar generated by the two-level input grammar), then I believe Steven must have some mechanism roughly similar to the following in mind.

First, the VW grammar has hyperrules, which by convention use :: to separate left- and right-hand sides, and no commas between terms.

{ NAME will be used for nonterminals }
NAME :: LETTER; LETTER NAMECHARS.
NAMECHARS :: NAMECHAR; NAMECHAR NAMECHARS.
LETTER :: a; b; c; d; ... ; z.
DIGIT :: 0; 1; 2; ... ; 9.
NAMECHAR :: LETTER; DIGIT; _.

Note that NAME defines an infinite set of strings like a, ab, abc, aaa994, l, haiku, and so on. As does NAMECHARS (which includes additional strings like 994 and _994).
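As a quick cross-check of which strings belong to which set, the two definitions can be approximated with regular expressions. This is only a sketch: it reads NAMECHARS as "one or more NAMECHAR", as the examples in the text imply, and the helper names is_name and is_namechars are hypothetical.

```python
import re

# NAME :: LETTER; LETTER NAMECHARS.  (a letter, then any name characters)
NAME = re.compile(r"[a-z][a-z0-9_]*")
# NAMECHARS read as one or more of LETTER, DIGIT, or underscore.
NAMECHARS = re.compile(r"[a-z0-9_]+")


def is_name(text):
    """True when text is one of the strings NAME stands for."""
    return NAME.fullmatch(text) is not None


def is_namechars(text):
    """True when text is one of the strings NAMECHARS stands for."""
    return NAMECHARS.fullmatch(text) is not None


for candidate in ["a", "ab", "abc", "aaa994", "l", "haiku", "994", "_994"]:
    print(candidate, is_name(candidate), is_namechars(candidate))
```

Strings such as 994 and _994 are accepted by is_namechars but rejected by is_name, because a NAME must begin with a letter.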

Second, the VW grammar has metarules, which can be thought of as patterns for rules in a context-free grammar, made up of fragments of a conventional context-free grammar and hypernotions (the things defined by hyperrules). I'm going to use ixml syntax for the meta-rules, more or less, and to simplify my own life I'm not going to try to rewrite our quoted literals and other terminals using the 'letter x' convention.

document: ws?, element, ws? .
-element: NAME.
NAME: starttag.NAME, content, endtag.NAME; soletag.NAME .
-starttag.NAME:  -"<", gi.NAME, ws?, -">".
-endtag.NAME:  -"</", gi.NAME, ws?, -">".
-soletag.NAME:  -"<", gi.NAME, ws?, -"/>".
-content: pcdata?, (element**pcdata, pcdata?)?.
-pcdata:  (~["<>&"]; "&amp;"; "&lt;"; "&gt;"; "&apos;"; "&quot;")+.
-ws:  -(#20; #A; #C; #9)+.

-gi.NAMECHAR = letter NAMECHAR.
-gi.NAMECHAR NAMECHARS = letter NAMECHAR, gi.NAMECHARS.

Note that in this imaginary VW-flavored ixml, whitespace in nonterminals is ignored. So the last meta-rule is syntactically OK, not an error.

Note also that by convention (or fiat), a symbol of the form 'letter' + anything appearing in the generated context-free grammar denotes a terminal symbol, just as in ixml a character enclosed in quotation marks denotes a terminal symbol.

Here, the second metarule defines an infinite number of rules in the first-level context-free grammar, including

-element: a.
-element: ab.
-element: abc.
-element: aaa994.
-element: l.
-element: haiku.

The third rule similarly generates an infinite number of rules, including:

a: starttag.a, content, endtag.a; soletag.a .
ab: starttag.ab, content, endtag.ab; soletag.ab .
abc: starttag.abc, content, endtag.abc; soletag.abc .
aaa994: starttag.aaa994, content, endtag.aaa994; soletag.aaa994 .
l: starttag.l, content, endtag.l; soletag.l .
haiku: starttag.haiku, content, endtag.haiku; soletag.haiku .

Note that when a first-level rule is generated from the third meta-rule, NAME is replaced by the same string in all occurrences. So the third meta-rule does NOT generate anything like

{NOT} haiku: starttag.a, content, endtag.ab; soletag.aaa994 . {NOT}

The fourth metarule, meanwhile, generates rules like these:

-starttag.ab:  -"<", gi.ab, ws?, -">".
-starttag.aaa994:  -"<", gi.aaa994, ws?, -">".
-starttag.haiku:  -"<", gi.haiku, ws?, -">".
-starttag.l:  -"<", gi.l, ws?, -">".

The nonterminals gi.ab, gi.aaa994, gi.haiku, and gi.l rely on first-level rules which are generated by the last two meta-rules. Each of these meta-rules generates an infinite number of first-level rules, including the following, which are important for the derivation of the well-formed example above:

-gi.haiku = letter h, gi.aiku.
-gi.aiku = letter a, gi.iku.
-gi.iku = letter i, gi.ku.
-gi.ku = letter k, gi.u.
-gi.u = letter u.

-gi.l = letter l.
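The pattern in these rules is mechanical: one rule per suffix of the name, each consuming one letter. A small Python sketch (the function name gi_rules is hypothetical, and the rule text simply mimics the notation used in this example) can generate them for any given name:

```python
def gi_rules(name):
    """Return the first-level gi rules needed to spell out name.

    One rule is produced per non-empty suffix of name; the last rule
    consumes the final letter and refers to no further nonterminal.
    """
    rules = []
    for index, letter in enumerate(name):
        suffix = name[index:]
        rest = name[index + 1:]
        if rest:
            rules.append(f"-gi.{suffix} = letter {letter}, gi.{rest}.")
        else:
            rules.append(f"-gi.{suffix} = letter {letter}.")
    return rules


for rule in gi_rules("haiku") + gi_rules("l"):
    print(rule)
```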

End of example.

-- C. M. Sperberg-McQueen Black Mesa Technologies LLC http://blackmesatech.com

cmsmcq commented 1 year ago

"C. M. Sperberg-McQueen" writes:

I see that towards the end of my preceding comment I lost focus enough that I failed to say explicitly some things that probably should be made explicit.

First: the VW grammar given works, on the given input, to produce XML of the same form as the input because the 'haiku' element is recognized by a nonterminal named 'haiku', the 'l' or 'line' elements by a nonterminal named 'l' or 'line', and so on.

The crucial idea is that the VW grammar given as input grammar generates an (infinite) ixml grammar which is used to parse the input string. In practice, parsers for VW grammars generate a finite subset of the infinite ixml grammar sufficiently large to handle the input.

To produce an element with a given name N, the requirement is to generate a grammar in which a nonterminal named N generates the desired element, and similarly also for attributes. When the same name may be used for both elements and attributes, some indirection and possibly some cleverness in writing the grammar will be required. Since VW grammars are Turing complete, there is guaranteed to be a way, but it is not guaranteed to be pretty.

Second: the specific finite subset needed to parse the input will vary with the input. Consider the following sample input:

<haiku>
  <author>Basho</author>
  <date>1686</date>

  <l>When the old pond</l>
  <l>gets a new frog</l>
  <l>it's a new pond.</l>
</haiku>

One of the infinite grammar's sufficiently large subsets is given below.

-document: ws?, element, ws? .
-element: haiku.
-element: author.
-element: date.
-element: l.
haiku: starttag.haiku, content, endtag.haiku; soletag.haiku .
author: starttag.author, content, endtag.author; soletag.author .
date: starttag.date, content, endtag.date; soletag.date .
l: starttag.l, content, endtag.l; soletag.l .
-starttag.haiku:  -"<", gi.haiku, ws?, -">".
-starttag.author:  -"<", gi.author, ws?, -">".
-starttag.date:  -"<", gi.date, ws?, -">".
-starttag.l:  -"<", gi.l, ws?, -">".
-endtag.haiku:  -"</", gi.haiku, ws?, -">".
-endtag.author:  -"</", gi.author, ws?, -">".
-endtag.date:  -"</", gi.date, ws?, -">".
-endtag.l:  -"</", gi.l, ws?, -">".

-content: pcdata?, (element**pcdata, pcdata?)?.
-pcdata:  (~["<>&"]; "&amp;"; "&lt;"; "&gt;"; "&apos;"; "&quot;")+.
-ws:  -(#20; #A; #C; #9)+.

-gi.haiku = letter h, gi.aiku.
-gi.aiku = letter a, gi.iku.
-gi.iku = letter i, gi.ku.
-gi.ku = letter k, gi.u.
-gi.u = letter u.
-gi.author = letter a, gi.uthor.
-gi.uthor = letter u, gi.thor.
-gi.thor = letter t, gi.hor.
-gi.hor = letter h, gi.or.
-gi.or = letter o, gi.r.
-gi.r = letter r.
-gi.date = letter d, gi.ate.
-gi.ate = letter a, gi.te.
-gi.te = letter t, gi.e.
-gi.e = letter e.
-gi.l = letter l.
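How a processor might arrive at such a subset can be sketched in Python. This is not how any particular VW parser works; it is a simplification that assumes the names needed are exactly the tag names appearing in the input, and it covers only the rule patterns listed above (the soletag and gi rules are omitted). TEMPLATES, names_in, instantiate, and subset_rules are hypothetical names.

```python
import re

# Rule patterns in which NAME is a placeholder for one concrete name.
TEMPLATES = [
    "-element: NAME.",
    "NAME: starttag.NAME, content, endtag.NAME; soletag.NAME .",
    '-starttag.NAME:  -"<", gi.NAME, ws?, -">".',
    '-endtag.NAME:  -"</", gi.NAME, ws?, -">".',
]


def names_in(text):
    """Tag names in order of first appearance in the input."""
    seen = []
    for name in re.findall(r"</?([a-z][a-z0-9_]*)", text):
        if name not in seen:
            seen.append(name)
    return seen


def instantiate(template, name):
    """Replace every occurrence of NAME with the same string."""
    return template.replace("NAME", name)


def subset_rules(text):
    """Instantiate each pattern once for each name found in text."""
    names = names_in(text)
    return [instantiate(t, n) for t in TEMPLATES for n in names]


sample = """<haiku>
  <author>Basho</author>
  <date>1686</date>
  <l>When the old pond</l>
</haiku>"""
rules = subset_rules(sample)
print("\n".join(rules))
```

Because instantiate substitutes the same string everywhere in a pattern, it can never produce a rule that pairs starttag.haiku with endtag.author, which is the consistent-substitution property that makes the two-level grammar enforce matching tags.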

In writing it, I have used a modified ixml syntax. As given, the grammar violates the ixml spec's rule against multiple definitions of the same nonterminal symbol; when the same nonterminal is defined multiply (as for 'element'), each definition is an alternative. So 'element' could also be defined thus:

-element: haiku; author; date; l.

The grammar just given also uses a mixture of ixml and VW conventions for terminal symbols. Each occurrence of 'letter X' for any X could be written as a quoted string literal, so the final rule would be:

-gi.l = 'l'.

I leave reformulation of the grammar in pure conformant ixml as an exercise for the reader.

Third: to make the behavior of an affix grammar reliably predictable, some grammar writers take care to place hypernotions in metarules next to characters which won't occur in the hypernotions. In the following metarule, the VW grammar given earlier uses '.' as a sort of delimiter between the hypernotion NAME and the rest of the nonterminal of which it forms a part.

-starttag.NAME:  -"<", gi.NAME, ws?, -">".

In attribute grammars, a similar simplification is achieved by making inherited and synthesized attributes syntactically distinct from the nonterminals they decorate. My limited experience with attribute and affix grammars is that attribute grammars are much easier to write, read, understand, and reason about than unrestricted affix grammars. I suspect (although I cannot offer any argument) that attribute grammars are easier to constrain in ways that limit their expressive power than affix grammars, and that we have a better hope of avoiding the slippery slope towards a Turing-complete grammar formalism if we think about mechanisms for this use case in terms of highly restricted attribute grammars than if we think about them in terms of affix grammars.

Michael

-- C. M. Sperberg-McQueen Black Mesa Technologies LLC http://blackmesatech.com

spemberton commented 1 year ago

Yes, issue 13 addresses the problem of naming in the serialisation being bound to the input syntax.

I'm not quite sure what "bound to the input syntax" means.

In the example I gave I tried to illustrate it: there is a rule "month" which has a numeric syntax and a rule "textmonth" which has a textual syntax. Both have the same (numeric) serialisation, but you don't have the ability to rename the textmonth element, because its name is "bound to the input syntax", not the output syntax.

Steven