invisibleXML / ixml

Invisible XML
GNU General Public License v3.0

We should document the XML tag set that results from parsing an ixml grammar with the ixml specification grammar #137

Open ndw opened 2 years ago

cmsmcq commented 2 years ago

I think that would be a good idea.

One question is: what form should the documentation take? The two possibilities I see are:

[Edit, 9 October. It turns out one can attach an image after all. Here is a screenshot of an RNG schema with embedded TEI-encoded documentation.]

[Image: Screenshot from 2022-10-09 16-26-36]

[For what it's worth, here is the XML form of the section on the tei:c element.]

[Image: Screenshot from 2022-10-09 16-34-00]

cmsmcq commented 1 year ago

We discussed this on the call of 15 November; NDTW suggested that we could avoid having to choose between TEI and DocBook as the basis by doing the entire thing in XHTML, which was accepted as a Solomonic decision. He took an action to build a prototype.

ndw commented 1 year ago

Where are the tools that build the RNG grammar from the ixml grammar? I'm a bit confused by some of the output, for example:

   <rng:define name="e.version">
      <rng:element name="version">
         <rng:ref name="extension-attributes"/>
         <rng:interleave>
            <rng:ref name="extension-elements"/>
            <rng:group>
               <rng:ref name="RS"/>
               <rng:ref name="RS"/>
               <rng:ref name="string"/>
               <rng:ref name="s"/>
            </rng:group>
         </rng:interleave>
      </rng:element>
   </rng:define>

Why is RS duplicated? (And elsewhere, why is s duplicated?)

I wonder if a mechanically generated grammar will ever be simple enough to be usefully documented.

I'm finding this, for example, hard to follow and difficult to imagine documenting:

whitespace = h.whitespace
h.whitespace =
  # alt with no realized children
  empty
  | tab
  | lf
  | cr
tab = h.tab
h.tab =
  # alt with no realized children
  empty

cmsmcq commented 1 year ago

The RNG is generated by running the Gingersnap stylesheet ixml-to-rng.xsl on ixml.xml (using Saxon HE). The RNC is generated from the RNG using trang.
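
For anyone wanting to reproduce the two steps, here is a minimal Python sketch that only assembles the command lines. The Saxon options -s:, -xsl: and -o: are the standard command-line ones; everything else is an assumption made for illustration: the jar name saxon-he.jar, a trang launcher on the PATH, the output names ixml.rng and ixml.rnc, and the helper name build_commands.

```python
import shlex

# Assumption: path to a Saxon HE jar; adjust to the local installation.
SAXON_JAR = "saxon-he.jar"


def build_commands(source="ixml.xml", stylesheet="ixml-to-rng.xsl",
                   rng="ixml.rng", rnc="ixml.rnc"):
    """Return the two command lines (as argument lists) for the pipeline."""
    # Step 1: run the XSLT stylesheet over the XML form of the ixml grammar.
    saxon = ["java", "-jar", SAXON_JAR,
             f"-s:{source}", f"-xsl:{stylesheet}", f"-o:{rng}"]
    # Step 2: convert the RNG to compact syntax (assumes a `trang` launcher).
    trang = ["trang", rng, rnc]
    return [saxon, trang]


if __name__ == "__main__":
    for command in build_commands():
        print(shlex.join(command))
```

To actually run the steps, each list can be passed to subprocess.run(command, check=True).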

The design principle of the transform, for what it is worth, is to stay as close to the structure of the ixml as possible. (That's not an end in itself, but it does help keep the transform simple by eliminating the temptation to perform simplifications of various kinds. RNG is well suited for simplification, so I just shoved responsibility for all simplifications and normalizations onto RNG tools.) The material quoted in Norm's comment may become clearer with (a) consideration of the corresponding ixml and (b) some explanation of the naming conventions used to handle the marks and tmarks of the ixml.

The ixml rule for whitespace is:

-whitespace: -[Zs]; tab; lf; cr.

Since the default marking for whitespace is -, any reference elsewhere in the grammar to whitespace (without a mark) will have the same effect as a reference to -whitespace. It will be, in effect, a reference to hidden whitespace (as opposed to whitespace-as-attribute or whitespace-as-element, which would also be possible). We record that with a definition: whitespace without a prefix means the same as h.whitespace (the prefix h. being used to render the mark - which hides a nonterminal).

whitespace = h.whitespace

The four right-hand sides of the ixml rule for whitespace turn into four disjuncts in the definition of h.whitespace. Since -[Zs] will produce the empty string in the visible-XML grammar, it is rendered with the RNC keyword empty; the three nonterminals are rendered as they appear.

h.whitespace =
  # alt with no realized children
  empty
  | tab
  | lf
  | cr

Next, we come to the definition of tab. The ixml rule is

-tab: -#9.

which says first that an unqualified reference to tab is hidden (so we need to say that tab = h.tab, i.e. by default the nonterminal tab is hidden), and second that this hidden nonterminal will dominate only the empty string (since the terminal -#9 is hidden).

tab = h.tab
h.tab =
  # alt with no realized children
  empty
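
The translation of these two hidden rules can be mimicked in a few lines of Python. This is only an illustrative sketch, not the Gingersnap stylesheet: the function name hidden_rule_definitions and the tuple representation of terms are invented here, it handles only rules whose default mark is -, and it emits nonterminal references exactly as they appear.

```python
def hidden_rule_definitions(name, alternatives):
    """Return RNC-like definitions for an ixml rule whose default mark is '-'.

    alternatives is a list of alternatives; each alternative is a list of
    (kind, mark, text) terms, where kind is "terminal" or "nonterminal".
    """
    qualified = f"h.{name}"  # the prefix h. renders the mark '-'
    disjuncts = []
    for alternative in alternatives:
        # Hidden terminals contribute nothing to the visible XML.
        visible = [text for kind, mark, text in alternative
                   if not (kind == "terminal" and mark == "-")]
        disjuncts.append(", ".join(visible) if visible else "empty")
    return [f"{name} = {qualified}",
            f"{qualified} = " + " | ".join(disjuncts)]


# -whitespace: -[Zs]; tab; lf; cr.
WHITESPACE = [
    [("terminal", "-", "[Zs]")],
    [("nonterminal", "", "tab")],
    [("nonterminal", "", "lf")],
    [("nonterminal", "", "cr")],
]
# -tab: -#9.
TAB = [[("terminal", "-", "#9")]]

if __name__ == "__main__":
    for line in hidden_rule_definitions("whitespace", WHITESPACE):
        print(line)
    for line in hidden_rule_definitions("tab", TAB):
        print(line)
```

Apart from line breaks and the generated comments, the printed definitions have the same shape as the RNC quoted above.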

I hope it is now slightly easier to follow.

It is probably not any easier to imagine documenting it, but since none of the nonterminals involved here turn into elements or attributes in the visible-XML form of the grammar, I don't think it needs to be documented. What need to be documented are the elements and attributes that can appear in an ixml grammar written in XML, and their content models. Because hidden nonterminals can easily get in the way and make the schema harder to understand, it may be best to start from a version of the schema in which many of the definitions have been expanded in place. I spent some time fiddling with Eric van der Vlist's sequence of XSLT stylesheets, which implement the simplification / rewriting rules of the RNG spec, before discovering that the simplest way to get a schema in a form suitable for this kind of documentation is to use the undocumented -s option to Jing:

jing -s my-schema.rng > my-schema.simplified.rng 

I would show you the simplified form of the bit quoted above, if it existed, but in the simplified form of the schema, the whitespace nonterminal / definition has disappeared entirely. Which is, I think, as it should be.

However, the definition of version shown does look a bit troubling. What's with the double RS?

The ixml spec grammar specifies:

version: -"ixml", RS, -"version", RS, string, s, -'.' .

The three hidden terminals disappear in the XML, so the right-hand side turns into RS, RS, string, s. If all of these were visible nonterminals, the ixml grammar would require that there be two RS elements before the string element; as it happens, RS is hidden, as are all of its potential descendants except for comment. The Jing-simplified version of ixml.rnc defines version as:

version =
  element version {
    attribute * - local:* { text }*,
    (_1*
     & ((comment?)+,
        (comment?)+,
        attribute string { text },
        (comment?)+))
  }

where _1 is a nameless definition for extension elements, and the double RS is visible in (comment?)+, (comment?)+.
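
The step from the ixml rule for version to RS, RS, string, s can be checked mechanically. The sketch below is a deliberately naive illustration: the helper name visible_terms is invented here, and splitting the right-hand side on commas works only because none of the quoted strings in this rule contains a comma.

```python
def visible_terms(rhs):
    """Drop hidden terminals from a comma-separated ixml right-hand side."""
    terms = [term.strip() for term in rhs.split(",")]
    # A hidden terminal is '-' followed by a quoted string, an encoded
    # character (#...) or a character set ([...]).
    return [term for term in terms
            if not (term.startswith("-") and term[1:2] in ('"', "'", "#", "["))]


VERSION_RHS = """-"ixml", RS, -"version", RS, string, s, -'.'"""

if __name__ == "__main__":
    print(visible_terms(VERSION_RHS))  # ['RS', 'RS', 'string', 's']
```

The two RS entries that survive are the ones that show up as the two rng:ref elements in the generated schema.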

Hmm. This is OK for me to eyeball when I am trying to see which children an element can have, but maybe some further simplification in the content model would help. It would be nice if the documentation could say the content model of the version element is something like

version = element version { 
    foreign-attributes*, 
    (foreign-elements* 
    & 
    (comment*, string, comment*)) 
}

or even something simpler which just ignores the foreign attributes and foreign elements (e.g. by starting the simplification with ixml-strict.rng, not ixml.rng).
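
One way to gain some confidence that the hand-simplified model accepts the same sequences as the Jing output is a brute-force comparison. The sketch below rests on simplifying assumptions: it ignores foreign attributes and foreign elements, treats string as if it were an ordered item (in RELAX NG, attributes are unordered), and encodes each item as one letter so that Python regular expressions can stand in for the content models. The names JING, SIMPLE and same_language are invented here.

```python
import itertools
import re

# c = a comment child, s = the string item, x = any other item
JING = re.compile(r"(?:c?)+(?:c?)+s(?:c?)+")  # (comment?)+, (comment?)+, string, (comment?)+
SIMPLE = re.compile(r"c*sc*")                 # comment*, string, comment*


def same_language(max_len=6, alphabet="csx"):
    """Compare the two models on every sequence up to max_len items."""
    for length in range(max_len + 1):
        for letters in itertools.product(alphabet, repeat=length):
            word = "".join(letters)
            if bool(JING.fullmatch(word)) != bool(SIMPLE.fullmatch(word)):
                return False
    return True


if __name__ == "__main__":
    print(same_language())
```

An exhaustive check up to a fixed length is evidence, not a proof; here the equivalence also follows directly from the fact that (comment?)+ and comment* accept the same sequences.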

But for that we appear to need to go back to Eric's transforms and figure out which ones will give us something closer to what we want. (Unless we can find a different undocumented option to Jing.)

I hope this helps.