https://github.com/invisibleXML/ixml/blob/master/samples/URI/rfc-3987.ixml

spemberton commented 1 year ago

Although this is straight out of the RFC, it is not good enough for proper use. HEXDIG should include "a"-"f"; ipchar, iunreserved and ucschar should have a "-" before the rule. The grammar is ambiguous, but that needs work to investigate (on it).

spemberton commented 1 year ago

Also unused rules:

**** Unused rules: {"CR"; "DQUOTE"; "IRI-reference"; "LF"; "SP"; "absolute-IRI"; "ipath"; "reserved"}

ndw commented 1 year ago

Regarding "a"-"f", the ABNF doesn't include the lowercase versions, but the relevant part of RFC 2234 is apparently:

   NOTE:     ABNF strings are case-insensitive and
             the character set for these strings is us-ascii.

So all quoted strings in the ABNF form have to be changed to support mixed case.
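One way that change could be mechanised is sketched below in Python. It assumes each letter of an ABNF literal is rewritten as a two-member ixml character set covering both cases, while non-letters stay as quoted strings. The helper name `abnf_literal_to_ixml` is hypothetical and not part of any ixml tool.

```python
def abnf_literal_to_ixml(literal: str) -> str:
    """Rewrite the content of an ABNF quoted string as an ixml sequence
    that matches the same text in any mixture of upper and lower case."""
    terms = []
    for ch in literal:
        if ch.isascii() and ch.isalpha():
            # One character set per letter, holding both cases.
            terms.append(f'["{ch.upper()}{ch.lower()}"]')
        else:
            # Digits and punctuation have no case; keep them as literals.
            terms.append(f'"{ch}"')
    return ", ".join(terms)


if __name__ == "__main__":
    # "v" is the literal that introduces IPvFuture in the RFC grammar.
    print(abnf_literal_to_ixml("v"))
    print(abnf_literal_to_ixml("A"))
```

For HEXDIG itself a single set such as ["0"-"9"; "A"-"F"; "a"-"f"] is simpler than converting each alternative separately.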

spemberton commented 1 year ago

Removing unused rules, more rules become unused:

Unused rules: {"IRI-reference"; "gen-delims"}
Unused rule: {"irelative-ref"}
Unused rule: {"irelative-part"}
Unused rule: {"ipath-noscheme"}
**** Unused rule: {"isegment-nz-nc"}
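The cascade can be found in one pass by computing reachability from the chosen start symbol instead of removing rules round by round. The Python sketch below assumes the grammar has already been reduced to a map from each rule name to the nonterminals it references. The `TOY` table is a heavily simplified, hypothetical excerpt, not the real rule bodies, and `unreachable` is an illustrative name.

```python
def unreachable(rules: dict[str, set[str]], start: str) -> set[str]:
    """Return the rule names that cannot be reached from `start`."""
    seen: set[str] = set()
    stack = [start]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        # Follow every nonterminal referenced by this rule.
        stack.extend(rules.get(name, ()))
    return set(rules) - seen


# Hypothetical, simplified reference table (assumption: most references omitted).
TOY = {
    "IRI": {"ihier-part"},
    "ihier-part": set(),
    "IRI-reference": {"IRI", "irelative-ref"},
    "irelative-ref": {"irelative-part"},
    "irelative-part": {"ipath-noscheme"},
    "ipath-noscheme": {"isegment-nz-nc"},
    "isegment-nz-nc": set(),
}

if __name__ == "__main__":
    print(sorted(unreachable(TOY, "IRI")))
    print(sorted(unreachable(TOY, "IRI-reference")))
```

In this toy table the whole chain below IRI-reference is unreachable when IRI is the root, and nothing is unreachable when IRI-reference is the root, so the result depends entirely on which start symbol is chosen.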

spemberton commented 1 year ago

One source of ambiguity is:

The input from line.pos 9.8 to 9.19 can be interpreted as 'ihost' in 2 different ways:
1: ihost[9.8:]: IPv4address[:9.19]
2: ihost[9.8:]: ireg-name[:9.19]

This is because "192.168.0.org" is a valid ireg-name, and they don't bother to discern.

That is, "192.168.0.0" matches ireg-name anyway.

And that is because they are lazy and don't discern subdomains, just allowing a host to be any mixture of ALPHA | DIGIT | "-" | "." | "_" | "~" | ucschar. (which I believe isn't syntactically valid)
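The overlap is easy to reproduce outside an ixml processor. The Python sketch below uses two regular expressions as stand-ins: one for IPv4address, and an ASCII-only approximation of ireg-name (unreserved, pct-encoded and sub-delims characters, with ucschar left out). Both patterns and the `readings` helper are assumptions made for illustration, and the IP-literal alternative of ihost is ignored.

```python
import re

# dec-octet: 0-255 without leading zeros.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4 = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")

# ASCII-only approximation of ireg-name (assumption: ucschar omitted).
IREG_NAME = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=]|%[0-9A-Fa-f]{2})*")


def readings(host: str) -> list[str]:
    """List which of the two alternatives match the whole host string."""
    found = []
    if IPV4.fullmatch(host):
        found.append("IPv4address")
    if IREG_NAME.fullmatch(host):
        found.append("ireg-name")
    return found


if __name__ == "__main__":
    for host in ("192.168.0.0", "192.168.0.org"):
        print(host, readings(host))
```

Any string that matches the IPv4 pattern also matches the ireg-name pattern, which is why every dotted-quad host gets two parses.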

spemberton commented 1 year ago

Regarding "a"-"f", the ABNF doesn't include the lowercase versions, but the relevant part of RFC 2234 is apparently:
   NOTE:     ABNF strings are case-insensitive and
             the character set for these strings is us-ascii.
So all quoted strings in the ABNF form have to be changed to support mixed case.

I believe all other parts of the grammar already support mixed case.

ndw commented 1 year ago

It might be useful to write test cases against the sample grammars. My processor in --pedantic mode would have flagged the unused productions, I think.

spemberton commented 1 year ago

Commenting out the use of ipv4 in ihost makes all my test examples (not a huge number) unambiguous.

cmsmcq commented 1 year ago

Some of the suggestions in this issue seem to me to make sense; others do not.

Our judgement may depend on what we think the purpose of the exercise is. My goal was an ixml translation of the grammar in the RFC, with marks to make the XML nicer (for some subjective judgement of 'niceness'). I did not think the goal was to suggest improvements to the normative grammar in the RFC.

I don't object in principle to a sample grammar that deviates in well defined ways from the normative grammar for the language in question, but I think it needs to be strongly motivated and the deviations clearly explained. If we think, for example, that the ixml grammar would be more useful if we made host and ihost unambiguous, or if ireg-name were defined as

ireg-name = label ++ ".".
label = ...

or as

ireg-name = (sub-domain ** ".", ".")?, TLD.
sub-domain = label.
TLD = label.
-label = ...

then we can do so, but we need to explain (first to each other and then to the public) why we think that's more helpful and what class of domain names will be grammatical in the normative grammar but ungrammatical in ours, or vice versa, and why we think deviating from the normative spec for those domain names will probably not matter in practice. So far, I haven't seen any reason to change my understanding of the goal of these grammars.

Eliminating unused nonterminals

I am reluctant to do this, at least for some nonterminals.

Both RFC 3986 and RFC 3987 use the same set of production rules to define multiple objects. Implicitly, they each define multiple grammars with distinct root symbols and the same set of production rules. Which start symbol you use depends on what you are trying to do. For that reason, I am reluctant to remove (say) the definition for IRI-reference or absolute-IRI (or even ipath) from the grammar.

I don't have a very strong opinion about the low-level rules imported from RFC 2234. The spec explicitly imports the rules shown, and my recollection is that I kept them all even though some of them are not actually used, because that seemed to me a more accurate reflection of the RFC. Retaining them seems less important to me than retaining the alternative roots for the grammar. I don't, however, see that they do anyone any harm.

Hiding low-level nonterminals

Agreed.

It would probably be better to make IRI-reference the start symbol for the IRI grammar, analogous to the choice made in the translation of 3986. It's clear, looking at the grammars, that when I did the translation I spent more time tweaking 3986. More generally, I think it would be helpful to align the two grammars better by using IRI-reference as the root and hiding the productions SP mentions (and any others which correspond to nonterminals hidden in the 3986 grammar, including IRI-reference).

Allowing lower-case hex characters

Agreed. Thank you; good catch.

Test cases

Agreed.

The directory contains a file with 110 URIs gathered from the examples used in the specification to illustrate various syntactic possibilities. Turning that into a set of tests for the test collection strikes me as a good idea, as does making similar test collections for the other sample grammars. For at least a few grammars, I think it would be good to have a thorough set of positive and negative test cases; currently, I believe we achieve that (or come close) for the specification grammar, but not for any others. The real-world grammars in our samples directory are the best candidates for that treatment.

My apologies if anything in this comment seems terse or ungenerous; my ego seems to be reacting with less equanimity than one could wish to some of the wording in the comments on this issue.

spemberton commented 1 year ago

My apologies if anything in this comment seems terse or ungenerous; my ego seems to be reacting with less equanimity than one could wish to some of the wording in the comments on this issue.

Oh, I'm sorry if I offended you. Recognising the grammar as a direct transliteration of the RFC 3987 grammar, I didn't think you would feel any personal ownership, otherwise I would have tempered my language.

Any criticism there was was directed entirely at Messrs Duerst and Suignard (both of whom I know personally) for the inconsistencies in their grammar, even though I am entirely grateful that they produced such a grammar. Try to find one for internationalised email addresses and you end up in a twisty maze of passages all alike.

(I should point out that I was forced to turn RFC 3987 into a regular expression for the XForms spec.)

But we should recognise that the purpose of RFC 3987 is to define the syntax of a correct IRI; our purpose on the other hand is to reveal the structure.

For that reason I personally would prefer, to take an example, the (sub-)grammar for IPv6 to have the form:

IPv6: h4**":";
      h4**":", zeros, h4**":".
h4: h;
    h, h;
    h, h, h;
    h, h, h, h.
zeros: "::".
h: ["0"-"9"; "A"-"F"; "a"-"f"].

rather than the hoops that they have to jump through to ensure that there are no more than 8 colons in an IPv6 address.

spemberton commented 1 year ago

Some of the suggestions in this issue seem to me to make sense; others do not.

Our judgement may depend on what we think the purpose of the exercise is. My goal was an ixml translation of the grammar in the RFC, with marks to make the XML nicer (for some subjective judgement of 'niceness'). I did not think the goal was to suggest improvements to the normative grammar in the RFC.

Absolutely understood. But as I also said elsewhere, published syntaxes typically exist to define what is correct, while our aim is to expose structure. The imperfect syntax of ihost in RFC 3987 is a case in point.

I don't object in principle to a sample grammar that deviates in well defined ways from the normative grammar for the language in question, but I think it needs to be strongly motivated and the deviations clearly explained. If we think, for example, that the ixml grammar would be more useful if we made host and ihost unambiguous, or if ireg-name were defined as

ireg-name = label ++ ".".
label = ...

or as

ireg-name = (sub-domain ** ".", ".")?, TLD.
sub-domain = label.
TLD = label.
-label = ...

then we can do so, but we need to explain (first to each other and then to the public) why we think that's more helpful and what class of domain names will be grammatical in the normative grammar but ungrammatical in ours, or vice versa, and why we think deviating from the normative spec for those domain names will probably not matter in practice. So far, I haven't seen any reason to change my understanding of the goal of these grammars.

I think one principle of ixml supplied grammars should be: you don't need to reparse any subtrees.

It would probably be better to make IRI-reference the start symbol for the IRI grammar

Sounds good, then several other nonterminals become reachable (but still not absolute-IRI, ipath, reserved, gen-delims, CR, DQUOTE, LF and SP.)

Steven

spemberton commented 1 year ago

(Sorry, ctrl-return sends the message, so if I take my finger off the ctrl too late, it sends. Here is the message as intended.)

My apologies if anything in this comment seems terse or ungenerous; my ego seems to be reacting with less equanimity than one could wish to some of the wording in the comments on this issue.

Oh, I'm sorry if I offended you. Recognising the grammar as a direct transliteration of the RFC 3987 grammar, I didn't think you would feel any personal ownership, otherwise I would have tempered my language.

Any criticism there was was directed entirely at Messrs Duerst and Suignard (both of whom I know personally) for the inconsistencies in their grammar, even though I am entirely grateful that they produced such a grammar. Try to find one for internationalised email addresses and you end up in a twisty maze of passages all alike.

(I should point out that I was forced to turn RFC 3987 into a regular expression for the XForms spec.)

But we should recognise that the purpose of RFC 3987 is to define the syntax of a correct IRI; our purpose on the other hand is to reveal the structure.

For that reason I personally would prefer, to take an example, the (sub-)grammar for IPv6 to have a form like:

IPv6: h4**":";
      h4**":", zeros, h4**":".
h4: h;
    h, h;
    h, h, h;
    h, h, h, h.
zeros: "::".
-h: ["0"-"9"; "A"-"F"; "a"-"f"].

rather than the hoops that they have to jump through to ensure that there are no more than 8 colons in an IPv6 address.

This makes our grammar easier to manage, and easier to read, at the expense of allowing more than 8 colons in an IPv6 address. Is that good or bad? It depends.
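To make the trade-off concrete, here is a Python sketch that transcribes the relaxed IPv6 rules into a regular expression and probes what they accept. The regex is my own transliteration of the rules (an assumption, not processor output), and the names `RELAXED_IPV6` and `accepts` are hypothetical.

```python
import re

H4 = r"[0-9A-Fa-f]{1,4}"              # h4: one to four hex digits
H4_LIST = rf"(?:{H4}(?::{H4})*)?"     # h4**":" (zero or more, colon-separated)
# IPv6: h4**":" ; h4**":", zeros, h4**":".
RELAXED_IPV6 = re.compile(rf"{H4_LIST}(?:::{H4_LIST})?")


def accepts(text: str) -> bool:
    """True if the relaxed IPv6 rules match the whole string."""
    return RELAXED_IPV6.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ("2001:db8::1", "1:2:3:4:5:6:7:8:9", "", "1:::2"):
        print(repr(candidate), accepts(candidate))
```

Besides over-long addresses such as "1:2:3:4:5:6:7:8:9", the rules as written also accept the empty string, because `**` allows zero repetitions; using `++` in the first alternative would rule that out.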

Steven

On Friday 12 August 2022 22:26:00 (+02:00), Steven Pemberton wrote:


https://github.com/invisibleXML/ixml/blob/master/samples/URI/rfc-3987.ixml #139