invisibleXML / ixml

Invisible XML
GNU General Public License v3.0

https://github.com/invisibleXML/ixml/blob/master/samples/URI/rfc-3987.ixml #139

Open spemberton opened 1 year ago

spemberton commented 1 year ago

Although this is straight out of the RFC, it is not good enough for proper use. HEXDIG should include "a"-"f"; ipchar, iunreserved and ucschar should each have a "-" before the rule name. The grammar is also ambiguous, but that needs work to investigate (I'm on it).

spemberton commented 1 year ago

Also, there are unused rules:

    **** Unused rules: {"CR"; "DQUOTE"; "IRI-reference"; "LF"; "SP"; "absolute-IRI"; "ipath"; "reserved"}

ndw commented 1 year ago

Regarding "a"-"f", the ABNF doesn't include the lowercase versions, but the relevant part of RFC 2234 is apparently:

   NOTE:     ABNF strings are case-insensitive and
             the character set for these strings is us-ascii.

So all quoted strings in the ABNF form have to be changed to support mixed case.
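To make the case-sensitivity point concrete, here is a small Python sketch (not part of the sample grammar) that contrasts a literal, uppercase-only reading of `HEXDIG` with the case-insensitive reading that the ABNF note implies. The regular expressions and the helper name `is_pct_encoded` are illustrative assumptions; they only model `pct-encoded = "%" HEXDIG HEXDIG`.

```python
import re

# Literal reading of HEXDIG as spelled in the ABNF: "A".."F" only.
UPPER_ONLY = re.compile(r"%[0-9A-F]{2}")

# Reading that honours the note that ABNF strings are case-insensitive.
CASE_INSENSITIVE = re.compile(r"%[0-9A-Fa-f]{2}")


def is_pct_encoded(text, pattern=CASE_INSENSITIVE):
    """Return True if text is exactly one pct-encoded triplet under pattern."""
    return pattern.fullmatch(text) is not None
```

Under the literal reading `"%e9"` is rejected, which is the symptom reported above; an ixml character set such as `["0"-"9"; "A"-"F"; "a"-"f"]` corresponds to the second pattern.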

spemberton commented 1 year ago

Removing unused rules, more rules become unused:

    Unused rules: {"IRI-reference"; "gen-delims"}
    Unused rule: {"irelative-ref"}
    Unused rule: {"irelative-part"}
    Unused rule: {"ipath-noscheme"}
    **** Unused rule: {"isegment-nz-nc"}
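The cascade above is what one would expect if a processor reports only rules that nothing else references: each removal exposes the next link in the chain. The Python sketch below illustrates that on a deliberately simplified dependency map (the `RULES` dictionary is a hypothetical reduction of the grammar, not the real rule bodies), and contrasts it with a reachability check from the start symbol, which finds the whole chain in one pass.

```python
def unreferenced(rules, start):
    """Rules that no other rule mentions; the start symbol is exempt."""
    used = {ref for name, refs in rules.items() for ref in refs if ref != name}
    return {name for name in rules if name != start and name not in used}


def prune_rounds(rules, start):
    """Repeatedly delete unreferenced rules, recording each round."""
    rules = dict(rules)
    rounds = []
    while True:
        dead = unreferenced(rules, start)
        if not dead:
            return rounds
        rounds.append(sorted(dead))
        for name in dead:
            del rules[name]


def unreachable(rules, start):
    """Rules that cannot be reached from the start symbol at all."""
    seen, todo = set(), [start]
    while todo:
        name = todo.pop()
        if name in seen:
            continue
        seen.add(name)
        todo.extend(rules.get(name, ()))
    return set(rules) - seen


# Hypothetical, heavily simplified map of rule name to referenced rules.
RULES = {
    "IRI": ["scheme", "ihier-part"],
    "scheme": [],
    "ihier-part": [],
    "IRI-reference": ["IRI", "irelative-ref"],
    "irelative-ref": ["irelative-part"],
    "irelative-part": ["ipath-noscheme"],
    "ipath-noscheme": ["isegment-nz-nc"],
    "isegment-nz-nc": [],
}
```

With `IRI` as the start symbol, `prune_rounds(RULES, "IRI")` peels off one rule per round, while `unreachable(RULES, "IRI")` returns all five at once.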

spemberton commented 1 year ago

One source of ambiguity is:

    The input from line.pos 9.8 to 9.19 can be interpreted as 'ihost' in 2 different ways:
    1: ihost[9.8:]: IPv4address[:9.19]
    2: ihost[9.8:]: ireg-name[:9.19]

This is because "192.168.0.org" is a valid ireg-name, and the RFC authors don't bother to distinguish the two cases.

That is, "192.168.0.0" matches ireg-name anyway.

And that is because they are lazy and don't discern subdomains, just allowing a host to be any mixture of ALPHA | DIGIT | "-" | "." | "_" | "~" | ucschar. (which I believe isn't syntactically valid)
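The overlap can be reproduced outside an ixml processor. The following Python sketch models `IPv4address` and an ASCII-only approximation of `ireg-name` as regular expressions and reports which readings apply to a host string. It is an illustration under stated assumptions: `ucschar` and `IP-literal` are left out, and the names `IPV4`, `IREG_NAME` and `ihost_readings` are invented for this example.

```python
import re

# dec-octet: 0-255 with no leading zeros.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4 = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")

# ireg-name, ASCII subset only: unreserved / pct-encoded / sub-delims, repeated.
IREG_NAME = re.compile(r"(?:[A-Za-z0-9\-._~]|%[0-9A-Fa-f]{2}|[!$&'()*+,;=])*")


def ihost_readings(host):
    """List every alternative of ihost (in this sketch) that matches host."""
    readings = []
    if IPV4.fullmatch(host):
        readings.append("IPv4address")
    if IREG_NAME.fullmatch(host):
        readings.append("ireg-name")
    return readings
```

Any string that yields two readings here is one an Earley-style parser will report as ambiguous for `ihost`, which is the situation shown in the processor output above.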

spemberton commented 1 year ago

> Regarding "a"-"f", the ABNF doesn't include the lowercase versions, but the relevant part of RFC 2234 is apparently:
>
>    NOTE:     ABNF strings are case-insensitive and
>              the character set for these strings is us-ascii.
>
> So all quoted strings in the ABNF form have to be changed to support mixed case.

I believe all other parts of the grammar already support mixed case.

ndw commented 1 year ago

It might be useful to write test cases against the sample grammars. My processor in --pedantic mode would have flagged the unused productions, I think.

spemberton commented 1 year ago

Commenting out the use of ipv4 in ihost makes all my test examples (not a huge number) unambiguous.

cmsmcq commented 1 year ago

Some of the suggestions in this issue seem to me to make sense; others do not.

Our judgement may depend on what we think the purpose of the exercise is. My goal was an ixml translation of the grammar in the RFC, with marks to make the XML nicer (for some subjective judgement of 'niceness'). I did not think the goal was to suggest improvements to the normative grammar in the RFC.

I don't object in principle to a sample grammar that deviates in well defined ways from the normative grammar for the language in question, but I think it needs to be strongly motivated and the deviations clearly explained. If we think, for example, that the ixml grammar would be more useful if we made host and ihost unambiguous, or if ireg-name were defined as

ireg-name = label ++ ".".
label = ...

or as

ireg-name = (sub-domain ** ".", ".")?, TLD.
sub-domain = label.
TLD = label.
-label = ...

then we can do so, but we need to explain (first to each other and then to the public) why we think that's more helpful and what class of domain names will be grammatical in the normative grammar but ungrammatical in ours, or vice versa, and why we think deviating from the normative spec for those domain names will probably not matter in practice. So far, I haven't seen any reason to change my understanding of the goal of these grammars.

My apologies if anything in this comment seems terse or ungenerous; my ego seems to be reacting with less equanimity than one could wish to some of the wording in the comments on this issue.

spemberton commented 1 year ago

> My apologies if anything in this comment seems terse or ungenerous; my ego seems to be reacting with less equanimity than one could wish to some of the wording in the comments on this issue.

Oh, I'm sorry if I offended you. Recognising the grammar as a direct transliteration of the RFC 3987 grammar, I didn't think you would feel any personal ownership, otherwise I would have tempered my language.

Any criticism that there was was entirely directed at Messrs Duerst and Suignard (both of whom I know personally) for the inconsistencies in their grammar, even though I am entirely grateful that they produced such a grammar. Try to find one for internationalised email addresses and you end up in a twisty maze of passages all alike.

(I should point out that I was forced to turn RFC 3987 into a regular expression for the XForms spec.)

But we should recognise that the purpose of RFC 3987 is to define the syntax of a correct IRI; our purpose on the other hand is to reveal the structure.

For that reason I personally would prefer, to take an example, the (sub-)grammar for IPv6 to have the form:

    IPv6: h4**":";
          h4**":", zeros, h4**":".
    h4: h;
        h, h;
        h, h, h;
        h, h, h, h.
    zeros: "::".
    h: ["0"-"9"; "A"-"F"; "a"-"f"].

rather than the hoops that they have to jump through to ensure that there are no more than 8 colons in an IPv6 address.

spemberton commented 1 year ago

> Some of the suggestions in this issue seem to me to make sense; others do not.
>
> Our judgement may depend on what we think the purpose of the exercise is. My goal was an ixml translation of the grammar in the RFC, with marks to make the XML nicer (for some subjective judgement of 'niceness'). I did not think the goal was to suggest improvements to the normative grammar in the RFC.

Absolutely understood. But as I also said elsewhere, published syntaxes typically define what is correct, while our aim is to expose structure, the imperfect syntax of ihost in RFC 3987 being a case in point.

> I don't object in principle to a sample grammar that deviates in well defined ways from the normative grammar for the language in question, but I think it needs to be strongly motivated and the deviations clearly explained. If we think, for example, that the ixml grammar would be more useful if we made host and ihost unambiguous, or if ireg-name were defined as
>
>     ireg-name = label ++ ".".
>     label = ...
>
> or as
>
>     ireg-name = (sub-domain ** ".", ".")?, TLD.
>     sub-domain = label.
>     TLD = label.
>     -label = ...
>
> then we can do so, but we need to explain (first to each other and then to the public) why we think that's more helpful and what class of domain names will be grammatical in the normative grammar but ungrammatical in ours, or vice versa, and why we think deviating from the normative spec for those domain names will probably not matter in practice. So far, I haven't seen any reason to change my understanding of the goal of these grammars.

I think one principle of ixml supplied grammars should be: you don't need to reparse any subtrees.

> It would probably be better to make IRI-reference the start symbol for the IRI grammar

Sounds good, then several other nonterminals become reachable (but still not absolute-IRI, ipath, reserved, gen-delims, CR, DQUOTE, LF and SP.)

Steven

spemberton commented 1 year ago

(Sorry, ctrl-return sends the message, so if I take my finger off the ctrl too late, it sends. Here is the message as intended.)

> My apologies if anything in this comment seems terse or ungenerous; my ego seems to be reacting with less equanimity than one could wish to some of the wording in the comments on this issue.

Oh, I'm sorry if I offended you. Recognising the grammar as a direct transliteration of the RFC 3987 grammar, I didn't think you would feel any personal ownership, otherwise I would have tempered my language.

Any criticism that there was was entirely directed at Messrs Duerst and Suignard (both of whom I know personally) for the inconsistencies in their grammar, even though I am entirely grateful that they produced such a grammar. Try to find one for internationalised email addresses and you end up in a twisty maze of passages all alike.

(I should point out that I was forced to turn RFC 3987 into a regular expression for the XForms spec.)

But we should recognise that the purpose of RFC 3987 is to define the syntax of a correct IRI; our purpose on the other hand is to reveal the structure.

For that reason I personally would prefer, to take an example, the (sub-)grammar for IPv6 to have a form like:

    IPv6: h4**":";
          h4**":", zeros, h4**":".
    h4: h;
        h, h;
        h, h, h;
        h, h, h, h.
    zeros: "::".
    -h: ["0"-"9"; "A"-"F"; "a"-"f"].

rather than the hoops that they have to jump through to ensure that there are no more than 8 colons in an IPv6 address.

This makes our grammar easier to manage, and easier to read, at the expense of allowing more than 8 colons in an IPv6 address. Is that good or bad? It depends.
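To see the trade-off in practice, the simplified IPv6 rule can be transcribed into a regular expression. This Python sketch is only an approximation of the proposed ixml (it mirrors `h4**":"` as an optional colon-separated list and allows at most one `"::"`); the names `H4`, `GROUPS`, `IPV6` and `matches_simplified_ipv6` are assumptions made for the example.

```python
import re

# h4: one to four hex digits, either case.
H4 = r"[0-9A-Fa-f]{1,4}"

# h4**":" : zero or more h4 groups separated by single colons.
GROUPS = rf"(?:{H4}(?::{H4})*)?"

# IPv6: a group list, optionally split once by the "::" zeros marker.
IPV6 = re.compile(rf"{GROUPS}(?:::{GROUPS})?")


def matches_simplified_ipv6(text):
    """Return True if text fits the simplified, structure-only IPv6 rule."""
    return IPV6.fullmatch(text) is not None
```

The pattern accepts ordinary addresses such as `2001:db8::1`, but it also accepts a ten-group string like `1:2:3:4:5:6:7:8:9:10`, which the RFC grammar rejects. That over-acceptance is exactly the cost being weighed against readability.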

Steven
