EBNF terminal production number consistency

gkellogg commented 1 year ago

Historically, the numbers on the terminal productions have used the same order and numbering as originally used in SPARQL. These seem to have drifted over time. For example, PNAME_NS in SPARQL is [140] and in Turtle is [139s]. ECHAR is [160] in SPARQL, [159s] in Turtle, and [153s] in N-Triples.

Once the dust has settled on other grammar changes, we might want to strive for some consistency across the different grammars.

afs commented 1 year ago

There is more non-alignment than alignment at the moment. I don't know of any real problems this has caused.

Personally, I think there is a downside to aligning the numbering.

They are like line numbers. A grammar with gaps in its rule numbering (1, 2, 3, 7, 8, 9) suggests that something got lost, or that something has to be found elsewhere.

Is there anywhere that rules are referred to by number? The text refers to rules by name, which is clearer for the reader.

gkellogg commented 1 year ago

No, numbers serve no purpose, and many EBNF variants use no numbers at all.

gkellogg commented 1 year ago

In some cases, rule numbers were taken from other grammars and a signifier was added to reflect this provenance. For example, from Turtle:

[128s] RDFLiteral ::= String ( LANGTAG | ( "^^" iri ) )? 
[133s] BooleanLiteral ::= "true"  | "false" 
[135s] iri ::= IRIREF  | PrefixedName 
[136s] PrefixedName ::= PNAME_LN  | PNAME_NS 
[137s] BlankNode ::= BLANK_NODE_LABEL  | ANON 
[139s] PNAME_NS ::= PN_PREFIX? ":" 
[140s] PNAME_LN ::= PNAME_NS PN_LOCAL 
[144s] LANGTAG ::= "@" [a-zA-Z]+ ( "-" [a-zA-Z0-9]+ )* 
[154s] EXPONENT ::= [eE] [+-]? [0-9]+ 
[159s] ECHAR ::= "\" [tbnrf\"']  
[160s] NIL ::= "(" WS* ")" 
[161s] WS ::= #x20 | #x9 | #xD | #xA
[162s] ANON ::= "[" WS* "]" 
[163s] PN_CHARS_BASE ::= [A-Z] 
 | [a-z] 
 | [#00C0-#00D6] 
 | [#00D8-#00F6] 
 | [#00F8-#02FF] 
 | [#0370-#037D] 
 | [#037F-#1FFF] 
 | [#200C-#200D] 
 | [#2070-#218F] 
 | [#2C00-#2FEF] 
 | [#3001-#D7FF] 
 | [#F900-#FDCF] 
 | [#FDF0-#FFFD] 
 | [#10000-#EFFFF] 
[164s] PN_CHARS_U  ::=  PN_CHARS_BASE  | '_' 
[166s] PN_CHARS ::= PN_CHARS_U 
 | "-" 
 | [0-9] 
 | #00B7 
 | [#0300-#036F] 
 | [#203F-#2040] 
[167s] PN_PREFIX ::= PN_CHARS_BASE ( ( PN_CHARS | "." )* PN_CHARS )?  
[168s] PN_LOCAL ::= ( PN_CHARS_U | ':' | [0-9] | PLX ) ( ( PN_CHARS | '.' | ':' | PLX )*  ( PN_CHARS | ':' | PLX ) ) ?
[169s] PLX ::= PERCENT | PN_LOCAL_ESC
[170s] PERCENT ::= '%' HEX HEX
[171s] HEX ::= [0-9] | [A-F] | [a-f]
[172s] PN_LOCAL_ESC ::= '\' ( '_' | '~' | '.' | '-' | '!' | '$' | '&' | "'" | '(' | ')' | '*' | '+' | ',' | ';' | '=' | '/' | '?' | '#' | '@' | '%' )

all came from SPARQL. The only case where I think this is significant is for terminal productions that might be implemented via a native regular expression on different platforms (such as my own) which allows complicated patterns to be shared between implementations. It also allows us to see where definitions have drifted. Personally, I don't think this is particularly important for normal productions, unless they're fairly complicated, but can be useful for terminal productions, such as PN_CHARS_BASE.
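As a rough illustration of sharing a terminal as a native regular expression, here is a minimal Python sketch that builds a character class from the PN_CHARS_BASE ranges quoted above, plus HEX and PERCENT. The names `char_class`, `PN_CHARS_BASE_RANGES`, and the compiled patterns are hypothetical and do not come from any of the implementations discussed here.

```python
import re

# Code point ranges copied from the PN_CHARS_BASE production quoted above.
PN_CHARS_BASE_RANGES = [
    (0x41, 0x5A), (0x61, 0x7A),
    (0x00C0, 0x00D6), (0x00D8, 0x00F6), (0x00F8, 0x02FF),
    (0x0370, 0x037D), (0x037F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]


def char_class(ranges):
    """Build a regex character class from (low, high) code point pairs."""
    # \UXXXXXXXX escapes avoid embedding raw non-ASCII characters in the pattern.
    parts = [f"\\U{lo:08X}-\\U{hi:08X}" for lo, hi in ranges]
    return "[" + "".join(parts) + "]"


# Terminals expressed as pattern strings so they can be composed.
PN_CHARS_BASE = char_class(PN_CHARS_BASE_RANGES)
HEX = "[0-9A-Fa-f]"
PERCENT = "%" + HEX + HEX

PN_CHARS_BASE_RE = re.compile(PN_CHARS_BASE)
PERCENT_RE = re.compile(PERCENT)
```

Keeping the ranges in one table would make drift between grammars easy to spot with a plain comparison, although this sketch does not attempt that.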

I think rule numbering is useful for simple reference, and it has some history. But, maybe we should only look to preserve it for a subset of terminal rules which are common across N-Triples, N-Quads, Turtle, TriG, and SPARQL.

Looking at EBNF Notation, you won't find any description of rule numbering. I codified the grammars used by my EBNF tool in the README, and noted an ambiguity that production numbers actually introduce. But, I implement the following rule for production numbers:

All rules MAY start with an identifier, contained within square brackets. For example [1] rule, where the value within the brackets is a symbol ([a-z] | [A-Z] | [0-9] | "_" | ".")+
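A minimal Python sketch of that rule, for illustration only: it accepts an optional bracketed identifier made of the characters listed above, followed by a rule name and `::=`. The pattern for the rule name and the function `parse_rule` are assumptions of this sketch, not taken from the EBNF tool, and it makes no attempt to resolve the ambiguity mentioned above.

```python
import re

# Optional "[id]" prefix, then a rule name, "::=", and the rest of the line.
# The identifier uses the symbol characters from the rule above;
# the rule-name pattern is an assumption made for this sketch.
RULE_RE = re.compile(
    r"^\s*(?:\[(?P<id>[A-Za-z0-9_.]+)\]\s*)?"
    r"(?P<name>[A-Za-z_][A-Za-z0-9_]*)\s*::=\s*(?P<body>.*)$"
)


def parse_rule(line):
    """Return (identifier or None, rule name, body), or None if not a rule start."""
    m = RULE_RE.match(line)
    if m is None:
        return None
    return m.group("id"), m.group("name"), m.group("body").strip()


# Numbered, prefixed, and unnumbered forms all parse the same way.
print(parse_rule("[171s] HEX ::= [0-9] | [A-F] | [a-f]"))
print(parse_rule("[s171] HEX ::= [0-9] | [A-F] | [a-f]"))
print(parse_rule("HEX ::= [0-9] | [A-F] | [a-f]"))
```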

This seems to capture the range of production numbers in specs I surveyed at the time. We could consider using a common number prefix, rather than suffix, which could help protect against future re-numbering, for example [s171] HEX ::= [0-9] | [A-F] | [a-f] instead of [171s] for whatever number range we wanted to share. But, we may want to limit shared rules to terminal productions, and not attempt to preserve numbering for non-terminal productions.

afs commented 1 year ago

We have names to show terminals in common.

The SPARQL grammar is produced by a script, with automatic, incremental numbering. The grammar - and the terminal naming - predates the standardized Turtle.

There is no gain in now requiring manual management of the numbering for SPARQL as well.

As we go forward to living specs, managing numbers across SPARQL and the data languages will be a burden with, so far, no clear gain. We have names.

Suppose a change is best expressed as a new production early in the grammar, so that every number after it changes. There is a danger that SPARQL changes would then force the RDF documents to be republished just for that.

Missing numbers can be confusing for the reader.

gkellogg commented 1 year ago

I can support that. How would you feel about removing production numbers altogether?

afs commented 1 year ago

For SPARQL? No. They are there, have been there from SPARQL 1.0 - leave them (they can move and have moved from version to version. They are line numbers.)

For data languages? The use of this numbering style extends beyond these specs - and that's aside from the use of the specific *s designations.

I wonder what other WG members think.

Personally, I think we should have numbers, but make them "line numbers" that can move if the grammar changes, i.e. the HTML generation step puts in the numbers, and the EBNF files do not need them, since there they require more manual maintenance.

(TriG has 3 different "6" rules - 6g, 6s and 6.)

gkellogg commented 1 year ago

This was addressed in #27 by treating the production numbers as simple line numbers, which bear no relationship to numbers from other grammars.
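For illustration, a minimal Python sketch of numbering productions as plain line numbers at generation time. It is not the script used for #27 or for the SPARQL grammar; the function `renumber` and the rule-start pattern are assumptions of this sketch. Any existing bracketed identifier is dropped and replaced by a sequential number, and continuation lines are left alone.

```python
import re

# A rule starts with an optional "[id]" and then "name ::=".
# Continuation lines (for example " | [a-z]") do not match and are kept as-is.
RULE_START = re.compile(
    r"^(?:\[[A-Za-z0-9_.]+\]\s*)?(?P<rest>[A-Za-z_][A-Za-z0-9_]*\s*::=.*)$"
)


def renumber(lines):
    """Assign sequential numbers to rule-start lines, ignoring old numbers."""
    out = []
    n = 0
    for line in lines:
        m = RULE_START.match(line)
        if m:
            n += 1
            out.append(f"[{n}] {m.group('rest')}")
        else:
            out.append(line)
    return out


grammar = [
    "[163s] PN_CHARS_BASE ::= [A-Z]",
    " | [a-z]",
    "[164s] PN_CHARS_U ::= PN_CHARS_BASE | '_'",
    "[166s] PN_CHARS ::= PN_CHARS_U",
]
print("\n".join(renumber(grammar)))
```

Because the numbers are derived from position alone, inserting a production early in the grammar shifts every later number without any manual edits to the EBNF source.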

w3c / rdf-turtle

EBNF terminal production number consistency #7