Grammar should include `step` as `terminus` option, require `termstep` in `local_path`

GoogleCodeExporter commented 9 years ago

The current spec sort of conflates two rather distinct uses of the "step" 
construct.

One is navigational, e.g. "A step with a slash (/) followed by an integer 
refers to a child node or nodes"
And (because a path may end in a step) the other is terminating, e.g. "Single path notation 
always denotes a location point"

The "Step Reference to Child Node" definition section is where the distinction 
is ignored. The canonical numbering strategy is ± the same for the two types 
of steps, but its meaning (referencing a *node* vs. pointing to a *boundary*) 
is actually very different.

For example, an odd numbered index cannot be navigated except by a terminating 
character offset, and a 0 index is essentially a terminating offset, not a 
navigable step. None of this is captured in the grammar, and the Step Reference 
section is very navigation-focused which I think muddles its understanding of 
the terminating meaning.
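
A minimal sketch of the numbering discussed above, assuming the usual CFI convention that even indices select child elements and odd indices select the chunks of character data around them. The function name and the returned labels are hypothetical, chosen only for illustration:

```python
def classify_step_index(n: int, element_count: int) -> str:
    """Classify a /N index under a parent that has `element_count` child elements."""
    if n < 0 or n > 2 * element_count + 2:
        raise ValueError(f"index {n} is out of range")
    if n % 2 == 1:
        # Odd: a chunk of character data; nothing can be stepped into from here.
        return "character data"
    if n == 0 or n == 2 * element_count + 2:
        # Even, but before the first or after the last real child element.
        return "virtual element"
    return "element"
```

For a parent with two child elements, only indices 2 and 4 can be traversed further; 0, 1, 3, 5 and 6 can only end a path.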

Original issue reported on code.google.com by nat...@gmail.com on 8 May 2013 at 6:20

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

From issue #351, here is some informative text that may be desirable to add to 
the Step Reference section:

<<<<<

Note that a path refers to a location point, yet may end in a step reference. 
Thus the last step in a path without a "terminating step" is not navigable: it 
may represent an empty collection of text content nodes or refer to an element 
position before or after a text content collection, and regardless represents 
the location between nodes and not a node (or collection of nodes) itself. This 
is consistent with other similar representations (e.g. boundary points in the 
DOM Selection and Range definitions) and allowing local paths to end in step 
references that represent terminating points rather than navigable nodes 
facilitates interoperability with those implementations.

Original comment by nat...@gmail.com on 8 May 2013 at 6:28

GoogleCodeExporter commented 9 years ago

Original comment by daniel.weck on 8 May 2013 at 6:32

Added labels: Spec-CFI, Revision-301, Priority-Medium

GoogleCodeExporter commented 9 years ago

Original comment by daniel.weck on 8 May 2013 at 7:25

Changed state: NeedsDiscussion

GoogleCodeExporter commented 9 years ago

Email discussion:

https://groups.google.com/forum/#!topic/epub-working-group/ajYExeF7_rs

Original comment by daniel.weck on 23 May 2013 at 3:16

GoogleCodeExporter commented 9 years ago

Hi natevw, perhaps the EBNF terms "termstep" and "terminus" are misleading, as 
they indeed seem to imply that "step" is not designed to be the last item in a 
CFI expression, yet it can effectively be (legally) the "leaf" of a CFI 
reference. The question is whether or not this reflects a practical reality, 
for example: would a reading system link directly to a <br/> line break 
element, or to the "empty child" inside the element's content => either the 
empty chunk of character data, or the virtual first/last elements at index 0 or 
n+2 (which is not recommended as per the SHOULD NOT conformance requirement).
Personally I think that the specification is fine as it stands now, but perhaps 
I am missing your point?
Daniel

Latest editor's draft:
https://epub-revision.googlecode.com/svn/trunk/build/linking/cfi/epub-cfi.html#sec-epubcfi-syntax

Latest published specification:
http://www.idpf.org/epub/linking/cfi/#sec-epubcfi-syntax

Original comment by daniel.weck on 23 May 2013 at 4:01

GoogleCodeExporter commented 9 years ago

It's a difference in meaning.

Consider the path "2/4/1:1" applied to 
`<parent><child>first</child><child>second</child></parent>`. The "4" is a 
navigational step referring to the second child.

Now consider the path "2/4" applied to same. The "4" does NOT refer to the 
second child, but rather the location between it and the (empty) text group 
after the first child. To quote the spec: "Single path notation always denotes 
a location point".
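
To make the two readings concrete, here is a rough sketch that walks such a path over the example markup using Python's `xml.etree.ElementTree`. It assumes the path starts at a virtual document node, ignores assertions and indirection, and only reports which node the last index selects; whether a trailing "/4" then means that node or a boundary point next to it is the open question in this issue. The helper names `child_lists` and `resolve` are hypothetical.

```python
import xml.etree.ElementTree as ET

def child_lists(elem):
    """Return (child elements, text chunks interleaved around them)."""
    elements = list(elem)
    chunks = [elem.text or ""] + [child.tail or "" for child in elements]
    return elements, chunks

def resolve(root, path):
    """Resolve a path such as '2/4/1:1' from a virtual document node."""
    steps, _, offset = path.partition(":")
    elements, chunks = [root], ["", ""]  # the document node has one element child
    target = None
    for n in (int(s) for s in steps.split("/")):
        if isinstance(target, str):
            raise ValueError("cannot step into character data")
        if n % 2 == 0:
            if not 2 <= n <= 2 * len(elements):
                raise ValueError(f"no element at index {n}")
            target = elements[n // 2 - 1]
            elements, chunks = child_lists(target)
        else:
            if not 1 <= n <= 2 * len(elements) + 1:
                raise ValueError(f"no character data at index {n}")
            target = chunks[n // 2]
    return target, (int(offset) if offset else None)

doc = ET.fromstring("<parent><child>first</child><child>second</child></parent>")
text_target = resolve(doc, "2/4/1:1")   # the "4" is traversed on the way to the text
step_target = resolve(doc, "2/4")       # the "4" is the last item in the path
```

In the first call the "4" is only a waypoint; in the second it is the whole answer, which is where the node-versus-location distinction matters.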

The grammar makes no distinction between these two types of references. And so 
this sentence cannot be corrected: "A step with a slash (/) followed by an 
integer refers to a child node or nodes"

That sentence is wrong because not every "step" refers to a child node or 
nodes. Some steps (the ones I would call "navigational") do. Others (the ones I 
would call "terminal") do not: if a path (which ALWAYS denotes a location 
point) ends in a step then that step represents a point just as all the 
existing terminus types do.

There's no grammatical difference between the two, both are `"/" , integer , [ 
"[" , assertion , "]" ] ;` but there is a semantic difference.

Here's sort of the gist of what I'm proposing…

fragment = "epubcfi(" , nav_path , ( range | term_path ) , ")" ;
nav_path = { step }- ;
term_path = { step } , [ termstep ] ;
range = "," , term_path , "," , term_path ;
(* drop local_path *)

…however, I have left out any construct necessary to represent path 
indirection (the existing redirected_path construct) so this is NOT a direct 
proposal in itself, just illustrative of the semantic difference I see. If this 
were done, then the 3.1.1 portion of the spec could say something like "A 
navigational step refers to a child node or nodes … a terminal step is 
numbered similarly but refers to [etc.]"

Original comment by nat...@gmail.com on 23 May 2013 at 4:38

GoogleCodeExporter commented 9 years ago

Regarding:

"A step with a slash (/) followed by an integer refers to a child node or nodes"
That sentence is wrong because not every "step" refers to a child node or nodes.

Could you please use the updated terminology, as this is now obsolete. See:

https://groups.google.com/d/msg/epub-working-group/HC_hS7ae6mo/dm54uIui_QAJ

Meanwhile, I am reading your comment further :)

/Dan

Original comment by daniel.weck on 23 May 2013 at 4:44

GoogleCodeExporter commented 9 years ago

The updated specification prose says: "A [step] with a slash (/) followed by a 
positive integer refers to either a child element or a chunk of character data, 
as per the rules defined herein..."

The term "refers" is correctly used here, as it covers both "navigational" and 
"terminating" [steps] without ambiguity (see below).

The original prose (untouched) says: "[Steps] can either be navigational or 
terminating. Navigational [steps] may be repeated as necessary (e.g., ...). 
There may be only one terminating [step], which, if present, must be the last 
[step] in the sequence."

The problem is that what the prose says in "plain english" is not accurately 
matched by the EBNF grammar: a [step] (i.e. not a [termstep]) can in fact 
effectively "terminate" a CFI expression (i.e. be the last item), or it can be 
used to traverse / walk further down into the XML tree (in which case the term 
used to describe this is "navigational"). Furthermore, a [step] can refer to 
either XML element or text, the former being a natural candidate to "terminate" 
a CFI expression (by defining a location corresponding to the opening tag of an 
XML element, with the addition of an optional side bias useful in breaking 
context), whereas the latter is really supposed to be followed by a [termstep] 
of type "character offset" (but not necessarily, as per the syntax rules).

So, I suggest that we fix the EBNF production rule "termstep" as follows (to 
include [step]):

termstep = step | ( terminus , [ "[" , assertion , "]" ] );

This way, it is clear that both an element or a chunk of character data are 
considered valid terminating "locations" in an XML document. However, just like 
we did with "virtual" elements (+ the issue of empty first/last character 
data), I suggest that we use a SHOULD NOT conformance requirement for the 
production of CFI expressions with a terminating step that refers to character 
data without an explicit character offset. In other words, reading systems MUST 
be capable of consuming (parse + interpret / render) the implicit location just 
before the first character of a terminating character data step, but they 
SHOULD generate explicit character offsets when creating such CFI location.
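
As an illustration of the consuming side of that suggestion, a processor could normalise a path that ends in a bare odd /N step by appending the implicit offset. This is only a sketch: the function name is hypothetical, and it deliberately handles only paths that end in a plain integer step with no assertion.

```python
import re

def with_explicit_offset(path: str) -> str:
    """Append ':0' when a path ends in an odd /N step without an offset."""
    match = re.search(r"/(\d+)$", path)
    if match and int(match.group(1)) % 2 == 1:
        # Odd index: character data, so the implied location is before its first character.
        return path + ":0"
    return path
```

Paths that end in an even step, or that already carry an offset, are returned unchanged.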

natevw, I hope this addresses your concerns. Otherwise, let me know :)

Original comment by daniel.weck on 23 May 2013 at 5:55

GoogleCodeExporter commented 9 years ago

This sounds reasonable to me: a much simpler change, but one which still seems 
to capture the essence of the _two_ usages of a step reference.

Original comment by nat...@gmail.com on 23 May 2013 at 6:00

GoogleCodeExporter commented 9 years ago

Formal proposed solution and 72h objection window:

https://groups.google.com/forum/#!topic/epub-working-group/iusOUBjlFLY

------------

In "2.2 Syntax":
https://epub-revision.googlecode.com/svn/trunk/build/linking/cfi/epub-cfi.html#sec-epubcfi-syntax

=> Rewrite the [termstep] EBNF production rule as:

termstep = step | ( terminus , [ "[" , assertion , "]" ] );

Note: this adds [step] (i.e. reference to element or to interspersed chunk of 
XML character data) as a valid "terminating" item within a CFI expression.

In "3.1.4 Terminating Step – Character Offset (:)":
https://epub-revision.googlecode.com/svn/trunk/build/linking/cfi/epub-cfi.html#sec-path-terminating-char

=> Just before the last phrase "No other steps may follow a character offset 
terminating step", add this line:
"
CFI expressions should not be produced with a terminating /N step (i.e. no 
explicit character offset) where N is odd to refer to a chunk of XML character 
data interspersed amongst XML elements. However, CFI processors (e.g. Reading 
Systems) must be capable of consuming (i.e. parse + interpret / render) such 
CFI expressions, by assuming the implicit /N:0 character offset.
"

------------

Original comment by daniel.weck on 23 May 2013 at 6:50

Changed state: ProposedSolution

GoogleCodeExporter commented 9 years ago

One issue is that rewriting the "termstep" rule as proposed breaks the LL(1) 
nature of the grammar, which conflicts with issue 343.
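
One plausible reading of that conflict, sketched under the assumption that the enclosing rule has the simplified shape `{ step } , [ termstep ]` (as in the illustrative `term_path` earlier in this thread; the real grammar also has indirection). A parser with one token of lookahead must decide on each token whether to repeat `step` or to leave the loop and parse `termstep`, which is only possible if the two sets of leading tokens are disjoint. The names below are hypothetical.

```python
FIRST = {
    "step": {"/"},
    "terminus": {":", "@", "~"},
}

def first_of_termstep(step_is_alternative: bool) -> set:
    """Leading tokens of termstep, with or without `step` as an alternative."""
    tokens = set(FIRST["terminus"])
    if step_is_alternative:
        tokens |= FIRST["step"]
    return tokens

def lookahead_conflict(step_is_alternative: bool) -> set:
    """Tokens on which one-token lookahead cannot choose between loop and exit."""
    return FIRST["step"] & first_of_termstep(step_is_alternative)
```

With the existing rule the intersection is empty; once `step` becomes an alternative of `termstep`, "/" appears on both sides.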

In order to stay close to the english prose, I propose to instead rename the 
"termstep" production to "termstep_offset" (the production is really about 
offsets, the terminating step itself *always* starts with an integer).

About @natevw's concern of the dual nature of steps (i.e. navigational vs. 
terminating), the EBNF is coherent since it groups both under the generic name 
"step".

Additionally, I don't think that the prose "refers to a child element" in 
section 3.1.1 is wrong (as suggested in comment #7). A navigating step refers 
to an element as a way to navigate the XML tree, whereas a terminating element 
step refers to an element as a way to denote a location.
The prose (as revised in issue 301) might be extended with a paragraph that 
clarifies this distinction, but is otherwise correct IMHO.

Original comment by rdeltour@gmail.com on 23 May 2013 at 9:12

GoogleCodeExporter commented 9 years ago

Romain, I think that [step] in the EBNF is "navigational" and [termstep] is 
"terminal", at least I think it was the original intent, thus the corresponding 
prose in plain English that describes these CFI components. The problem is that 
[step] can effectively also be "terminal", in addition to "navigational". I am 
afraid I fail to see how your proposal helps solve this issue. Also, the term 
"offset" in "termstep_offset" doesn't really fit with 2D coordinates / spatial 
region.

Original comment by daniel.weck on 23 May 2013 at 9:33

GoogleCodeExporter commented 9 years ago

Expanding the step production rule into termstep does not break LL(1), and this 
actually better reflects the dual nature of /N:

termstep    =   ( ( "/" , integer ) | terminus ) , [ "[" , assertion , "]" ] ;   

Romain, see comment #9 for a breakdown of relevant prose bits.

Original comment by daniel.weck on 23 May 2013 at 9:48

GoogleCodeExporter commented 9 years ago

I believe the change proposed in comment #14 still breaks LL(1). Attached is 
the corresponding EBNF (in W3C syntax) that can be fed to REx [1].

I also think that the proposed change does not improve the mapping to the 
english prose. Consider this CFI:

epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/2/1:3[;s=b])

My understanding is that the terminating step is "1:3[;s=b]", i.e. the last 
step in the path sequence + the character offset.
If this is correct, this full step is not produced by the proposed 'termstep' 
production, only the offset is.

About comment #13 and my suggestion to rename the production "termstep_offset", 
this was based on the very headings of the CFI spec 3.1.4 to 3.1.7, which all 
include the word "offset". 2D coordinates / spatial regions are called "spatial 
offset" in the spec.

[1] http://www.bottlecaps.de/rex/

Original comment by rdeltour@gmail.com on 24 May 2013 at 6:54

Attachments:

cfi-ebnf-LL1-fixed.txt

GoogleCodeExporter commented 9 years ago

about "offset": ah, i stand corrected, although i don't think that a spatial 
region is semantically equivalent to an "offset", but that's a different issue 
to be filed separately.

regarding "terminating" steps: Romain, can you have a look at comment #6, i 
think that the <br/> example is pretty symptomatic of the fact that we need to 
cater for the dual nature of /N (both "navigational" and "terminal").

Original comment by daniel.weck on 24 May 2013 at 7:26

GoogleCodeExporter commented 9 years ago

Romain and I had a chat in order to discuss how to have a correct LL(1) EBNF, 
whilst using "plain english" non-ambiguous, consistent prose that reflects the 
terms used in the formal syntax. We concluded that minimal changes are needed 
in the grammar, and prose adjustments are required in quite a few places:

- The production name 'termstep' is a misnomer, because the 'step' production 
rule itself can "terminate" a CFI expression. Furthermore, the concept of 
"terminal symbol" has a different meaning in EBNF. As Romain suggested, let us 
use the term 'offset' instead (consistent with headings 3.1.4-7). Note: I 
personally think that the term 'offset' is not really suitable for "spatial 2D 
region", but I am not strongly opposed to it. Rename the 3.1.4-7 headings by 
removing "Terminating Step - ".

- Instead of trying to rename the 'terminus' production rule to something less 
likely to be misconstrued (remember, a /N 'step' can also "terminate" a CFI 
expression), I suggest we merge it into the renamed version of 'termstep' (now 
'offset'). This is now consistent with the structure of the 'step' production 
rule (optional 'assertion' at the end).

To summarise:

offset = ( ( ":" , integer ) | ( "@" , number , ":" , number ) | ( "~" , number 
, [ "@" , number , ":" , number ] ) ) , [ "[" , assertion , "]" ] ;  

See:
https://epub-revision.googlecode.com/svn/trunk/build/linking/cfi/epub-cfi.html#sec-epubcfi-syntax
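
For readers who prefer code to EBNF, the merged 'offset' production summarised above can be approximated with a regular expression. This is a sketch, not the spec's definition: the lexical form of `number` (digits with an optional fractional part) and of `assertion` (anything up to the closing bracket, with no escape handling) are assumptions, and the group names are hypothetical.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"  # assumed lexical form of `number`
OFFSET = re.compile(
    rf"""
    (?: :(?P<char>\d+)                                          # character offset
      | @(?P<x>{NUMBER}):(?P<y>{NUMBER})                        # spatial offset
      | ~(?P<t>{NUMBER})(?:@(?P<tx>{NUMBER}):(?P<ty>{NUMBER}))? # temporal, optionally spatial
    )
    (?:\[(?P<assertion>[^\]]*)\])?                              # optional trailing assertion
    """,
    re.VERBOSE,
)

def parse_offset(text: str):
    """Return the matched parts of an offset, or None if `text` is not one."""
    match = OFFSET.fullmatch(text)
    if match is None:
        return None
    return {key: value for key, value in match.groupdict().items() if value is not None}
```

A plain /N step is rejected here, which mirrors the point of the rename: 'offset' covers only what may follow the last step, not the step itself.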

- The term "navigational" is only used in one sentence (see "2.2 Syntax"), with 
no direct equivalence in the EBNF. Generally-speaking, I think that the term 
"traversal" (XML tree) is more appropriate, because "navigation" has another 
meaning in EPUB. In fact, the CFI introduction says "The functionality ... is 
varied: from reading location maintenance to annotation attachment to 
navigation". So, this whole sentence needs to be reworked:

"Steps can either be navigational or terminating. Navigational steps may be 
repeated as necessary (e.g., to count elements, to process children or to 
follow references). There may be only one terminating step, which, if present, 
must be the last step in the sequence."

I suggest:

"Steps are denoted by the '/' forward slash character, and are used to traverse 
XML content. The last step in a CFI path represents a location within a 
document, either structural (XML element), textual (character data), or 
aural-visual (image, audio, or video media). Such a terminating step may be 
complemented by an optional "offset", which denotes a particular character 
position, temporal or spatial fragment."

- In "3.1.4 Character Offset (:)":
https://epub-revision.googlecode.com/svn/trunk/build/linking/cfi/epub-cfi.html#sec-path-terminating-char

=> replace "A terminating step with a leading colon" with "A path terminating 
with a leading colon".

=> replace "A character offset terminating step may be present only following a 
/N step." with "A character offset may follow a /N step."

=> remove "No other steps may follow a character offset terminating step." 
(already expressed in "A path terminating with")

- In "3.1.5 Temporal Offset (~)":
https://epub-revision.googlecode.com/svn/trunk/build/linking/cfi/epub-cfi.html#sec-path-terminating-temporal

=> replace "A terminating step with a leading tilde" with "A path terminating 
with a leading tilde".

=> remove "No other steps can follow a temporal offset terminating step." 
(already expressed with the sentence above)

- In "3.1.6 Spatial Offset (@)"
https://epub-revision.googlecode.com/svn/trunk/build/linking/cfi/epub-cfi.html#sec-path-terminating-spatial

=> replace "A terminating step with a leading at sign" with "A path terminating 
with a leading 'at' sign".

=> remove "No other steps can follow a spatial offset terminating step." 
(already expressed with the sentence above)

- In "3.1.7 Temporal-Spatial Offset (~ + @)"
https://epub-revision.googlecode.com/svn/trunk/build/linking/cfi/epub-cfi.html#sec-path-terminating-tempspatial

=> remove "No other steps can follow a temporal-spatial position terminating 
step." (redundant, already previously expressed)

- In "3.1.8 Text Location Assertion ([)":
https://epub-revision.googlecode.com/svn/trunk/build/linking/cfi/epub-cfi.html#sec-path-text-location

=> replace "character offset terminating step." with "character offset."

- In "3.1.9 Side Bias ([ + ;s=)"
https://epub-revision.googlecode.com/svn/trunk/build/linking/cfi/epub-cfi.html#sec-path-side-bias

=> replace "Side is not defined for locations with spatial terminus." with 
"Side is not defined for locations with spatial offset."

That's it.

Original comment by daniel.weck on 24 May 2013 at 9:34

GoogleCodeExporter commented 9 years ago

Please see the complete formal proposal in the updated 72h message:

https://groups.google.com/d/msg/epub-working-group/iusOUBjlFLY/TAJKuvdhvMcJ

Original comment by daniel.weck on 24 May 2013 at 9:46

GoogleCodeExporter commented 9 years ago

72h clock ended. Matt, edits please! :)

Original comment by daniel.weck on 29 May 2013 at 4:48

GoogleCodeExporter commented 9 years ago

Specification has been updated:

https://code.google.com/p/epub-revision/source/detail?r=4652

Original comment by mgarrish on 30 May 2013 at 12:21

Changed state: FinalReview

GoogleCodeExporter commented 9 years ago

Original comment by daniel.weck on 30 May 2013 at 12:25

Changed state: Fixed

w3c / epub-specs

Grammar should include `step` as `terminus` option, require `termstep` in `local_path` #363