Clarification of CFI data model (DOM? XML InfoSet?)

GoogleCodeExporter commented 9 years ago

The discussion for issue #218 contains a side-topic about DOM vs. other XML 
data model(s).

https://groups.google.com/forum/#!topic/epub-working-group/MekA7PF-_Zw

(I plead guilty for encouraging this discussion in the first place!)

My suggestion:

------
Perhaps the CFI specification should clarify that the data model is XML InfoSet 
(just like XPointer does, although note that XPath defines its data model in 
terms of a tree of nodes, not XML InfoSet). 

http://www.w3.org/TR/xml-infoset/ 

...or at least reference XML 1.0 (5th edition), with certain conditions (e.g. 
entity references expanded by default, like in XML-InfoSet) 

http://www.w3.org/TR/REC-xml/ 

In my opinion, the current prose is at best ambiguous, at worst misleading: 

http://www.idpf.org/epub/linking/cfi/#sec-path-child-ref 

" 
This indexing method ensures that node identification is not sensitive to XML 
parser handling of whitespace text nodes, CDATA sections and entity references 
(e.g., to avoid the ambiguity that can arise depending on whether a parser 
collapses whitespace-only text nodes, keeps text, CDATA sections and entity 
references as distinct nodes or doesn't, or breaks text in multiple nodes). 
" 

I would also suggest adding something along the lines of: 

When calculating character offsets (odd indices) in CFI expressions: 
- Processing instructions and comments are ignored. 
- Contiguous text nodes (e.g. when a run of text is separated / divided by a 
comment) are preserved as individual nodes, there is no implicit merging of 
text nodes. 
- All whitespace is preserved, within runs of character data, and between 
elements. There is no collapsing of contiguous whitespaces. There is no 
trimming. 
- The content of CDATA sections is treated with the same whitespace rules as 
above. 
- Entities are expanded before CFI processing is applied. 
- etc. 
------
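
To make the whitespace, CDATA and entity bullets above concrete, here is a minimal sketch using Python's `xml.dom.minidom` as a stand-in parser. The helper name `character_data` and the sample markup are assumptions made up for illustration; nothing here comes from the CFI specification. It only handles an element without child elements, and simply shows that offsets counted over preserved character data differ from offsets counted after trimming and collapsing.

```python
from xml.dom import minidom
from xml.dom.minidom import Node


def character_data(parent):
    """Concatenate the character data directly inside `parent`, untouched.

    Text and CDATA section content is kept exactly as parsed: no trimming and
    no whitespace collapsing. Comments and processing instructions contribute
    nothing. Character references and predefined entity references have
    already been expanded by the parser by the time the DOM exists.
    """
    pieces = []
    for child in parent.childNodes:
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            pieces.append(child.data)
    return "".join(pieces)


# Hypothetical sample: indentation whitespace, a CDATA section, two references.
SOURCE = "<p>\n  one  two\n  <![CDATA[ <b> ]]>&amp;&#65;</p>"
chunk = character_data(minidom.parseString(SOURCE).documentElement)

# Offset of "two" counted over the preserved character data.
preserved_offset = chunk.index("two")

# The same lookup after trimming and collapsing whitespace, for contrast only.
collapsed_offset = " ".join(chunk.split()).index("two")

print(repr(chunk), preserved_offset, collapsed_offset)
```

Under the suggested rules, only the first offset would be usable in a CFI; a processor that collapsed whitespace before counting would point at a different character.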

Original issue reported on code.google.com by daniel.weck on 17 Apr 2013 at 7:52

GoogleCodeExporter commented 9 years ago

72h proposal:

https://groups.google.com/forum/#!topic/epub-working-group/pgBJLfVcozY

Joint proposal with issue 218.

https://code.google.com/p/epub-revision/issues/detail?id=218

Original comment by daniel.weck on 7 May 2013 at 10:37

Changed state: ProposedSolution

GoogleCodeExporter commented 9 years ago

Looks like this is the more appropriate ticket for half of my comment on the 
joint issue 218:

-1 on introducing the text "ignorable items may still have an impact on text 
nodes indices". Is there discussion on why this was necessary? This seems like 
a major change, and the opposite of what I would expect! So a comment 
introduces a gap in the numbering — i.e. text section at index E-1 and E+1 
but no node at E?! (Full disclosure: our implementation is currently 
incompatible with this, as it [in my opinion] "truly" ignores ignored nodes.)

Original comment by nat...@gmail.com on 7 May 2013 at 11:05

GoogleCodeExporter commented 9 years ago

Or in terms of the discussion here:
"Contiguous text nodes (e.g. when a run of text is separated / divided by a 
comment) are preserved as individual nodes, there is no implicit merging of 
text nodes."

I really think completely ignoring comments (and the like) is a much better 
approach, and pretty much assumed that was the case without clarification. If 
this point needs to be clarified, I'd much rather see it clarified that there 
IS implicit merging of text nodes.

Allowing ignored elements to affect the CFI like this makes the numbering 
unnecessarily fragile! Ignored nodes should NEVER affect step reference 
numbering.

A content author (or authoring tool) should be able to add an ignored element 
e.g. `<!-- FIXME: remove this stray word in the next edition -->` without 
breaking any of the current edition's existing CFI references.

Original comment by nat...@gmail.com on 7 May 2013 at 11:15

GoogleCodeExporter commented 9 years ago

Sorry if it seems I'm belabouring this point, but here's another example of why 
I think this is important. Beyond "text" and "element", my CFI implementation 
simply handles two other types: "transparent" (e.g. a non-canonical wrapper 
span added to style a highlight) and "ignored" (non-canonical nodes added for 
other display purposes, e.g. internal anchors, marginalia)

Making ignored elements within the "canonical" content significant would mean 
my implementation would need to handle two separate types of ignored elements: 
"runtime_ignored" and "canonical_ignored" — the former wouldn't affect the 
numbering while the latter would. And for what purpose? Would much rather let 
content authors add/remove HTML comments with as much impunity as they can 
modify other elements recognized in my runtime as ignored.

Original comment by nat...@gmail.com on 7 May 2013 at 11:34

GoogleCodeExporter commented 9 years ago

Hi natevw,

I am not proposing a new numbering mechanism, in fact I always assumed that 
interspersed XML comments effectively "split" contiguous text nodes into 
separate items. That is at least my experience with XML parsers and DOM APIs 
(nodeType==8). Are you suggesting that such physically "split" text nodes 
should be merged prior to applying CFI processing rules? If yes, then we should 
precisely define the merge heuristics in the specification, don't you think?

Dan

Original comment by daniel.weck on 7 May 2013 at 11:38

GoogleCodeExporter commented 9 years ago

Take a DOM, remove all nodes that are ignored.

Yes, you would get some split text nodes — but the spec already defines how 
to handle split text nodes in the general case.

Numbering differently based on whether ignored elements are present or not is 
tautologically incorrect, right?

Unless there's a good reason to make a major change like this (if so, I would 
recommend finding a new term to describe this class of nodes) then a CFI should 
apply identically to a DOM with its ignored nodes removed as it would to the 
original. That's the only reasonable way I see to understand "ignored" in the 
current spec.
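
That equivalence can be checked with a short sketch. It uses Python's `xml.dom.minidom` as a stand-in DOM; the helper names `step_table` and `strip_ignored` and the sample markup are hypothetical, chosen only for illustration, and the treatment of comments and processing instructions as "ignored" is the assumption being argued for here.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

TEXT_TYPES = (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
IGNORED_TYPES = (Node.COMMENT_NODE, Node.PROCESSING_INSTRUCTION_NODE)


def step_table(parent):
    """Map child step indices to ('text', data) or ('element', name).

    Elements get even indices (2, 4, ...). The text collected before, between
    and after elements gets odd indices (1, 3, ...), even when it is empty.
    Ignored nodes are skipped, so text on both sides of a comment lands in
    the same odd-indexed collection.
    """
    table = {}
    index = 1
    collected = ""
    for child in parent.childNodes:
        if child.nodeType == Node.ELEMENT_NODE:
            table[index] = ("text", collected)
            table[index + 1] = ("element", child.tagName)
            index += 2
            collected = ""
        elif child.nodeType in TEXT_TYPES:
            collected += child.data
    table[index] = ("text", collected)
    return table


def strip_ignored(parent):
    """Remove ignored direct children in place (no text-node merging)."""
    for child in list(parent.childNodes):
        if child.nodeType in IGNORED_TYPES:
            parent.removeChild(child)


MARKUP = "<p>Hello <!-- FIXME -->world<em>!</em> Bye</p>"
original = minidom.parseString(MARKUP).documentElement
stripped = minidom.parseString(MARKUP).documentElement
strip_ignored(stripped)

print(step_table(original) == step_table(stripped))
```

Both trees produce the same table, so a CFI built against one resolves identically against the other.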

Original comment by nat...@gmail.com on 7 May 2013 at 11:56

GoogleCodeExporter commented 9 years ago

Ah, of course, I was getting confused with the allocation of a *single* odd 
index to a (possibly empty) collection of several contiguous text nodes (once 
the ignorable items are trimmed). I misread both the question and the answer! 
Thanks for putting me in the right direction again. Dan

Original comment by daniel.weck on 8 May 2013 at 12:05

GoogleCodeExporter commented 9 years ago

natevw, could you please review the updated proposal? Thanks!

<<<<< 

A step with a slash (/) followed by a positive integer refers to either a child 
element, or to inter-element character data (possibly empty), as per the rules 
defined herein: 

* [XML] content other than element, PCDATA and CDATA is ignored (e.g. comments).

* Children [XML] elements are assigned even indices (i.e. starting at 2, 
followed by 4, etc.). 

* [XML] PCDATA and CDATA content is treated as "text node" in the context of 
this specification. Collections of contiguous text nodes (possibly empty) are 
assigned odd indices (i.e. starting at 1, followed by 3, etc.), and appear 
before the first child element, after the last child element, and in between 
children elements. Character data that corresponds to insignificant whitespace 
(typically used for markup formatting/indenting) is preserved. Entities are 
expanded to their corresponding textual representations.

* If the content of an element (excluding ignored content) starts with a 
non-empty collection of text nodes (i.e. PCDATA or CDATA), then 0 is a valid 
child index that refers to a non-existing element which virtually precedes the 
first chunk of character data. If the content of an element ends with a 
non-empty collection of text nodes, then n+2 is a valid child index that refers 
to a non-existing element which virtually follows the last chunk of character 
data, where n is the even index of the last child element, or 0 if there aren't 
any children elements.
[Informative note] The "virtual first / last elements" mechanism allows certain 
text ranges to be expressed in two different ways: relative to the common 
parent element and with given start-end character offsets, or alternatively in 
terms of the non-existing first / last elements. In practice, this may 
facilitate interoperability with instances of DOM Ranges that make use of 
"in-between node" locations. 

>>>>>

Original comment by daniel.weck on 8 May 2013 at 12:35

GoogleCodeExporter commented 9 years ago

I like daniel.weck's proposal.

Original comment by kball...@apple.com on 8 May 2013 at 3:48

GoogleCodeExporter commented 9 years ago

While avoiding the ignored issue, the revision seems to regress on issue #213: 
entity references are to PCDATA and thus may expand to element as well as text 
content! And speaking of PCDATA, why does it make an appearance here? If it's 
text that contains parseable elements, shouldn't it have already been parsed for 
our purposes?

Overall the new text seems much less clear than the original. There were really 
only two points I saw needing correction (entity references, numbering 
locations vs nodes) and the rest seemed a tight/clear explanation that was 
already robust against e.g. DTD-driven whitespace removal. [Now that I approach 
this I see one more, but issue #363 is a separate new can of worms that needn't 
block these fixes.]

Anyway, how does this sound for an iteration on the original? I've tried to 
maintain its original succinctness while making the "edge case" stuff a bit 
more apparent:

<<<<<
A step with a slash (/) followed by an integer refers to a child node or nodes 
in the following manner:

* Each (possibly empty) collection of text content nodes before the first 
element, between elements, and after the last element are given odd indices 
according to their position (these typically refer to the text of the 
Publication). For purposes of this specification, "text content" includes text 
and CDATA nodes (and entity references that expand wholly to same).

* Each element is assigned an even index according to its position: the first 
element is given index 2, the second element index 4, etc. The indexes 0 and 2 
plus the index of the last present element, while not resolvable to element 
nodes for further navigation, represent the location point before the first 
text node collection and after the last respectively. 

* Nodes that are neither elements nor text content are ignored and must have no 
effect on step numbering. Entity references must not be considered as distinct 
nodes, but participate in the preceding rules as they would after expansion to 
their represented text and/or element content.

This indexing method ensures that node identification is not sensitive to XML 
parser handling of whitespace text nodes, CDATA sections and entity references 
(e.g., to avoid the ambiguity that can arise depending on whether a parser 
collapses known-insignificant whitespace; keeps text, CDATA sections and entity 
references as distinct nodes or doesn't; or breaks text in multiple nodes).

Note that a path refers to a location point, yet may end in a step reference. 
Thus the last step in a path without a "terminating step" is not navigable: it 
may represent an empty collection of text content nodes or refer to an element 
position before or after a text content collection, and regardless represents 
the location between nodes and not a node (or collection of nodes) itself. This 
is consistent with other similar representations (e.g. boundary points in the 
DOM Selection and Range definitions) and facilitates .

Original comment by nat...@gmail.com on 8 May 2013 at 6:23

GoogleCodeExporter commented 9 years ago

Bah, sent out before finished. Really perhaps the final paragraph should be 
struck for now and deferred to issue #363.

Original comment by nat...@gmail.com on 8 May 2013 at 6:25

GoogleCodeExporter commented 9 years ago

Hi natevw,
I disagree that the new prose is less clear, but as this is subjective to a 
certain extent, allow me to focus on actual issues with your proposed prose:

- "child node or nodes" + "text content nodes" + "CDATA nodes" + "Nodes that 
are neither" + etc. => the term "node" is undefined in XML (note that we are 
not using XML InfoSet either, nor the XPath data model, nor DOM). This is why I 
updated the language in my proposal to use "element" and "character data" as 
per the XML content model (with an explicit restriction to PCDATA and CDATA 
sections, i.e. excluding comments, etc.).

- "includes text and CDATA nodes", "nor text content" => the term "text" refers 
to any markup construct in XML, so we must more accurately refer to "character 
data". This is why we explicitly refer to PCDATA, as well as CDATA sections.

- "are given odd indices according to their position" => should be "is given 
... to its position" (the subject is "each collection"). Even then, the concept 
of "position" is underspecified here. I revised the prose using the term 
"contiguous" to clarify the merge heuristics that result in a coherent 
"collection" unit, subsequently assigned an odd index.

- "not resolvable to element nodes for further navigation" => I am not sure 
what this means.

- "This indexing method ensures that node identification is not sensitive to 
XML parser handling of whitespace text nodes, CDATA sections and entity 
references (e.g., to avoid the ambiguity that can arise depending on whether a 
parser collapses known-insignificant whitespace; keeps text, CDATA sections and 
entity references as distinct nodes or doesn't; or breaks text in multiple 
nodes)."
=> from a specification point of view, this is not a satisfactory normative 
statement, and is even too loose to be a useful informative note. The content 
model that CFI relies upon is precisely defined by XML, and all CFI does is 
specify how contiguous inter-element character data (CDATA and PCDATA) gets 
collected into a single unit ("collection") for indexing purposes (including 
insignificant whitespace), and how entity references are treated when they 
expand to mixed-content, rather than character data.

PS: I think that the sentence "Entities are expanded to their corresponding 
textual representations" in my proposed prose is not technically correct, as we 
should talk about "entity references". Also, this needs to be reformulated so 
that the handling of non-character-data expansion is clarified. Thank you for 
pointing this out.

May I ask you to review my proposal once again now that I have explained the 
rationale for updating the prose like I did? Many thanks!

Regards, Daniel

Original comment by daniel.weck on 8 May 2013 at 7:23

GoogleCodeExporter commented 9 years ago

Okay. I have to admit that I didn't realize ePub is still XHTML based, I've 
been using "it" (really just this CFI spec) in the context of HTML5 and the 
original text matched closely enough with W3C DOM that I just proceeded there.

It was my intent to use "text content nodes" in a generic sense. I still don't 
see why PCDATA is included, unless I am misunderstanding what it is. Isn't 
PCDATA a part of the XML grammar that is used basically when one wishes to 
specify that there's simply "more XML data" within a node? When would an 
implementer at the CFI level ever end up with unparsed "parseable character 
data"? Shouldn't the contents of PCDATA participate in the step numbering, i.e. 
any elements within PCDATA affect the text?

Yes, didn't catch the mismatched grammar currently in the spec when revising. 
Overall, my intent was to leave things "underspecified" as you say; my concern 
was simply to fix the bugs while still preserving the section's original tone 
and (non-)precision. However I'd agree that tightening up this part of the spec 
is important. I don't think it was ambiguous in practice, but having this 
fully specified would be great.

This has more to do with issue #363: basically gently implying that the 0/2N+2 
numbering is only used in the last step of a local_path without trying to deal 
with that whole missing distinction.

I think this is where we are coming at this from different angles. To me it 
seems that CFI is best described and specified one level higher than the raw 
XML stream; the sense I originally took was that it is describing how to label 
the *tree* represented by the underlying serialization. (Thus my consternation 
at PCDATA even being mentioned.)

I really think attempting to standardize CFI without bringing in the "seven 
kinds of Node" vocabulary from the XPath Data Model seems a difficult 
proposition:
http://www.w3.org/TR/xpath-datamodel/#Node

Except for the PCDATA question, your proposal is technically correct as far as I 
can understand. But talking only in terms of raw serialized XML grammar is really 
a difficult path IMO; a specification in terms of a parsed tree of different 
kinds of nodes is a lot easier to reason about.

hth,
-nvw

Original comment by nat...@gmail.com on 8 May 2013 at 8:22

GoogleCodeExporter commented 9 years ago

As I said on the mailing list, I would really like to see some text here that 
says that, while 0/n+2 numbering is legal, conforming CFI implementations 
SHOULD NOT construct a CFI that uses these unless there is a non-empty 
collection of text nodes at the start of the element (for index 0) or at the 
end of the element (for index n+2).

Original comment by kball...@apple.com on 8 May 2013 at 8:30

GoogleCodeExporter commented 9 years ago

Thanks for your reply natevw.

Regarding:

"I really think attempting to standardize CFI without bringing in the "seven 
kinds of Node" vocabulary from the XPath Data Model seems a difficult 
proposition:"

=> There are several content model definitions potentially suitable for CFI, 
such as XML InfoSet, DOM, XPath data model, canonical XML, etc. The XML 
specification itself may indeed be the lowest common denominator (so to speak), 
but it affords all the constructs we need in order to disambiguate the CFI 
expression model (e.g. element + character data). In practice, implementors are 
likely to rely on the DOM API (i.e. via JavaScript from a web-browser engine) 
so all is good as long as they can non-ambiguously map user selections 
(locations or ranges) to underlying CFI expressions, and vice-versa to some 
extent (lossless roundtrip may not be possible). I think that my attempt to 
clarify the normative prose goes in the right direction, but of course the 
proposal is very much open for discussion.

Regarding PCDATA - this term just means non-element "text" interspersed amongst 
children elements (intermingled sibling content), i.e. "mixed content":

http://www.w3.org/TR/xml/#sec-mixed-content

For the sake of clarity, perhaps we should refrain from using the term #PCDATA, 
and instead refer to the more formal definition of XML character data, and with 
an additional note: "... including the content of CDATA sections, expanded 
character references (i.e. included replacement text), and character data from 
within expanded entity references".

See:

http://www.w3.org/TR/xml/#syntax

http://www.w3.org/TR/xml/#sec-cdata-sect

Original comment by daniel.weck on 8 May 2013 at 9:47

GoogleCodeExporter commented 9 years ago

kballard,
why not use a MUST NOT conformance requirement to define the expected behaviour 
of CFI processors (e.g. reading systems / production tools) with regards to the 
handling (i.e. producing / consuming) of step expressions containing 0/n+2 
indices?
can you explain the practical benefits of SHOULD NOT? (relaxed validation?)

Original comment by daniel.weck on 8 May 2013 at 10:00

GoogleCodeExporter commented 9 years ago

Consuming step expressions containing 0/n+2 indices is obviously legal. This 
must be true even if there is no non-empty collection of text nodes at indices 
1 or n+1, because the CFI could have been created from a different version of 
the epub.

Since consuming the step expression must be legal, I don't think it's 
appropriate to make production of such an expression illegal.

That said, my objection against declaring it as MUST NOT is not strong. I would 
rather have it say MUST NOT than not say it at all.

Original comment by kball...@apple.com on 8 May 2013 at 11:22

GoogleCodeExporter commented 9 years ago

Further corrections based on feedback received so far:
- fixed prose about character/entity references, to use proper XML terminology 
(expansion / inclusion of replacement text)
- removed term "PCDATA" in favour of "character data" (which is more formally 
defined in the XML specification)
- re-wording of how character data is logically organised and indexed, in order 
to avoid the terms "collection of text nodes" which cannot easily be mapped to 
the XML data model

Please review the updated proposal:

<<<<< 

A step with a slash (/) followed by a positive integer refers to either a child 
element or a chunk of character data, as per the rules defined herein: 

* [XML] content other than element and character data is ignored. Note that as 
per the XML specification, character data inside CDATA sections is included, 
and conversely, XML comments are ignored.

* [XML] character data that corresponds to insignificant whitespace (typically 
used for markup formatting/indenting) is preserved. Character and entity 
references are considered expanded, and character data is obtained from the 
"included replacement text" (as per the XML terminology).

* [XML] character data that is interspersed amongst sibling children elements 
(i.e. "mixed content" context) is logically organised into (potentially-empty) 
chunks of contiguous character data: the first chunk is located before the 
first child element (left sibling), the last chunk is located after the last 
child element (right sibling), and there is one chunk between each pair of 
children elements. When there are no children elements, there is one 
(potentially-empty) chunk of character data. Consecutive (potentially-empty) 
chunks of character data are assigned odd indices (i.e. starting at 1, followed 
by 3, etc.).

* Children [XML] elements are assigned even indices (i.e. starting at 2, 
followed by 4, etc.). Additionally, if the content of an element (excluding 
ignored content) starts with a non-empty chunk of character data, then 0 is a 
valid index that refers to a non-existing element which virtually precedes the 
first chunk of character data (left sibling). If the content of an element ends 
with a non-empty chunk of character data, then n+2 is a valid index that refers 
to a non-existing element which virtually follows the last chunk of character 
data (right sibling), where n is the even index of the last child element, or 0 
if there are no children elements.
[Informative note] The "virtual first / last elements" mechanism may facilitate 
interoperability with certain instances of DOM Ranges, where non-existing 
elements are used to span across textual content without resorting to character 
offsets at the start/end boundaries.

>>>>>

Original comment by daniel.weck on 8 May 2013 at 11:56

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

kballard,

the current prose is:

"if [condition] then 0/n+2 is a valid index"

=> corollary: 0/n+2 are invalid indices when the condition is not met 
(condition = non-emptiness of character data chunk).

Does that work for you?
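
For reference, the condition and its corollary can be sketched as follows. This is only an illustration built on Python's `xml.dom.minidom`; the function name `valid_child_indices` is hypothetical and the code assumes the proposed rules as worded above (odd indices for possibly-empty chunks, even indices for elements, 0 and n+2 valid only next to a non-empty chunk).

```python
from xml.dom import minidom
from xml.dom.minidom import Node

TEXT_TYPES = (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)


def valid_child_indices(parent):
    """Return the set of step indices that are valid under `parent`.

    Chunks of character data sit before, between and after child elements,
    so there is always exactly one more chunk than there are elements.
    Comments and processing instructions are ignored.
    """
    chunks = [""]
    element_count = 0
    for child in parent.childNodes:
        if child.nodeType == Node.ELEMENT_NODE:
            element_count += 1
            chunks.append("")
        elif child.nodeType in TEXT_TYPES:
            chunks[-1] += child.data

    n = 2 * element_count          # even index of the last child element, or 0
    indices = set(range(1, n + 2))  # 1 .. n+1: every chunk and every element
    if chunks[0]:
        indices.add(0)              # virtual element before a non-empty first chunk
    if chunks[-1]:
        indices.add(n + 2)          # virtual element after a non-empty last chunk
    return indices


def indices_for(markup):
    """Convenience wrapper: parse `markup` and inspect its root element."""
    return valid_child_indices(minidom.parseString(markup).documentElement)


print(sorted(indices_for("<p>a<b/>c</p>")))
print(sorted(indices_for("<p><!-- c --><b/>tail</p>")))
```

In the second sample the comment is ignored, so the first chunk is empty and 0 is not a valid index, while the trailing text makes n+2 valid.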

Original comment by daniel.weck on 9 May 2013 at 12:05

GoogleCodeExporter commented 9 years ago

daniel.weck,

I think that's fine.

Original comment by kball...@apple.com on 9 May 2013 at 12:05

GoogleCodeExporter commented 9 years ago

Looks generally correct to me, although still doesn't make entirely clear that 
entities may expand to elements and not just character data (issue #213).
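
To illustrate the case I mean, here is a rough sketch with Python's `xml.dom.minidom`, which (as far as I know) expands internal entities while parsing. The entity name `mixed`, the sample document and the helper `steps` are all made up for illustration. The entity's replacement text contains an element, so the expansion contributes to two chunks of character data and one even-indexed element.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

# Hypothetical document: the internal entity expands to mixed content.
SOURCE = """<!DOCTYPE p [
  <!ENTITY mixed "x<b>y</b>z">
]>
<p>a&mixed;c</p>"""

TEXT_TYPES = (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)


def steps(parent):
    """List (index, kind, value) for the children of `parent`.

    Character data is gathered into chunks with odd indices, elements get
    even indices, and anything else (comments, processing instructions) is
    skipped. The entity reference never shows up as a node of its own here
    because the parser has already replaced it with its content.
    """
    result = []
    index = 1
    chunk = ""
    for child in parent.childNodes:
        if child.nodeType == Node.ELEMENT_NODE:
            result.append((index, "text", chunk))
            result.append((index + 1, "element", child.tagName))
            index += 2
            chunk = ""
        elif child.nodeType in TEXT_TYPES:
            chunk += child.data
    result.append((index, "text", chunk))
    return result


result = steps(minidom.parseString(SOURCE).documentElement)
for entry in result:
    print(entry)
```

The literal "a" and the leading "x" from the expansion end up in the same chunk, which is the behaviour the prose should make unmistakable.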

Original comment by nat...@gmail.com on 9 May 2013 at 12:13

GoogleCodeExporter commented 9 years ago

natevw,
the XML definition of "replacement text" covers markup, not just character 
data. So this seems correct to me:
<< character data is obtained from the "included replacement text" >>

Original comment by daniel.weck on 9 May 2013 at 12:30

GoogleCodeExporter commented 9 years ago

Cool, thanks.

Original comment by nat...@gmail.com on 9 May 2013 at 12:33

GoogleCodeExporter commented 9 years ago

Fresh 72h comment window:

https://groups.google.com/forum/#!topic/epub-working-group/HC_hS7ae6mo

Original comment by daniel.weck on 14 May 2013 at 4:55

GoogleCodeExporter commented 9 years ago

Specification has been updated:

https://code.google.com/p/epub-revision/source/detail?r=4650

Please review and close this issue if correct.

Original comment by mgarrish on 24 May 2013 at 3:05

Changed state: FinalReview

GoogleCodeExporter commented 9 years ago

Reverting to proposed solution, as I jumped the gun on this update...

Original comment by mgarrish on 24 May 2013 at 3:42

Changed state: ProposedSolution

GoogleCodeExporter commented 9 years ago

Original comment by daniel.weck on 29 May 2013 at 4:44

Changed state: Verified

w3c / epub-specs

Clarification of CFI data model (DOM? XML InfoSet?) #351