Closed GoogleCodeExporter closed 9 years ago
72h proposal:
https://groups.google.com/forum/#!topic/epub-working-group/pgBJLfVcozY
Joint proposal with issue 218.
https://code.google.com/p/epub-revision/issues/detail?id=218
Original comment by daniel.weck
on 7 May 2013 at 10:37
Looks like this is the more appropriate ticket for half of my comment on the
joint issue 218:
-1 on introducing the text "ignorable items may still have an impact on text
nodes indices". Is there discussion on why this was necessary? This seems like
a major change, and the opposite of what I would expect! So a comment
introduces a gap in the numbering — i.e. text section at index E-1 and E+1
but no node at E?! (Full disclosure: our implementation is currently
incompatible with this, as it [in my opinion] "truly" ignores ignored nodes.)
Original comment by nat...@gmail.com
on 7 May 2013 at 11:05
Or in terms of the discussion here:
"Contiguous text nodes (e.g. when a run of text is separated / divided by a
comment) are preserved as individual nodes, there is no implicit merging of
text nodes."
I really think completely ignoring comments (and the like) is a much better
approach, and pretty much assumed that was the case without clarification. If
this point needs to be clarified, I'd much rather see it clarified that there
IS implicit merging of text nodes.
Allowing ignored elements to affect the CFI like this makes the numbering
unnecessarily fragile! Ignored nodes should NEVER affect step reference
numbering.
A content author (or authoring tool) should be able to add an ignored element
e.g. `<!-- FIXME: remove this stray word in the next edition -->` without
breaking any of the current edition's existing CFI references.
Original comment by nat...@gmail.com
on 7 May 2013 at 11:15
Sorry if it seems I'm belabouring this point, but here's another example of why
I think this is important. Beyond "text" and "element", my CFI implementation
simply handles two other types: "transparent" (e.g. a non-canonical wrapper
span added to style a highlight) and "ignored" (non-canonical nodes added for
other display purposes, e.g. internal anchors, marginalia)
Making ignored elements within the "canonical" content significant, would mean
my implementation would need to handle two separate types of ignored elements:
"runtime_ignored" and "canonical_ignored" — the former wouldn't affect the
numbering while the latter would. And for what purpose? Would much rather let
content authors add/remove HTML comments with as much impunity as they can
modify other elements recognized in my runtime as ignored.
Original comment by nat...@gmail.com
on 7 May 2013 at 11:34
Hi natevw,
I am not proposing a new numbering mechanism, in fact I always assumed that
interspersed XML comments effectively "split" contiguous text nodes into
separate items. That is at least my experience with XML parsers and DOM APIs
(nodeType==8). Are you suggesting that such physically "split" text nodes
should be merged prior to applying CFI processing rules? If yes, then we should
precisely define the merge heuristics in the specification, don't you think?
Dan
Original comment by daniel.weck
on 7 May 2013 at 11:38
Take a DOM, remove all nodes that are ignored.
Yes, you would get some split text nodes — but the spec already defines how
to handle split text nodes in the general case.
Numbering differently based on whether ignored elements are present or not is
tautologically incorrect, right?
Unless there's a good reason to make a major change like this (if so, I would
recommend finding a new term to describe this class of nodes) then a CFI should
apply identically to a DOM with its ignored nodes removed as it would to the
original. That's the only reasonable way I see to understand "ignored" in the
current spec.
Original comment by nat...@gmail.com
on 7 May 2013 at 11:56
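To make the "remove all ignored nodes first" idea above concrete, here is a minimal sketch using Python's standard xml.dom.minidom. It is only an illustration, not code from any implementation mentioned in this thread; the helper names strip_ignored and shape are hypothetical. It removes comments and other non-element, non-text nodes, merges the text nodes left adjacent, and shows that a document with an added comment ends up with the same child structure as one without it.

```python
from xml.dom import minidom, Node

# Node types that take part in step numbering in this sketch (an assumption
# based on the discussion: elements, text and CDATA sections only).
KEPT = (Node.ELEMENT_NODE, Node.TEXT_NODE, Node.CDATA_SECTION_NODE)

def strip_ignored(parent):
    """Remove comments, processing instructions, etc. below `parent`,
    then merge the text nodes that the removal left adjacent."""
    for child in list(parent.childNodes):
        if child.nodeType not in KEPT:
            parent.removeChild(child)
        elif child.nodeType == Node.ELEMENT_NODE:
            strip_ignored(child)
    parent.normalize()  # merges adjacent Text nodes
    return parent

def shape(parent):
    """Summarise direct children as tag names or text strings."""
    return [c.tagName if c.nodeType == Node.ELEMENT_NODE else c.data
            for c in parent.childNodes]

plain = minidom.parseString("<p>one two<b>x</b>three</p>").documentElement
noted = minidom.parseString(
    "<p>one <!-- FIXME -->two<b>x</b>three</p>").documentElement

# After stripping, the comment no longer splits the leading text.
same = shape(strip_ignored(noted)) == shape(strip_ignored(plain))
```

With this approach, any numbering computed on the stripped tree is the same whether or not the comment was present in the source.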
Ah, of course, I was getting confused with the allocation of a *single* odd
index to a (possibly empty) collection of several contiguous text nodes (once
the ignorable items are trimmed). I misread both the question and the answer!
Thanks for putting me in the right direction again. Dan
Original comment by daniel.weck
on 8 May 2013 at 12:05
natevw, could you please review the updated proposal? Thanks!
<<<<<
A step with a slash (/) followed by a positive integer refers to either a child
element, or to inter-element character data (possibly empty), as per the rules
defined herein:
* [XML] content other than element, PCDATA and CDATA is ignored (e.g. comments).
* Children [XML] elements are assigned even indices (i.e. starting at 2,
followed by 4, etc.).
* [XML] PCDATA and CDATA content is treated as "text node" in the context of
this specification. Collections of contiguous text nodes (possibly empty) are
assigned odd indices (i.e. starting at 1, followed by 3, etc.), and appear
before the first child element, after the last child element, and in between
children elements. Character data that corresponds to insignificant whitespace
(typically used for markup formatting/indenting) is preserved. Entities are
expanded to their corresponding textual representations.
* If the content of an element (excluding ignored content) starts with a
non-empty collection of text nodes (i.e. PCDATA or CDATA), then 0 is a valid
child index that refers to a non-existing element which virtually precedes the
first chunk of character data. If the content of an element ends with a
non-empty collection of text nodes, then n+2 is a valid child index that refers
to a non-existing element which virtually follows the last chunk of character
data, where n is the even index of the last child element, or 0 if there aren't
any children elements.
[Informative note] The "virtual first / last elements" mechanism allows certain
text ranges to be expressed in two different ways: relative to the common
parent element and with given start-end character offsets, or alternatively in
terms of the non-existing first / last elements. In practice, this may
facilitate interoperability with instances of DOM Ranges that make use of
"in-between node" locations.
>>>>>
Original comment by daniel.weck
on 8 May 2013 at 12:35
I like daniel.weck's proposal.
Original comment by kball...@apple.com
on 8 May 2013 at 3:48
While avoiding the ignored issue, the revision seems to regress on issue #213:
entity references are to PCDATA and thus may expand to element as well as text
content! And speaking of PCDATA, why does it make an appearance here? If it's
text that contains parseable elements, shouldn't it have already been parsed for
our purposes?
Overall the new text seems much less clear than the original. There were really
only two points I saw needing correction (entity references, numbering
locations vs nodes) and the rest seemed a tight/clear explanation that was
already robust against e.g. DTD-driven whitespace removal. [Now that I approach
this I see one more, but issue #363 is a separate new can of worms that needn't
block these fixes.]
Anyway, how does this sound for an iteration on the original? I've tried to
maintain its original succinctness while making the "edge case" stuff a bit
more apparent:
<<<<<
A step with a slash (/) followed by an integer refers to a child node or nodes
in the following manner:
* Each (possibly empty) collection of text content nodes before the first
element, between elements, and after the last element are given odd indices
according to their position (these typically refer to the text of the
Publication). For purposes of this specification, "text content" includes text
and CDATA nodes (and entity references that expand wholly to same).
* Each element is assigned an even index according to its position: the first
element is given index 2, the second element index 4, etc. The indexes 0 and 2
plus the index of the last present element, while not resolvable to element
nodes for further navigation, represent the location point before the first
text node collection and after the last respectively.
* Nodes that are neither elements nor text content are ignored and must have no
effect on step numbering. Entity references must not be considered as distinct
nodes, but participate in the preceding rules as they would after expansion to
their represented text and/or element content.
This indexing method ensures that node identification is not sensitive to XML
parser handling of whitespace text nodes, CDATA sections and entity references
(e.g., to avoid the ambiguity that can arise depending on whether a parser
collapses known-insignificant whitespace; keeps text, CDATA sections and entity
references as distinct nodes or doesn't; or breaks text in multiple nodes).
Note that a path refers to a location point, yet may end in a step reference.
Thus the last step in a path without a "terminating step" is not navigable: it
may represent an empty collection of text content nodes or refer to an element
position before or after a text content collection, and regardless represents
the location between nodes and not a node (or collection of nodes) itself. This
is consistent with other similar representations (e.g. boundary points in the
DOM Selection and Range definitions) and facilitates .
Original comment by nat...@gmail.com
on 8 May 2013 at 6:23
bah, sent out before finished. Really perhaps the final paragraph should be
struck for now and deferred to issue #363.
Original comment by nat...@gmail.com
on 8 May 2013 at 6:25
Hi natevw,
I disagree that the new prose is less clear, but as this is subjective to a
certain extent, allow me to focus on actual issues with your proposed prose:
- "child node or nodes" + "text content nodes" + "CDATA nodes" + "Nodes that
are neither" + etc. => the term "node" is undefined in XML (note that we are
not using the XML Infoset either, nor the XPath data model, nor DOM). This is why I
updated the language in my proposal to use "element" and "character data" as
per the XML content model (with an explicit restriction to PCDATA and CDATA
sections, i.e. excluding comments, etc.).
- "includes text and CDATA nodes", "nor text content" => the term "text" refers
to any markup construct in XML, so we must more accurately refer to "character
data". This is why we explicitly refer to PCDATA, as well as CDATA sections.
- "are given odd indices according to their position" => should be "is given
... to its position" (the subject is "each collection"). Even then, the concept
of "position" is underspecified here. I revised the prose using the term
"contiguous" to clarify the merge heuristics that result in a coherent
"collection" unit, subsequently assigned an odd index.
- "not resolvable to element nodes for further navigation" => I am not sure
what this means.
- "This indexing method ensures that node identification is not sensitive to
XML parser handling of whitespace text nodes, CDATA sections and entity
references (e.g., to avoid the ambiguity that can arise depending on whether a
parser collapses known-insignificant whitespace; keeps text, CDATA sections and
entity references as distinct nodes or doesn't; or breaks text in multiple
nodes)."
=> from a specification point of view, this is not a satisfactory normative
statement, and is even too loose to be a useful informative note. The content
model that CFI relies upon is precisely defined by XML, and all CFI does is
specify how contiguous inter-element character data (CDATA and PCDATA) gets
collected into a single unit ("collection") for indexing purposes (including
insignificant whitespace), and how entity references are treated when they
expand to mixed-content, rather than character data.
PS: I think that the sentence "Entities are expanded to their corresponding
textual representations" in my proposed prose is not technically correct, as we
should talk about "entity references". Also, this needs to be reformulated so
that the handling of non-character-data expansion is clarified. Thank you for
pointing this out.
May I ask you to review my proposal once again now that I have explained the
rationale for updating the prose like I did? Many thanks!
Regards, Daniel
Original comment by daniel.weck
on 8 May 2013 at 7:23
Okay. I have to admit that I didn't realize ePub is still XHTML based, I've
been using "it" (really just this CFI spec) in the context of HTML5 and the
original text matched closely enough with W3C DOM that I just proceeded there.
It was my intent to use "text content nodes" in a generic sense. I still don't
see why PCDATA is included, unless I am misunderstanding what it is. Isn't
PCDATA a part of the XML grammar that is used basically when one wishes to
specify that there's simply "more XML data" within a node? When would an
implementer at the CFI level ever end up with unparsed "parseable character
data"? Shouldn't the contents of PCDATA participate in the step numbering, i.e.
any elements within PCDATA affect the text?
Yes, didn't catch the mismatched grammar currently in the spec when revising.
Overall, my intent was to leave things "underspecified" as you say; my concern
was simply to fix the bugs while still preserving the section's original tone
and (non-)precision. However I'd agree that tightening up this part of the spec
is important. I don't think it was ambiguous in practice, but having this
fully specified would be great.
This has more to do with issue #363: basically gently implying that the 0/2N+2
numbering is only used in the last step of a local_path without trying to deal
with that whole missing distinction.
I think this is where we are coming at this from different angles. To me it
seems that CFI is best described and specified one level higher than the raw
XML stream; the sense I originally took was that it is describing how to label
the *tree* represented by the underlying serialization. (Thus my consternation
at PCDATA even being mentioned.)
I really think attempting to standardize CFI without bringing in the "seven
kinds of Node" vocabulary from the XPath Data Model seems a difficult
proposition:
http://www.w3.org/TR/xpath-datamodel/#Node
Except for the PCDATA question, your proposal is technically correct as far as
I can understand. But talking only in terms of raw serialized XML grammar is
really a difficult path IMO; a specification in terms of a parsed tree of
different kinds of nodes is a lot easier to reason about.
hth,
-nvw
Original comment by nat...@gmail.com
on 8 May 2013 at 8:22
As I said on the mailing list, I would really like to see some text here that
says that, while 0/n+2 numbering is legal, conforming CFI implementations
SHOULD NOT construct a CFI that uses these unless there is a non-empty
collection of text nodes at the start of the element (for index 0) or at the
end of the element (for index n+2).
Original comment by kball...@apple.com
on 8 May 2013 at 8:30
Thanks for your reply natevw.
Regarding:
"I really think attempting to standardize CFI without bringing in the "seven
kinds of Node" vocabulary from the XPath Data Model seems a difficult
proposition:"
=> There are several content model definitions potentially suitable for CFI,
such as XML Infoset, DOM, XPath data model, canonical XML, etc. The XML
specification itself may indeed be the lowest common denominator (so to speak),
but it affords all the constructs we need in order to disambiguate the CFI
expression model (e.g. element + character data). In practice, implementors are
likely to rely on the DOM API (i.e. via JavaScript from a web-browser engine)
so all is good as long as they can unambiguously map user selections
(locations or ranges) to underlying CFI expressions, and vice-versa to some
extent (lossless roundtrip may not be possible). I think that my attempt to
clarify the normative prose goes in the right direction, but of course the
proposal is very much open for discussion.
Regarding PCDATA - this term just means non-element "text" interspersed amongst
children elements (intermingled sibling content), i.e. "mixed content":
http://www.w3.org/TR/xml/#sec-mixed-content
For the sake of clarity, perhaps we should refrain from using the term #PCDATA,
and instead refer to the more formal definition of XML character data, and with
an additional note: "... including the content of CDATA sections, expanded
character references (i.e. included replacement text), and character data from
within expanded entity references".
See:
http://www.w3.org/TR/xml/#syntax
http://www.w3.org/TR/xml/#sec-cdata-sect
Original comment by daniel.weck
on 8 May 2013 at 9:47
kballard,
why not use a MUST NOT conformance requirement to define the expected behaviour
of CFI processors (e.g. reading systems / production tools) with regards to the
handling (i.e. producing / consuming) of step expressions containing 0/n+2
indices?
can you explain the practical benefits of SHOULD NOT? (relaxed validation?)
Original comment by daniel.weck
on 8 May 2013 at 10:00
Consuming step expressions containing 0/n+2 indices is obviously legal. This
must be true even if there is no non-empty collection of text nodes at indices
1 or n+1, because the CFI could have been created from a different version of
the epub.
Since consuming the step expression must be legal, I don't think it's
appropriate to make production of such an expression illegal.
That said, my objection against declaring it as MUST NOT is not strong. I would
rather have it say MUST NOT than not say it at all.
Original comment by kball...@apple.com
on 8 May 2013 at 11:22
Further corrections based on feedback received so far:
- fixed prose about character/entity references, to use proper XML terminology
(expansion / inclusion of replacement text)
- removed term "PCDATA" in favour of "character data" (which is more formally
defined in the XML specification)
- re-wording of how character data is logically organised and indexed, in order
to avoid the terms "collection of text nodes" which cannot easily be mapped to
the XML data model
Please review the updated proposal:
<<<<<
A step with a slash (/) followed by a positive integer refers to either a child
element or a chunk of character data, as per the rules defined herein:
* [XML] content other than element and character data is ignored. Note that as
per the XML specification, character data inside CDATA sections is included,
and conversely, XML comments are ignored.
* [XML] character data that corresponds to insignificant whitespace (typically
used for markup formatting/indenting) is preserved. Character and entity
references are considered expanded, and character data is obtained from the
"included replacement text" (as per the XML terminology).
* [XML] character data that is interspersed amongst sibling children elements
(i.e. "mixed content" context) is logically organised into (potentially-empty)
chunks of contiguous character data: the first chunk is located before the
first child element (left sibling), the last chunk is located after the last
child element (right sibling), and there is one chunk between each pair of
children elements. When there are no children elements, there is one
(potentially-empty) chunk of character data. Consecutive (potentially-empty)
chunks of character data are assigned odd indices (i.e. starting at 1, followed
by 3, etc.).
* Children [XML] elements are assigned even indices (i.e. starting at 2,
followed by 4, etc.). Additionally, if the content of an element (excluding
ignored content) starts with a non-empty chunk of character data, then 0 is a
valid index that refers to a non-existing element which virtually precedes the
first chunk of character data (left sibling). If the content of an element ends
with a non-empty chunk of character data, then n+2 is a valid index that refers
to a non-existing element which virtually follows the last chunk of character
data (right sibling), where n is the even index of the last child element, or 0
if there are no children elements.
[Informative note] The "virtual first / last elements" mechanism may facilitate
interoperability with certain instances of DOM Ranges, where non-existing
elements are used to span across textual content without resorting to character
offsets at the start/end boundaries.
>>>>>
Original comment by daniel.weck
on 8 May 2013 at 11:56
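The indexing rules in the proposal above can be sketched roughly as follows. This is only an illustration under stated assumptions, not normative text: it uses Python's xml.dom.minidom as a stand-in for "XML content", the function name index_content is hypothetical, and it covers only the odd/even assignment (not the virtual 0 and n+2 indices). Comments are skipped, text and CDATA content between elements is concatenated into one chunk, and empty chunks still receive an odd index.

```python
from xml.dom import minidom, Node

def index_content(parent):
    """Map each step index to a chunk of character data (odd index)
    or to a child element name (even index), ignoring all other content."""
    table = {}
    chunk = ""   # character data collected since the previous element
    odd = 1      # odd index of the chunk currently being collected
    for node in parent.childNodes:
        if node.nodeType == Node.ELEMENT_NODE:
            table[odd] = chunk             # chunk before this element
            table[odd + 1] = node.tagName  # the element itself
            chunk = ""
            odd += 2
        elif node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            chunk += node.data
        # comments, processing instructions, etc. are ignored
    table[odd] = chunk                     # chunk after the last element
    return table

doc = minidom.parseString(
    "<p><i>a</i><!-- note --><b>b</b> tail<![CDATA[ & more]]></p>")
table = index_content(doc.documentElement)
# Indices 1 and 3 are empty chunks; the comment between <i> and <b>
# does not create a gap or an extra index.
```

An element with no child elements yields a single entry at index 1, matching the "one (potentially-empty) chunk" case in the proposal.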
kballard,
the current prose is:
"if [condition] then 0/n+2 is a valid index"
=> corollary: 0/n+2 are invalid indices when the condition is not met
(condition = non-emptiness of character data chunk).
Does that work for you?
Original comment by daniel.weck
on 9 May 2013 at 12:05
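The condition and its corollary above can be expressed as a small check. This is a sketch only: the function name valid_virtual_indices and the list-of-chunks representation are assumptions made for illustration. Whitespace-only chunks count as non-empty here, on the assumption that insignificant whitespace is preserved as the proposal states.

```python
def valid_virtual_indices(chunks):
    """`chunks` lists the character-data chunks of one element in document
    order (always one more chunk than there are child elements).
    Returns the virtual element indices (0 and/or n+2) the condition permits."""
    n = 2 * (len(chunks) - 1)  # even index of the last child element, or 0
    valid = set()
    if chunks[0] != "":        # content starts with non-empty character data
        valid.add(0)
    if chunks[-1] != "":       # content ends with non-empty character data
        valid.add(n + 2)
    return valid

# Two child elements with text before the first and after the last:
both = valid_virtual_indices(["intro", "", "tail"])
# Two child elements with no leading or trailing character data:
neither = valid_virtual_indices(["", "", ""])
```

Under this reading, any virtual index not in the returned set is invalid for that element.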
daniel.weck,
I think that's fine.
Original comment by kball...@apple.com
on 9 May 2013 at 12:05
Looks generally correct to me, although still doesn't make entirely clear that
entities may expand to elements and not just character data (issue #213).
Original comment by nat...@gmail.com
on 9 May 2013 at 12:13
natevw,
the XML definition of "replacement text" covers markup, not just character
data. So this seems correct to me:
<< character data is obtained from the "included replacement text" >>
Original comment by daniel.weck
on 9 May 2013 at 12:30
Cool, thanks.
Original comment by nat...@gmail.com
on 9 May 2013 at 12:33
Fresh 72h comment window:
https://groups.google.com/forum/#!topic/epub-working-group/HC_hS7ae6mo
Original comment by daniel.weck
on 14 May 2013 at 4:55
Specification has been updated:
https://code.google.com/p/epub-revision/source/detail?r=4650
Please review and close this issue if correct.
Original comment by mgarrish
on 24 May 2013 at 3:05
Reverting to proposed solution, as I jumped the gun on this update...
Original comment by mgarrish
on 24 May 2013 at 3:42
Original comment by daniel.weck
on 29 May 2013 at 4:44
Original issue reported on code.google.com by
daniel.weck
on 17 Apr 2013 at 7:52