w3c / web-annotation

Web Annotation Working Group repository, see README for links to specs
https://w3c.github.io/web-annotation/

Reference to text encoding in spec perhaps not appropriate #227

Closed nickstenning closed 8 years ago

nickstenning commented 8 years ago

#222 made me aware of the following text in the model spec:

4.2.4 Text Quote Selector https://www.w3.org/TR/2016/WD-annotation-model-20160331/#text-quote-selector

The text must be normalized before recording. Thus HTML/XML tags should be removed, character entities should be replaced with the character that they encode, unnecessary whitespace should be normalized, character encoding should be turned into UTF-8, and so forth. The normalization routine may be performed automatically by a browser, and other applications should implement the DOM String Comparisons method. This allows the Selector to be used with different encodings and user agents and still have the same semantics and utility.

If all selector references are to be w.r.t. codepoint sequences (c.f. #206) then I'm not sure the spec should be referring to text encoding. (Because we're assuming that you're annotating unicode text, not some byte sequence.)

nickstenning commented 8 years ago

Argh. Actually this isn't true. This is referring to the text content of TextQuoteSelectors, which does need to be converted to UTF-8 text because it's getting dumped into a JSON document and JSON mandates UTF-8 encoding.
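
For reference, here is roughly the shape being discussed (a sketch only; the `type`/`exact`/`prefix`/`suffix` property names follow the model's Text Quote Selector definition, while the example values are invented):

```typescript
// Sketch of a Text Quote Selector as it would appear in the JSON(-LD)
// serialization. The quoted text is a Unicode string; UTF-8 only enters
// the picture when the document is actually serialized to bytes.
interface TextQuoteSelector {
  type: "TextQuoteSelector";
  exact: string;   // the selected text itself
  prefix?: string; // text immediately before the selection
  suffix?: string; // text immediately after the selection
}

const selector: TextQuoteSelector = {
  type: "TextQuoteSelector",
  exact: "annotated text",
  prefix: "this is the ",
  suffix: " and this follows it",
};

console.log(JSON.stringify(selector));
```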

r12a commented 8 years ago

Except that 4.2.5 Text Position Selector says

The text MUST be normalized before counting the characters, in the same way as for Text Quote Selector.

So UTF-8 conversion is required for the text position selector too, as the spec is currently written. I'm assuming that some developers will want to use UTF-16 for codepoint-based counting, so perhaps the text should say 'the text should be converted to a Unicode character encoding' instead?

tilgovi commented 8 years ago

I think these normalization rules need some change.

Code points don't refer to a particular byte encoding. A single code point may be 1 or more bytes, depending on the encoding. I hope I'm using these terms correctly.

Counting code points, it shouldn't matter if the quote is UTF-8 and the text is UTF-16.
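
A quick illustration of why the byte encoding drops out when counting code points (the strings here are arbitrary examples; `TextEncoder` is the standard web API):

```typescript
const s = "\u00E9"; // "é", one code point (U+00E9)

console.log([...s].length);                           // 1 code point
console.log(s.length);                                // 1 UTF-16 code unit
console.log(new TextEncoder().encode(s).length);      // 2 bytes in UTF-8

const astral = "\u{1F600}"; // one code point outside the BMP
console.log([...astral].length);                      // 1 code point
console.log(astral.length);                           // 2 UTF-16 code units
console.log(new TextEncoder().encode(astral).length); // 4 bytes in UTF-8
```

The code point count is the same however the text happens to be stored, which is the point of specifying offsets in code points.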

Why do we specify normalization of white space, though? Why strip the selector of precision? Maybe my annotation is intended to target some whitespace.

We should maybe be clearer about Unicode normalization. I'd say let's explicitly recommend against transforming the code points, removing combining codes and such, but I'm a little worried tools might do this anyway and we might best be served by suggesting that normalization is always done. That's another performance impacting demand to make, though.

r12a commented 8 years ago

I'd like to widen the ambit of this issue slightly. (Actually i just noticed that @tilgovi is doing so also.)

For white space, see also https://github.com/w3c/web-annotation/issues/221

The other thing i'm concerned about is the phrase "and so forth". The operations described as forming part of text normalisation here are aimed at achieving more interoperable comparisons of the selector text across different encodings and user agents. An open ended 'so forth' seems to leave the door open for the implementations to apply all sorts of independent, and therefore non-interoperable changes to the text.

Also, character normalization is a part of DOM String Comparisons, but doesn't appear to be required for browsers, as i read the text of the spec. The reference to DOM string comparisons points to text that to my mind (correct me if i'm wrong) is more oriented to checking the well-formedness of the text according to XML criteria, rather than achieving an interoperable normalization form. For example, full normalisation requires that content should not start with a composing character (which i think we no longer agree on in some cases), but doesn't tell us what to do if that's not the case when reading text from the target. Perhaps the link should be to the definition of Unicode-normalized instead (https://www.w3.org/TR/2004/REC-xml11-20040204/#dt-uninorm)?

r12a commented 8 years ago

I guess i'm suggesting that there should be a finite list of operations to apply to the text, which are enumerated here and described in enough detail to ensure that the resulting strings used for the text quote selector are guaranteed to be interoperable.

azaroth42 commented 8 years ago

I accept the 4.2.5 distinction. The recommendation to use UTF-8 is because it MUST be that for recording in the JSON, but that is not part of the normalization before counting characters... which I think we have agreed to use code points for.

I disagree that normalization should be NFC, as HTML/XML whitespace normalization would not be included under that description, and there's no way for a browser client to undo the normalization that the HTML parser has already done when creating the DOM. (As far as I understand, those with more experience please correct me)

I'm happy to take out "and so forth", but unless there's an existing, accepted set of normalization operations we can refer to, I hope we can put that discussion off until we have closed more pressing issues.

iherman commented 8 years ago

Discussed with the I18N WG, 2016-05-26: (a) change should -> SHOULD (b) defer the finalization for now (for another week)

aphillips commented 8 years ago

Per I18N-ACTION-527, I looked into the text in 4.2.5 and the related text elsewhere in the current ED. Here's the current text:

The text MUST be normalized before recording. Thus HTML/XML tags should be removed, character entities should be replaced with the character that they encode, and unnecessary whitespace should be normalized. The normalization routine may be performed automatically by a browser, and other applications should implement the DOM String Comparisons method. This allows the Selector to be used with different encodings and user agents and still have the same semantics and utility. The selection MUST be based on the logical order of the text, rather than the visual order, especially for bidirectional text. The normalized value MUST be recorded as UTF-8 in the JSON serialization of the Annotation.

I also agree with @r12a's comment above: rather than a random list of potential operations, there should be a clearly defined set of operations. The problem is that the text has a normative sounding must in it, but then follows with some random suggestions and some "should" text. And it includes a reference to DOM String. As an implementer, I would be confused about what exactly is required.

If the concern here is whether to apply a Unicode Normalization Form, the WG's current position, as described in Charmod-Norm is not to apply a Unicode Normalization Form to the text. In Charmod-Norm, pay particular attention to section 3.2. The annotation specs are of the "non-normalizing" type, please note.

The text in DOM Strings referenced in the current text requires NFC and, additionally, requires fully-normalized text and include-normalized checking. In summary, these normalization requirements are meant to prevent selections from starting with a combining mark. While noble in intent, in most implementations it is difficult for the user to select text that begins with a combining mark. I'm not convinced (although I could be) that requiring fully-normalized checking at the model level is helpful. If there is a reason to apply this checking, it should be explicitly stated in the Annotation Model, not indirectly and obliquely through DOM Strings (where it will be misunderstood). In addition, by applying it to the text normalization step, you miss the important point: the normalizing algorithm probably cannot adjust the boundaries between the exact, prefix, and suffix text. The best it can probably do is mutate the text to have an extra non-combining mark (generally an NBSP) at the start of the given segment of text, which probably does more to break the text than doing nothing at all. In that case, you'd be better off supplying a MUST requirement on the quote or position selector locations--or just noting the potential problem for implementers to try to avoid (but permitting non-fully-normalized quotes or positions).

In the editor's copy, I note that there is an addition of text referring to logical (vs. visual) order and also one mentioning the use of UTF-8 for the JSON serialization. In my opinion, both of these requirements are superfluous and should be removed. The eventual use of UTF-8 is already a requirement of JSON serialization (and text can also be \u escaped in JSON). It presents no actual requirement for implementers of Text Quote Selector. Similarly, it would be better to introduce logical encoding globally in the document, perhaps in the discussion of principles or by reference to Charmod-Fundamentals and Charmod-Norm.

I would suggest using this as a basis for a revision:

The text MUST be normalized before recording by applying the following operations in this order to the source text:

  1. Conversion of the source text to a sequence of Unicode code points, including expansion of character entities and escapes to Unicode.
  2. Remove all markup, such as HTML or XML tags. Question: what to do about dir?
  3. Normalization of whitespace by collapsing all whitespace tokens to a single ASCII space character (U+0020). Note that the text MAY begin or end with a space character. i.e. no trim is implied
  4. Adjust boundaries between exact, prefix, and suffix such that none of the three begin with a combining mark and, if possible, to coincide with grapheme boundaries.
  5. Extract the exact and, if present, the prefix and suffix text.
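
(To make the proposal concrete, here is a rough sketch of steps 1-3 for the HTML case, assuming a browser DOM; steps 1 and 2 fall out of reading textContent, which already expands entities and drops tags. The boundary adjustment of step 4 and the extraction of step 5 are omitted.)

```typescript
// Rough sketch of steps 1-3 for HTML content in a browser.
// textContent yields a sequence of Unicode code points with markup
// removed and character entities/escapes already expanded.
function normalizeForSelector(el: Element): string {
  const text = el.textContent ?? "";
  // Step 3: collapse every whitespace run to a single U+0020.
  // Deliberately no trim: the result MAY begin or end with a space.
  return text.replace(/\s+/g, " ");
}
```
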
iherman commented 8 years ago

@aphillips, first of all, thank you. Personally (not being an expert) I am fine with your suggested steps to replace what is in the current ED. Just some comments:

  1. in entry no. 2 above, the final text should add a qualifier of 'if applicable'. The selector can be used for pure text or even PDF files, where that entry is irrelevant.
  2. isn't it correct that the dir attribute is only relevant to the rendering of the text? If so, I do not see any issue in simply removing it; after all, the end of the algorithm produces texts that are used for textual comparisons...
  3. I do not have any experience with the implementation of these steps, ie, whether it is difficult or easy in various environments. This is, after all, an issue for the CR phase. I think that it is worth, when going to CR, to
    1. call this out explicitly and ask implementers to provide us the necessary feedback
    2. maybe (but only maybe) identify whether some of these steps may be identified as "at risk" in advance (or the whole thing?)

The procedural importance of 3.ii above is that if implementers run into major issues and we are forced to revoke some of these steps, we can do that at the end of the CR period without being forced to re-issue a CR.

aphillips commented 8 years ago

@iherman thanks.

Regarding the dir attribute, one of the uses of markup is to provide help to the Unicode Bidirectional Algorithm (UBA) in laying out text for presentation. When the markup is removed, reducing the content to plain text, dir attributes can be replaced with the corresponding Unicode bidirectional control characters, preserving proper presentation. Several of our articles discuss this here.
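
For illustration, such a replacement could look like the following (a hypothetical sketch; the bidi isolate characters U+2066..U+2069 are one possible choice of controls, and flattenWithDir is an invented helper):

```typescript
// Hypothetical sketch: when stripping markup, replace dir attributes
// with Unicode bidi isolate controls so that the plain-text result
// still renders with the intended direction.
const DIR_TO_ISOLATE: Record<string, string> = {
  ltr: "\u2066",  // LEFT-TO-RIGHT ISOLATE (LRI)
  rtl: "\u2067",  // RIGHT-TO-LEFT ISOLATE (RLI)
  auto: "\u2068", // FIRST STRONG ISOLATE (FSI)
};
const PDI = "\u2069"; // POP DIRECTIONAL ISOLATE

function flattenWithDir(el: Element): string {
  let out = "";
  for (const node of Array.from(el.childNodes)) {
    if (node.nodeType === Node.TEXT_NODE) {
      out += node.textContent ?? "";
    } else if (node instanceof Element) {
      const dir = node.getAttribute("dir")?.toLowerCase();
      const inner = flattenWithDir(node);
      // Wrap the element's text in isolate controls if it carried dir.
      out += dir && DIR_TO_ISOLATE[dir]
        ? DIR_TO_ISOLATE[dir] + inner + PDI
        : inner;
    }
  }
  return out;
}
```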

Regarding difficulty of implementation, most of my suggested text is straightforward to implement, but the boundary adjustment idea is a little hand-wavy. As I mentioned before, if a human is performing the text selection, it's difficult to select text that doesn't fall on a grapheme boundary. But programmatic access has to be taken into account as well. It would be easier on developers to say nothing and permit the boundary to fall on any character boundary, since generally the boundary won't "fall anywhere". But from a Unicode point of view, it would be better to specify grapheme boundaries or at least base-character starts.

duerst commented 8 years ago

Regarding

Normalization of whitespace by collapsing all whitespace tokens to a single ASCII space character (U+0020). Note that the text MAY begin or end with a space character. i.e. no trim is implied

That has the problem that it leaves spaces in East Asian texts where they may not be desired.

iherman commented 8 years ago

Regarding

Normalization of whitespace by collapsing all whitespace tokens to a single ASCII space character (U+0020). Note that the text MAY begin or end with a space character. i.e. no trim is implied ... that has the problem that it leaves spaces in East Asian texts where they may not be desired.

I am not sure I understand. The only goal of this section is to provide a canonical version of the text for unequivocal comparison. What does "may not be desired" mean in this respect?

fsasaki commented 8 years ago

Hi all, just to emphasize one point that Ivan made: selectors are not only for HTML/XML markup. Hence, in the algorithm proposed at https://github.com/w3c/web-annotation/issues/227#issuecomment-222330988, the step "Remove all markup, such as HTML or XML tags." is not applicable to other content formats on the Web. PDF is just one example format.

On the step "Normalization of whitespace by collapsing all whitespace tokens to a single ASCII space character (U+0020).": for certain markup vocabularies (and for non-markup content types as well), certain types of elements need to preserve white space. E.g. for the HTML pre element you would not want to remove white space. Emphasizing again: web annotation is for any type of web content. E.g. if I am putting DocBook content on the web and want to annotate programlisting elements, their whitespace should be preserved.

IMO, for the above reasons, the qualifier 'if applicable' is very important. I assume that many implementers will leave white space handling to the underlying library that handles low-level content parsing. For example, during the ITS 2.0 development, I developed an implementation that parsed HTML content using validator.nu. The white space handling was left to that library. I assume the same for others.

iherman commented 8 years ago

@aphillips you say:

Regarding the dir attribute, one of the uses of markup is to provide help to the Unicode Bidirectional Algorithm (UBA) in laying out text for presentation. When the markup is removed, reducing the content to plain text, dir attributes can be replaced with the corresponding Unicode bidirectional control characters, preserving proper presentation. Several of our articles discuss this here.

But, again, the normalization is not done for display, but only for a canonical form for comparison. Doesn't that mean that all this can be ignored in this case?

aphillips commented 8 years ago

I don't believe, @iherman, that this is all the snippeting is used for. Or at least, there exist a number of use cases where the text is later presented to a human and not merely used by a machine for comparison.

When thinking about this problem, I have to admit that I was thinking about use cases from my day job, where I have been involved in actually implementing annotations and snippeting (but which I can't talk about here). One snippeting process is capturing user highlights outside a document ("scrapbooking") which, along with the case shown in the document, is where you don't just want the positions, but also a copy of the text.

@fsasaki: I don't necessarily agree. Removing markup from the quoted text, regardless of the source format, is desirable, since you don't want markup in the plain text. I'm very mindful that not just *ML is a target here. While PDF content doesn't contain markup generally, other non-HTML/XML content types that might appear in a Web context would also want their markup removed when quoting. For example, you wouldn't want WebVTT, CSV, or RTF markup in the snippets either. I think the goal is to only present the user-facing text.

I also recognize that whitespace normalization would destroy "layout" such as represented by pre. I think this is expected. If one wants document fidelity, use text positions and extract the layout, not just the plain text. The problem is that, once you get into doing some whitespace normalization, you can't just leave it up to the implementation to decide. Some will send spaces while some won't. Later comparison, such as @iherman suggests, is more difficult if the normalization isn't, er, normalized.

@duerst: the algorithm doesn't introduce any spaces that weren't already there in the text. If the original text contains spaces, those spaces will remain (with collapsing) in the final text. The one exception is line/paragraph breaks: these would become a space with the proposed text as written. It's a valid question whether line breaks should be converted to space.

That said, not performing trim helps East Asian texts because it prevents implementers from introducing spaces algorithmically when reassembling text later.

I must admit that I'm curious why Text Quote Selectors exist without reference to position. If they were a special case of Text Position Selector, wouldn't that work more reliably? After all, some texts are highly repetitive. Without a position number, the quote selector might match many places in the source document.

fsasaki commented 8 years ago

I also recognize that whitespace normalization would destroy "layout" such as represented by pre. I think this is expected. If one wants document fidelity, use text positions and extract the layout, not just the plain text.

The tools I am referring to are using text to process content as part of a text analytics pipeline. Text analytics tools as of today only understand plain text. So at some point in the text analytics pipeline you have to get rid of the markup - and have to decide what to do with the white space. That decision will always be format specific, see the DocBook programlisting example. So my point is: you can describe the steps needed only to some extent. At some point implementations have to look into the specifics of the formats they process. That is why step two at https://github.com/w3c/web-annotation/issues/227#issuecomment-222330988 is hard to formulate as a MUST requirement. I am coming from an XLIFF extraction and merging point of view - which is the same as text analytics, with putting the outcome into the original content again (= roundtripping). Specs like XLIFF wisely do not specify the details of such processes, but say "be careful about them" - and then there are - to some extent, not enough - format specific guidelines on how to do this. I am not asking for such guidelines here, just trying to explain how big this Pandora's box is.

iherman commented 8 years ago

@aphillips

I don't believe, @iherman, that this is what the snippeting is used for. Or at least that there exist a number of use cases where the text is later presented to a human and not merely used by a machine for comparison.

I do not see/understand this use case. The text quote selector is defined to select a segment of a resource, and the normalization is there to make the selection process unequivocal. If the original selected segment is to be presented to the user too, that is a disjoint process which does not necessarily involve what is used by the selector for the comparison (ie, what is stored internally by the user agent when a Selector is defined).

@fsasaki: I don't necessarily agree. Removing markup from the quoted text, regardless of the source format, is desirable, since you don't want markup in the plain text.

Why not? If the media type is plain text, then I do not believe the annotation tools should try to interpret the content. Plain text means that it should be considered as a bunch of characters without any semantics. In other words, I actually believe that no markup should be removed if the format is plain text. The markup is the user-facing text in that case. Ie, I stand by my original comment that the "if applicable" is very much in order; markup should be removed only if it is an XML/HTML media type.

I must admit that I'm curious why Text Quote Selectors exist without reference to position. If they were a special case of Text Position Selector, wouldn't that work more reliably? After all, some texts are highly repetitive. Without a position number, the quote selector might match many places in the source document.

Selectors may be combined and also used to refine one another. One can, for example, use a range selector making use of text position selectors to define a range of the text, and then refine the results using a text quote selector, as sketched below.
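
A simplified sketch of such a composition, using a single text position selector rather than a full range selector (the `refinedBy` property name follows the model's refinement-of-selection section; the offsets and text are invented):

```typescript
// A Text Position Selector narrows the search space; the refining
// Text Quote Selector then locates the exact text within that range.
const target = {
  source: "http://example.org/page1.html",
  selector: {
    type: "TextPositionSelector",
    start: 412,
    end: 795,
    refinedBy: {
      type: "TextQuoteSelector",
      exact: "against",
      prefix: "Rage, rage ",
      suffix: " the dying",
    },
  },
};
```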

azaroth42 commented 8 years ago

Or at least that there exist a number of use cases where the text is later presented to a human and not merely used by a machine for comparison.

I agree that there are use cases where snipped text is presented to a human, however this particular issue is about describing the text sufficiently accurately in the annotation's representation such that a second, consuming user agent can discover the correct segment in the full textual content. As such, rendering concerns are out of scope in this situation.

Further, the Specific Resource identifies the selected content, not the Selector, which describes how to discover the selected content. The SpecificResource could have a separate property (e.g. value) that contains the content to be rendered for the user. Note the distinction with URI Fragments, which both identify (it's a URI) and describe (by means of the fragment) the content. Specific Resources intentionally pull apart these two functions.

Regarding markup, it seems like a very slippery slope. If the annotated content is markdown, such as these github comments, then some *s are just characters and some are bullets.

We also previously agreed not to normalize whitespace in #221 after discussion with #i18n. There now seems to be a request to put that back. I don't want to flip-flop unnecessarily, so can we identify the situations in which whitespace normalization is (a) helpful and (b) unhelpful? The "if applicable" seems to be getting back into the fuzzy-rules realm that we were trying to escape from.

aphillips commented 8 years ago

@azaroth42 Actually, I treated the normalization of whitespace as your requirement. If there is no requirement to normalize whitespace on your side, then I concur that normalizing whitespace can be removed. The note about no trim being implied, however, probably needs to remain.

I'm a little concerned about how a TextQuoteSelector that contains only the exact, prefix, and suffix text can:

...describing the text sufficiently accurately in the annotation's representation such that a second, consuming user agent can discover the correct segment in the full textual content

If all you have is the sentence, then you can't tell which repetition is expected. By way of example, I give you:

Do not go gentle into that good night, Old age should burn and rave at close of day; Rage, rage against the dying of the light.

Though wise men at their end know dark is right, Because their words had forked no lightning they Do not go gentle into that good night.

Good men, the last wave by, crying how bright Their frail deeds might have danced in a green bay, Rage, rage against the dying of the light.

azaroth42 commented 8 years ago

Totally agreed, which is why prefix and suffix are important, not just the exact matching text :)

Agree also that no-trim is unintuitive enough to warrant a note.

aphillips commented 8 years ago

@azaroth42 Well, but if you want to say (using the use case in the document) that "against" is misspelled (an incorrect assertion in this case), then you might have:

exact: against
prefix: Rage, rage
suffix: the dying of the light

And now you can't tell which "against" is meant. Computing a long enough prefix/suffix to ensure uniqueness seems difficult for the implementation. I suppose that this calls for using it in composition as suggested by @iherman

fsasaki commented 8 years ago

I suppose that this calls for using it in composition as suggested by @iherman

See also https://www.w3.org/TR/annotation-model/#refinement-of-selection

r12a commented 8 years ago

the impression i got from talking with folks in meetings is as follows: Boundaries for user selection are handled by the implementation with which the selection is made, and are not described in this spec. We would hope that they would do something sensible, ie. not allow selections that split grapheme clusters. Any text normalization specified by the model spec, as i understood it, is intended to make it easier to match the text against alternative forms of the same document.

I'm still not entirely sure what those alternative forms would be exactly, since the target is closely defined as a specific resource.

I also find myself thinking more about why we want to normalize away the markup. I'm assuming it's because we expect the text quote selector text to match text extracted from the DOM using such things as document.body.textContent. If that's the case, i wonder whether it's appropriate to express this as a separate normalization step in the paragraph we have been talking about, or whether to just assume that it falls out of the recommendation in the following paragraph anyway (about generating the Text Position Selector values from DOM Level 3 APIs).

So i guess i'm asking the person who added the phrase about normalizing away the markup to the spec why they did so, so that we can better assess the appropriateness.

iherman commented 8 years ago


the impression i got from talking with folks in meetings is as follows: Boundaries for user selection are handled by the implementation with which the selection is made, and are not described in this spec. We would hope that they would do something sensible, ie. not allow selections that split grapheme clusters. Any text normalization specified by the model spec, as i understood it, is intended to make it easier to match the text against alternative forms of the same document.

I'm still not entirely sure what those alternative forms would be exactly, since the target is closely defined as a specific resource.

Yes, but a selector may return several instances of the selection (which may then be refined by another selector, as referred to elsewhere in the thread). And if there were copy paste-s done when putting together the text, then the representation of the same text may be slightly different within the text… Hence the normalization.

I also find myself thinking more about why we want to normalize away the markup. I'm assuming it's because we expect the text quote selector text to match text extracted from the DOM using such things as document.body.textContent. If that's the case, i wonder whether it's appropriate to express this as a separate normalization step in the paragraph we have been talking about, or whether to just assume that it falls out of the recommendation in the following paragraph anyway (about generating the Text Position Selector values from DOM Level 3 APIs).

So i guess i'm asking the person who added the phrase about normalizing away the markup to the spec why they did so, so that we can better assess the appropriateness.

That would be @azaroth42...

r12a commented 8 years ago

And if there were copy paste-s done when putting together the text, then the representation of the same text may be slightly different within the text… Hence the normalization.

@iherman too many 'text' words there for me to be sure what you're saying. The only way i can see to understand this is if the Text Position Selector values are manually created by users looking at the target text and typing what they think they see into the annotation body. Is that a valid use case?

I don't see how normalization helps distinguish between possible matches when there are multiple alternative ranges of text in the target document that match the text position selector values. If anything, i'd have thought it would do the opposite, by removing idiosyncratic differences, which is what normalization is about. If you want to find all possible matches, then that's fine, but i think that here we want to find the unique match where possible, no?

iherman commented 8 years ago


And if there were copy paste-s done when putting together the text, then the representation of the same text may be slightly different within the text… Hence the normalization.

@iherman too many 'text' words there for me to be sure what you're saying. The only way i can see to understand this is if the Text Position Selector values are manually created by users looking at the target text and typing what they think they see into the annotation body. Is that a valid use case?

That is not what I meant. Imagine that File.html contains the word "Iván" twice. However, the way File.html was created is such that somebody copy-pasted text from File1.html and then from File2.html. The first contained "Iván", the other contained "Iva´n" (I mean the relevant Unicode encodings are different). The end result is that the word "Iván" appears in File.html in two different internal formats.

Then somebody wants to annotate File.html, and wants to annotate the various "Iván"-s. The system would put "Iván" into the Text Quote Selector for exact match. The only way the match would really work is to have the normalization...
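
In code, the two internal forms and the effect of normalizing before comparison (using the standard String.prototype.normalize):

```typescript
const precomposed = "Iv\u00E1n";  // "Iván" with á as one code point (U+00E1)
const decomposed  = "Iva\u0301n"; // "Iván" as a + combining acute (U+0301)

console.log(precomposed === decomposed);  // false
console.log(precomposed.normalize("NFC") ===
            decomposed.normalize("NFC")); // true
```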


r12a commented 8 years ago

but don't you want to select only one of those Ivans in File.html? Isn't that why you go to the trouble of picking prefix and suffix text, to narrow down the possibilities for matching and, as the spec says, "to distinguish between multiple copies of the same sequence of characters"? If you normalize the text before matching you do the opposite: you can no longer tell whether the annotation should be associated with "Iván" or "Iva´n", whereas before, by happy accident, you had that possibility.

iherman commented 8 years ago


but don't you want to select only one of those Ivans in File.html?

Not necessarily. I may want to create an annotation saying, in English terms, "Iván is a jerk". This should be an annotation on the whole list, so to say…


tilgovi commented 8 years ago

The clearest and most straightforward thing is to describe the text as it appears originally.

If you wanted to select both Iván, it would be fine for an implementation to either let you select both using a pointing device or perform a find over some normalized version of the text (normalization and case folding and such are common for search). In both cases, the implementation should understand that the two instances of the word are different and it can describe each as precisely as it wishes to.

Normalization as a recommendation to me is just a bad idea. It is lossy and annotation often requires precision. These are at odds.

azaroth42 commented 8 years ago

Normalization is for ease of comparison plus robustness across formats. For example, if you don't normalize whitespace from HTML and then search for it in text/plain where the whitespace is meaningful (and hence has been normalized already because users will see it), then it won't match. Similarly for entities, tags and other markup.

Of course, any search will be heuristic as to what matches and what doesn't.

I don't see a solution that will get us to a CR text by the end of this week?

tilgovi commented 8 years ago

Normalization is for ease of comparison plus robustness across formats.

I'm suggesting that some normalization actually hinders comparison unless that normalization can be very precisely specified.

I don't see a solution that will get us to a CR text by the end of this week?

I actually find the current editor's draft text totally acceptable. It's only the lingering possibility around this thread that we normalize whitespace or Unicode that I oppose.

For whitespace, I oppose it because it can be very hard to determine what white space is meaningful. I may have a text/plain document that uses manual line breaks, as is common with code documentation, where only two successive line breaks are really "logical" line breaks. In HTML, it's necessary to resolve the CSS styling to know whether white space is preserved or not. And so forth.

For Unicode, I oppose it because the W3C already recommends NFC for the Web [1]. We should assume documents already contain normalized Unicode forms and not put the burden on implementers of annotation to do so. Furthermore, we can easily imagine trivial use cases for annotation where it would be undesirable. Perhaps I want to make an HTML validator that warns about non-normalized characters. I would need to annotate the specific, non-normalized text to mark it as such.

So, I think the text has improved as a result of discussion on this issue and I find it satisfactory now. My previous comments should be taken to mean that I believe we have arrived at a reasonable description of appropriate normalization, namely the normalization that is already done automatically by browsers if you use textContent (strip tags, convert character entities, preserve white space and Unicode forms).

[1] https://www.w3.org/International/questions/qa-html-css-normalization
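
A quick demonstration of that browser behavior (a hypothetical snippet; run in a DOM environment):

```typescript
// textContent strips tags and expands character entities, but leaves
// whitespace and Unicode forms exactly as they appear in the DOM.
const div = document.createElement("div");
div.innerHTML = "a &amp; <b>b</b>\n   c";
console.log(JSON.stringify(div.textContent)); // "a & b\n   c"
```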

iherman commented 8 years ago

Taking into account that we need a week to have a review of the documents before voting on publishing, if we don't start this coming Monday, we slip our schedule again. We have one open issue right now; can we try to get specific proposals on how to close the issue, please? We have been dragging this last issue out for more than two weeks now...


tilgovi commented 8 years ago

My specific proposal is to close the issue as resolved.

The most problematic text was "unnecessary whitespace should be normalized, character encoding should be turned into UTF-8, and so forth". In particular, the vagueness of "and so forth" was an issue and that has been fixed.

Addison brought up the recommendation against Unicode normalization several comments back, and I only kept responding because Ivan's latest comment seemed to suggest that it might be desirable.

The language as written now in the editor's draft is reasonable to me:

The text MUST be normalized before recording in the Annotation. Thus HTML/XML tags SHOULD be removed, and character entities SHOULD be replaced with the character that they encode. The normalization routine may be performed automatically by a browser, and other applications SHOULD implement the DOM String Comparisons method.

iherman commented 8 years ago

So, I think the text has improved as a result of discussion on this issue and I find it satisfactory now.

Thanks. This is a specific proposal: close the issue with what is in the ED.

In view of the dissenting opinions expressed in this thread, it seems that no other text may get a better consensus... Works for me, thus.

aphillips commented 8 years ago

I disagree. The current text is a muddle because it requires (MUST) a (lowercase-n, non-Unicode) normalization, but only specifies a series of non-specific SHOULD operations, one of which (the DOM String Comparisons stuff) introduces Unicode Normalization Forms.

It sounds like your WG has not achieved consensus on what the requirements for the text normalization are, in which case I would tend to want to relax the MUST or improve the SHOULDs. That is, I think you'd be better off changing the text to match my suggestions earlier, with changes to "all markup" as appropriate (based on @fsasaki's and others' comments).

tilgovi commented 8 years ago

My mistake. If DOM String Comparisons introduces Unicode Normalization Forms then I support removing that text.

r12a commented 8 years ago

[1] I'm going to try to be careful about terminology here. When i refer to 'character normalization' i mean Unicode normalization forms (NFC, etc), as well as all the stuff referred to in the DOM String Comparisons. I am NOT referring to whitespace normalization. I believe we already agreed to remove the whitespace stuff, and it is gone from the text in the ED version. We have also dealt with the UTF-8 conversion, and the 'so forth', as @tilgovi says.

[2] CHARACTER NORMALIZATION: If Ivan's requirement to be able to match text in the Text Quote Selector against text with different character normalizations in the target document holds (and i have only heard him confirm that), then i don't see the value of normalizing the text on the way into the model framework. You'll still have to do normalization at the point of comparison to achieve a match. I'm therefore inclined to drop the requirement for text to be character normalized before recording.

That includes dropping the reference to DOM String Comparisons, which as i said before i don't think is really what you were looking for anyway; you were thinking of standard Unicode normalization forms.

[3] MARKUP/ESCAPE REMOVAL: I'm still waiting for an answer from the WA WG to the second question at https://github.com/w3c/web-annotation/issues/227#issuecomment-222973597 in order to form a view on whether or not tag and escape folding should be part of the 'normalization paragraph'. I suspect that it shouldn't, but that it is just part of the method described for DOM Level 3 APIs. You don't want to strip markup or escapes from plain text sources that contain it, because they are examples.

[4] We're trying to give you what advice we can, but we're not hearing much back that's clear and definite, and based on this thread, I share Addison's concern that perhaps the WA WG doesn't really know why that normalization stuff is there. Personally, I'd be inclined to remove the whole paragraph.

tilgovi commented 8 years ago

I suspect that it shouldn't, but that it is just part of the method described for DOM Level 3 APIs. You don't want to strip markup or escapes from plain text sources that contain it, because they are examples.

Makes sense. I only didn't take issue with it because, in the context of the sentence about HTML/XML, it didn't imply removing markup from plain text sources, to me.

I would support removing it, though. And the DOM String Comparison part was my mistake not to notice.

I share Addison's concern that perhaps the WA WG doesn't really know why that normalization stuff is there.

If we can work through a proposal here and discuss it on our call on Friday, we could stay on schedule. I really appreciate your help clarifying these issues with us.

Personally, I'd be inclined to remove the whole paragraph.

I'm also fine with that result. It would be especially nice not to be mentioning specific formats like HTML/XML, and instead issuing guidelines here, if we say anything at all. In HTML/XML, it might be well understood that the text is, well, the text. After all, they are called "Text" nodes. In another application domain, the markup may well be part of the text.

Rob, do you think we have enough information in this thread to simplify the language further and propose something that the WG could agree to on Friday?

azaroth42 commented 8 years ago

I'm sorry, but I disagree with leaving in the markup. That's a Data Quote Selector, not a Text Quote Selector. If I select text from a PDF, following the logic of leaving in markup from HTML or other formats, I would leave in all of the binary escape codes that control how PDFs are rendered. Or a Word document, or Postscript. I don't believe that we want that!

r12a commented 8 years ago

Just so i'm clear, when i referred to leaving in markup it was only for plain text docs, ie. not HTML/XML. If that's contentious or it's difficult to separate those i'll not push it.

aphillips commented 8 years ago

We discussed this in our WG call today and I drew the action item to update this thread :-)

I18N recommends that the "normalization" paragraph be removed unless/until specific requirements are developed. In addition, we don't believe that Unicode Normalization, either directly or indirectly through DOM String Comparison, should be applied. Whether whitespace or markup normalizations are applied depends on your WG's requirements, not on any specific I18N concern.

We also suggest that a health warning about the need to Unicode-normalize on comparison (matching of the TextQuoteSelector to the source text) should be included, provided that you intend differently-encoded-but-Unicode-equivalent sequences to match (the "Iván"/"Iva´n" discussion above). If that is not your intention, then you might consider the counter health warning (that distinct sequences that represent the same "logical" character will not match each other) with a pointer to Charmod-Norm.

iherman commented 8 years ago

Discussed on 2016-06-03, accepting the I18N WG proposal. The paragraph has been reshaped essentially according to the request:

The text MUST be normalized before recording in the Annotation. Thus HTML/XML tags SHOULD be removed, and character entities SHOULD be replaced with the character that they encode. Note that this does not affect the state of the content of the document being annotated, only the way that the content is recorded in the Annotation.

RESOLUTION: Remove DOM string comparison, UTF-8, and avoid implications that comparison should be part of the normalization routine

See http://www.w3.org/2016/06/03-annotation-irc#T15-34-08

azaroth42 commented 8 years ago

Done.

duerst commented 8 years ago

I wrote:

Normalization of whitespace by collapsing all whitespace tokens to a single ASCII space character (U+0020). Note that the text MAY begin or end with a space character. i.e. no trim is implied

That has the problem that it leaves spaces in East Asian texts where they may not be desired.

@iherman replied: I am not sure I understand. The only goal of this section is to provide a canonical version of the text for unequivocal comparison. What does "may not be desired" mean in this respect?

East Asian text is usually written without spaces. In some contexts (e.g. HTML source), line breaks and spaces at the start of a line are present to make editing easier, but these ideally should be totally normalized away rather than collapsed to a single space.