over-specification in Conformance: Unicode normalization

aphillips commented 4 years ago

Section: Conformance: Unicode normalization https://www.w3.org/TR/webvtt/#unicode-normalization

Implementations of this specification must not normalize Unicode text during processing.

For example, a cue with an identifier consisting of the characters U+0041 LATIN CAPITAL LETTER A followed by U+030A COMBINING RING ABOVE (a decomposed character sequence), or the character U+212B ANGSTROM SIGN (a compatibility character), will not match a selector targeting a cue with an ID consisting of the character U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE (a precomposed character).
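
As a rough illustration of the example quoted above, here is a sketch in Python using the standard `unicodedata` module. The variable names are illustrative only and nothing here is taken from the WebVTT spec. The three identifiers are distinct under a plain code point comparison and only become equal once a normalization form is applied:

```python
import unicodedata

decomposed = "A\u030A"   # U+0041 followed by U+030A COMBINING RING ABOVE
angstrom = "\u212B"      # U+212B ANGSTROM SIGN
precomposed = "\u00C5"   # U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE

candidates = (decomposed, angstrom, precomposed)

# Non-normalizing comparison, as the spec text requires:
# only the identical code point sequence matches.
raw_matches = [s == precomposed for s in candidates]

# If an implementation normalized to NFC first, all three would match.
nfc_matches = [unicodedata.normalize("NFC", s) == precomposed for s in candidates]
```

Under the current requirement only the last candidate would match a selector written with U+00C5.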

The I18N WG noticed the above conformance requirement recently and discussed it in recent teleconferences.

Unicode normalization is only one consideration that affects processing of WebVTT and its operations (such as the matching of cue identifiers). While this requirement is consistent with our recommendations and intentions, we'd suggest that you consider a more expansive approach as documented in our document Charmod-norm, particularly section 3.1. Unless there is a special reason that our WG is unaware of, WebVTT is not especially sensitive to variations, so a case-sensitive non-normalizing matching for cue identifiers makes sense to us.

The other concern we have is that this requirement forbids any and all normalization when processing a WebVTT document, not just when performing operations such as cue ID matching. Is there a reason to extend a processing requirement to the entire document? Or to forbid normalization when converting character encodings? (Although, if memory serves, WebVTT doesn't support encodings other than UTF-8, so this may not apply.)

gkatsev commented 4 years ago

I tried to find the history behind this and it seems like it was mostly to make sure that --> in the cue timing and settings line is easily distinguishable. But maybe @silviapfeiffer has more background on this.
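
One way normalization could interfere with finding the arrow is sketched below. This is a guess at the kind of problem involved, not the documented rationale. Compatibility normalization (NFKC) folds fullwidth characters to ASCII, so a line with no literal `-->` can acquire one. The sketch is in Python and the line of text is made up:

```python
import unicodedata

# Two U+FF0D FULLWIDTH HYPHEN-MINUS followed by U+FF1E FULLWIDTH GREATER-THAN SIGN.
line = "see \uFF0D\uFF0D\uFF1E here"

# Searching the text as authored finds no ASCII arrow.
has_arrow_raw = "-->" in line

# Canonical normalization (NFC) leaves the fullwidth characters alone...
has_arrow_nfc = "-->" in unicodedata.normalize("NFC", line)

# ...but compatibility normalization (NFKC) turns them into "-->".
has_arrow_nfkc = "-->" in unicodedata.normalize("NFKC", line)
```

A parser that normalized with NFKC before looking for the timing line could therefore see an arrow the author never wrote.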

Personally, I don't really know much about Unicode normalization, but it may make sense to allow it for the cue text? Given that WebVTT is very strict about not normalizing right now, it's a lot easier to relax that than it would be to go the other way. I'll have to read up on charmod-norm.

Finally, WebVTT does specify that the file be UTF-8 encoded.

dwsinger commented 4 years ago

I think the hint is "such as the matching of cue identifiers" in Addison's comment, and that we wanted matching to be doable by byte equality rather than string equivalence.

For the text of cues, I agree, I can't think why we would care.

silviapfeiffer commented 4 years ago

This was brought in quite early on, before my time as editor. From my reading, it doesn't just relate to identifiers, but to all of WebVTT parsing. I believe it may be that a lot of the parsing rules rely on byte equality, as Dave is saying. Note that WebVTT is not XML but a text format with some markup, and it is quite strict.

What problems do you see arising from being this strict?

css-meeting-bot commented 4 years ago

The Timed Text Working Group just discussed WebVTT: over-specification in Conformance: Unicode normalization w3c/webvtt#483, and agreed to the following:

SUMMARY: Investigation of impact to continue.

The full IRC log of that discussion

<nigel> Topic: WebVTT: over-specification in Conformance: Unicode normalization w3c/webvtt#483
<nigel> github: https://github.com/w3c/webvtt/issues/483
<nigel> Gary: This issue came from the i18n working group, about Unicode normalisation.
<nigel> .. WebVTT specifically disallows this, and says to compare the bytes directly.
<nigel> .. The issue raised is that it is not what we want, potentially.
<nigel> .. I don't have much knowledge personally of why you would want or not want to do it.
<nigel> .. From digging around in the history, it sounds like it was mostly to make sure that
<nigel> .. things that are required in WebVTT are easy to identify like the arrow in the time
<nigel> .. signature so that we aren't matching normalised Unicode and can find it more easily.
<nigel> .. I want to ask if anyone had more knowledge about it, or if TTML or IMSC handle
<nigel> .. Unicode normalisation.
<nigel> Nigel: I think in TTML it is delegated to XML so whatever XML says, which we assume is
<nigel> .. the correct thing, is what happens.
<nigel> Gary: Yes. It's relevant that WebVTT is not XML but a text format with markup.
<nigel> .. David Singer said that for the text of the cues we could do normalisation, but even that
<nigel> .. might be a bit more complicated because HTML tags are allowed to be used.
<nigel> Nigel: Also what about metadata payload in the cues?
<nigel> .. For example if it is JSON, does that specify Unicode normalisation? I do not know.
<atsushi> https://infra.spec.whatwg.org/#json
<nigel> Atsushi: In JSON I believe that it depends on the processor for values
<gkatsev> -> https://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf json specification
<nigel> Gary: The spec is quite small and just says it is a sequence of Unicode code points.
<cyril> rrsagent, pointer
<RRSAgent> See https://www.w3.org/2020/04/30-tt-irc#T15-20-07
<nigel> Atsushi: I think currently in WebVTT, case sensitive non-normalised matching is defined.
<nigel> Nigel: The issue is that it is _not_ using that.
<nigel> Atsushi: I think the linked document was written after work on WebVTT began.
<nigel> .. The standard operation was written after WebVTT so maybe even if the result is the same
<nigel> .. but some text is over-specified in the current standard.
<nigel> Cyril: A different angle: do we have tests for this in WPT?
<nigel> Gary: For Unicode normalisation?
<nigel> Cyril: Yes, to match the MUST NOT in the spec.
<nigel> Gary: I'm not sure
<nigel> Nigel: Are you thinking about if we can establish what implementations do now via the tests?
<nigel> Cyril: Yes
<nigel> Gary: From a quick look I'm not seeing anything specific to Unicode.
<nigel> Nigel: I'm a bit confused about where the line is drawn between parsing the WebVTT
<nigel> .. document e.g. during processing, cue matching etc. and text presentation.
<nigel> .. If some payload text is passed onto a text renderer and there's a step that does
<nigel> .. normalise the text, is that broken, according to the spec text in §2.2?
<nigel> Gary: The example is about cue matching, which is very specific.
<nigel> Nigel: Is "processing" a defined term?
<nigel> Gary: It could refer to the "processing model" part of the spec.
<nigel> .. That would make sense because that's when you would be applying styling and whatnot.
<nigel> Atsushi: I am not sure that there is any case that is not covered by "case sensitive non-normalising"
<nigel> .. if there is no such case then I suppose it may be possible to write it into the standard
<nigel> .. in a simpler way.
<nigel> Gary: You mean to link to the charmod-norm spec to the section that matches what
<nigel> .. we want to do in WebVTT?
<nigel> Atsushi: Actually the character model normalisation is not a Rec track doc but a WG note
<nigel> .. so it cannot be normative. You would need to copy and paste the spec text.
<nigel> .. Recently there are several standards that say this kind of thing so having this kind of
<nigel> .. spec may be easier for readers and may not have some strange cases.
<nigel> .. The last point of the issue comment is for character encoding, but I'm not sure if we need
<nigel> .. to have this strong restriction for later processing by scripts or web browser.
<nigel> scribe: [not sure I got that very well]
<nigel> Gary: You mean from cue text?
<nigel> Atsushi: Yes
<nigel> Nigel: Does the requirement that WebVTT is always UTF-8 make some of the concern
<nigel> .. disappear here?
<nigel> Atsushi: I need to think about that more.
<nigel> .. At this moment I don't see any difference between the suggestion and the current
<nigel> .. spec text and description.
<nigel> Nigel: Not sure how we move to a resolution on this. Gary?
<nigel> Gary: I think I need to read up on the charmod-norm first and it would be good to get
<nigel> .. clarification on how WebVTT being specified as UTF-8 affects/does not affect things.
<nigel> .. It does sound like it might be okay to change how we handle the cue text normalisation
<nigel> .. but we likely don't want to do that for other parts of WebVTT.
<nigel> SUMMARY: Investigation of impact to continue.

gkatsev commented 4 years ago

I spent a bit of time looking into this. There's a blog post from Anne van Kesteren about Unicode normalization (https://annevankesteren.nl/2009/02/unicode-normalization) which made a lot of sense to me. The conclusion is not to normalize. Given that HTML and CSS also do not normalize, we should do the same. I think what we have now fits that criterion.

Also, given that we have embedded CSS and also have HTML syntax in the cues, we should apply this to all the WebVTT text and not just the cue settings line.

I've also re-read the original post, it sounds like the question here is around the specific language used rather than what is specifically said?

a case-sensitive non-normalizing matching for cue identifiers makes sense to us.

Seems to match what we do, though, definitely said in different words. Is the ask to update it based on language from charmod?

r12a commented 4 years ago

I think that the key points that the i18n WG wanted to make here are that:

we agree that you shouldn't normalise identifiers when matching (just like HTML class names don't match CSS selectors if the text is precomposed in one, and decomposed in the other).
if, however, one was to perform a text operation on a bit of natural language text (eg. the text displayed on screen in a cue as supplied by the author) which compares two pieces of natural language text (eg. during a search), then one should normalise (and also do case-folding, and probably various other transformations). This is because we don't think we should require authors to use a particular normalised form when authoring the text. (Generally, natural language text will be in NFC most of the time anyway, but there are situations, such as for Vietnamese text, or for careful construction of example texts, etc. where a non-NFC version of the text is likely or sometimes needed.)
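
To make the distinction between the two points concrete, here is a minimal Python sketch. The `search_key` helper is hypothetical and covers only normalization and case-folding, not the "various other transformations" mentioned above. Identifier matching stays a plain comparison, while a natural-language search compares keys built by decomposing, case-folding, and decomposing again:

```python
import unicodedata


def search_key(text: str) -> str:
    """Build a key for caseless, normalization-insensitive text comparison."""
    folded = unicodedata.normalize("NFD", text).casefold()
    # Case-folding can itself produce unnormalized output, so normalize again.
    return unicodedata.normalize("NFD", folded)


# Cue text as authored: precomposed U+1EC7 (e with circumflex and dot below).
cue_text = "Vi\u1EC7t"

# A user's query: uppercase, decomposed, with the marks in typing order.
query = "VIE\u0302\u0323T"

# Identifier-style matching would treat these as different strings.
exact_match = cue_text == query

# A search over natural language text would treat them as the same word.
search_match = search_key(cue_text) == search_key(query)
```

The author is free to use either form in the file; only the search step decides how loosely to compare.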

The original trigger for the comment was:

Implementations of this specification must not normalize Unicode text during processing.

This just seemed too general a statement. The note that follows goes on to mention identifiers, which is good, but the normative text is not precise enough. If it said something along the lines of:

Implementations of this specification must not normalize Unicode text in identifiers during processing.

that might address the issue.

Does that help?

nigelmegitt commented 4 years ago

if, however, one was to perform a text operation on a bit of natural language text

Sounds like the key issue is that "during processing" is too broad, and apparently excludes some reasonable processing that is out of scope of the spec. Downstream text operations such as searching or indexing the natural language text might well need to do some normalisation, depending on exactly what they are intending to achieve.

Would it help to be really explicit that any hand-over of content originating in the WebVTT document from the WebVTT processor to some other downstream processor will not have had any normalisation applied, so that if they want/need to do it then they know the state of incoming data?
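
A sketch of what that contract could look like from the downstream side, in Python. The `Cue` class and `index_key` helper are hypothetical and not part of any WebVTT API. The cue arrives with its code points untouched, and the consumer normalizes its own copy if it needs to:

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Cue:
    identifier: str
    text: str  # exactly the code points from the file, never normalized


def index_key(cue: Cue) -> str:
    """Key for a downstream search index; normalization happens here, not in the parser."""
    return unicodedata.normalize("NFC", cue.text).casefold()


# "Cafe" followed by U+0301 COMBINING ACUTE ACCENT, as it might appear in a file.
cue = Cue(identifier="A\u030A", text="Cafe\u0301")

key = index_key(cue)
```

Because the consumer knows the incoming text is unnormalized, it can pick whichever form suits its purpose without guessing what the parser already did.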

gkatsev commented 4 years ago

That makes sense. My read of "during processing" is that it means during processing of WebVTT for display, as defined in section 7. At least, that's what I believe it's trying to convey. However, the language is vague enough that it could be read as applying to any processing.
