Open aphillips opened 4 years ago
I tried to find the history behind this, and it seems like it was mostly to make sure that `-->` in the cue timing and settings line is easily distinguishable. But maybe @silviapfeiffer has more background on this.
Personally, I don't know much about Unicode normalization, but it may make sense to allow it for the cue text? Given that WebVTT is currently very strict about not normalizing, relaxing the rule now is a lot easier than tightening it would be if the situation were reversed. I'll have to read up on charmod-norm.
Finally, WebVTT does specify that the file be UTF-8 encoded.
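As a rough illustration of how normalization could interfere with spotting `-->`, here is a minimal Python sketch. The `is_timing_line` helper is hypothetical and only a simplified stand-in for a parser's check, and the input line is contrived; this is not a documented reason from the spec's history, just one way the concern could show up.

```python
import unicodedata

def is_timing_line(line: str) -> bool:
    """Simplified, hypothetical stand-in: treat a line as a timing line if it contains "-->"."""
    return "-->" in line

# Contrived input: ">" immediately followed by U+0338 COMBINING LONG SOLIDUS OVERLAY.
line = "00:00.000 --\u003e\u0338 00:05.000"

# Without normalization, the code points "-", "-", ">" are still adjacent.
print(is_timing_line(line))  # True

# NFC composes ">" + U+0338 into U+226F NOT GREATER-THAN, so "-->" disappears.
normalized = unicodedata.normalize("NFC", line)
print(is_timing_line(normalized))  # False
```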
I think the hint is "such as the matching of cue identifiers" in Addison's comment, and that we wanted matching to be doable by byte equality rather than string equivalence.
For the text of cues, I agree, I can't think why we would care.
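To make the byte-equality point concrete, here is a small Python sketch. The identifiers and the `ids_match` helper are made up for illustration; the only assumption is that matching compares the UTF-8 bytes as-is, with no normalization and no case folding.

```python
import unicodedata

# Two hypothetical cue identifiers that render identically as "café".
precomposed = "caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

def ids_match(a: str, b: str) -> bool:
    """Case-sensitive, non-normalizing comparison of the UTF-8 encoded bytes."""
    return a.encode("utf-8") == b.encode("utf-8")

print(ids_match(precomposed, precomposed))  # True
print(ids_match(precomposed, decomposed))   # False: different byte sequences

# A normalizing comparison would treat the two as the same identifier.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

With byte equality, an author who wants two identifiers to match has to write them with the same code points; the processor never has to carry normalization tables.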
This was brought in quite early on, before my time as editor. From my reading, it doesn't just relate to identifiers but to all of WebVTT parsing. I believe a lot of the parsing rules rely on byte equality, as Dave is saying. Note that WebVTT is not XML but a text format with some markup, and it is quite strict.
What problems do you see arising from being this strict?
The Timed Text Working Group just discussed "WebVTT: over-specification in Conformance: Unicode normalization" (w3c/webvtt#483), and agreed to the following:
SUMMARY: Investigation of impact to continue.
I spent a bit of time looking into this. There's a blog post from Anne van Kesteren about Unicode normalization (https://annevankesteren.nl/2009/02/unicode-normalization) which made a lot of sense to me. Its conclusion is not to normalize. Given that HTML and CSS also do not normalize, we should do the same. I think what we have now fits that criterion.
Also, given that we have embedded CSS and also have HTML syntax in the cues, we should apply this to all the WebVTT text and not just the cue settings line.
I've also re-read the original post. It sounds like the question here is about the specific language used rather than about what is actually being required?
a case-sensitive non-normalizing matching for cue identifiers makes sense to us.
Seems to match what we do, though it's definitely said in different words. Is the ask to update it based on language from charmod?
I think that the key points that the i18n WG wanted to make here are that:
The original trigger for the comment was:
Implementations of this specification must not normalize Unicode text during processing.
This just seemed too general a statement. The note that follows goes on to mention identifiers, which is good, but the normative text is not precise enough. If it said something along the lines of:
Implementations of this specification must not normalize Unicode text in identifiers during processing.
that might address the issue.
Does that help?
if, however, one was to perform a text operation on a bit of natural language text
Sounds like the key issue is that "during processing" is too broad, and apparently excludes some reasonable processing that is out of scope of the spec. Downstream text operations such as searching or indexing the natural language text might well need to do some normalisation, depending on exactly what they are intending to achieve.
Would it help to be really explicit that any hand-over of content originating in the WebVTT document from the WebVTT processor to some other downstream processor will not have had any normalisation applied, so that if they want/need to do it then they know the state of incoming data?
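A minimal sketch of what that hand-over might look like from the downstream side, in Python. The `search_key` and `cue_contains` helpers are hypothetical, and the choice of NFC is an assumption; a real search or indexing component might pick a different form or add case folding.

```python
import unicodedata

def search_key(text: str) -> str:
    """Hypothetical downstream step: normalize to NFC before comparing."""
    return unicodedata.normalize("NFC", text)

def cue_contains(cue_text: str, query: str) -> bool:
    """Search cue text that arrives exactly as authored, normalizing both sides here."""
    return search_key(query) in search_key(cue_text)

# Cue text as handed over by the WebVTT processor: decomposed, not normalized.
cue_text = "Un cafe\u0301 noir"
# Query typed by a user with a precomposed "é".
query = "caf\u00e9"

print(query in cue_text)              # False: raw code points differ
print(cue_contains(cue_text, query))  # True: both sides normalized downstream
```

Because the downstream code knows the incoming text is untouched, it can apply whatever normalization suits its own purpose without guessing what an earlier stage did.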
That makes sense. My read of "during processing" is that it means during processing of WebVTT for display as defined in section 7. At least, that's what I believe it's trying to convey. However, the language is vague enough that it could be read as applying to any processing.
Section:
Conformance: Unicode normalization
https://www.w3.org/TR/webvtt/#unicode-normalization
The I18N WG noticed the above conformance requirement recently and discussed it in recent teleconferences.
Unicode normalization is only one consideration that affects processing of WebVTT and its operations (such as the matching of cue identifiers). While this requirement is consistent with our recommendations and intentions, we'd suggest that you consider a more expansive approach as documented in our document Charmod-norm, particularly section 3.1. Unless there is a special reason that our WG is unaware of, WebVTT is not especially sensitive to variations, so a case-sensitive non-normalizing matching for cue identifiers makes sense to us.
The other concern we have is that this requirement forbids any and all normalization when processing a WebVTT document, not just when performing operations such as cue ID matching. Is there a reason to extend a processing requirement to the entire document? Or to forbid normalization when converting character encoding (although, if memory serves, WebVTT doesn't support encodings other than UTF-8, so this may not apply)?