Closed — r12a closed this issue 6 years ago
When a document is authored, fonts used should provide support, as a minimum, for the characters listed in the sets in B Recommended Character Sets, depending on the language of the text.
This clause is not intended to address fonts used by the authoring software; rather, it is concerned with the character sets supported by client processors.
Specifically, the intent of this clause, worded as a document requirement rather than a processor requirement, is to recommend the character set that a client processor is expected to support if it targets language X.
My reading of this statement has always been that it is advice to document authors about the code points that are safe to use, i.e. it is effectively saying "you shouldn't use characters outside this set unless you know it's safe to do so".
> My reading of this statement has always been that it is advice to document authors about the code points that are safe to use, i.e. it is effectively saying "you shouldn't use characters outside this set unless you know it's safe to do so".
The reason for an author to avoid a character outside the set is to minimize the risk of a processor not supporting the character.
In other words, in order to improve rendering fidelity, an author targeting 'ar' and a device claiming to support 'ar' need to agree on a common character set for 'ar'.
But in this context, 'support "ar"' comes down to which font is being used, no?
Let me back up a bit and ask some questions that may help me understand better:
What is a 'processor'? I was assuming that it is the software that displays the text that is associated with a range of video. Is that correct?
I thought we established that authors could use any font. Is that correct?
I think we also established that processors should be able to flow text within a box, regardless of which font is used. Is that true?
How is improving 'rendering fidelity' important in a scenario where any font could be used and line breaking is determined using UAX14? If it's not, where is it important?
If we compare the set of characters specified in Recommended Character Sets and the set of codepoints for which glyphs are available in the fonts associated with the Reference Fonts, is there some correspondence between the two?
> What is a 'processor'? I was assuming that it is the software that displays the text that is associated with a range of video. Is that correct?
Yes. A processor generally manipulates IMSC1 documents, a presentation processor specifically renders them.
> I thought we established that authors could use any font. Is that correct?
They can specify any font.
> I think we also established that processors should be able to flow text within a box, regardless of which font is used. Is that true?
Yes.
> How is improving 'rendering fidelity' important in a scenario where any font could be used and line breaking is determined using UAX14? If it's not, where is it important?
The position of the line breaks and overall size of the rendered text is important, e.g. for readability and to avoid covering important parts of the scene, etc.
> If we compare the set of characters specified in Recommended Character Sets and the set of codepoints for which glyphs are available in the fonts associated with the Reference Fonts, is there some correspondence between the two?
Well, if the processor supports a character and the character is included in the reference font and the reference font is used (through tts:fontFamily), then the processor uses the metrics of the character of the reference font.
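The role of reference-font metrics described above can be illustrated with a toy width calculation. This is a hedged sketch: the advance widths below are made-up values standing in for metrics that would actually be read from the reference font (e.g. its `hmtx` table), not how any particular renderer works:

```python
# Made-up per-character advance widths (in 1/1000 em) standing in for
# metrics a renderer would read from the reference font. These numbers
# are illustrative only.
ADVANCE = {"i": 278, "l": 278, "m": 833, "W": 944, " ": 278}
DEFAULT_ADVANCE = 556  # fallback for characters not in the table

def line_width_px(text, font_size_px):
    """Pixel width of one line of text at the given font size."""
    units = sum(ADVANCE.get(ch, DEFAULT_ADVANCE) for ch in text)
    return units * font_size_px / 1000
```

Two fonts with different advance tables give different widths for the same string, which is why agreeing on a reference font pins down the metrics a processor uses.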
> How is improving 'rendering fidelity' important in a scenario where any font could be used and line breaking is determined using UAX14? If it's not, where is it important?

> The position of the line breaks and overall size of the rendered text is important, e.g. for readability and to avoid covering important parts of the scene, etc.
Does the processor attempt to determine the size and placement of the box the text is rendered in, or is that just down to the author?
I asked the question about 'rendering fidelity' a little out of context. Let me try to explain a little better what i meant. You said:
> In other words, in order to improve rendering fidelity, an author targeting 'ar' and a device claiming to support 'ar' need to agree on a common character set for 'ar'.
I understand the importance of the box containing the text being the right size and in the right place. Are we talking here about how to determine where a line of text will break given a bounding box, or about how to generate the bounding box?
It's not clear to me how the existence of the reference fonts and character sets is of benefit when any font can be used, for example wide fonts like Verdana or Helvetica vs. narrow fonts like Arial for English. There is often much greater divergence in basic glyph sizes in fonts for other scripts: Khmer fonts can be of different types and very different sizes, Arabic fonts can use more or fewer ligated forms depending on font style, other complex fonts may or may not have the same contextual forms, etc. So the processor needs to be able to work out how to flow the text on the basis of the font and language of the text it is dealing with (with some help about suitable break points). But if the processor has to be capable of determining how to flow the text for fonts other than the reference fonts, what value does the reference font provide wrt 'rendering fidelity'? Why is that needed for 'rendering fidelity'? I don't know if that makes my question clearer.
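For what it's worth, the kind of flow computation being discussed can be sketched in a deliberately simplified form. This is emphatically not the full UAX14 algorithm: break opportunities are assumed only after spaces and hyphens, every character is assumed to be one unit wide, and forced breaks for overlong words are omitted:

```python
# Deliberately simplified line flow. Real implementations apply the
# full UAX14 break rules and per-glyph metrics; here break
# opportunities exist only after spaces and hyphens, and every
# character is one unit wide. Overlong words overflow rather than
# being force-broken.

def break_opportunities(text):
    """Yield indices after which a line break is allowed."""
    for i, ch in enumerate(text):
        if ch in (" ", "-"):
            yield i + 1

def flow(text, box_width):
    """Greedily fill lines no wider than box_width."""
    lines, start, last_break = [], 0, None
    for pos in list(break_opportunities(text)) + [len(text)]:
        if pos - start > box_width and last_break is not None:
            lines.append(text[start:last_break].rstrip())
            start = last_break
        last_break = pos
    if start < len(text):
        lines.append(text[start:].rstrip())
    return lines
```

With `box_width` measured in characters, `flow("hello world again", 11)` yields `["hello", "world again"]`; swapping in a wider or narrower font would move those break points, which is exactly the fidelity question being raised.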
> If we compare the set of characters specified in Recommended Character Sets and the set of codepoints for which glyphs are available in the fonts associated with the Reference Fonts, is there some correspondence between the two?

> Well, if the processor supports a character and the character is included in the reference font and the reference font is used (through tts:fontFamily), then the processor uses the metrics of the character of the reference font.
What I was asking was more along the lines of: is the choice of character sets that are enumerated, and the characters listed for those character sets, a result of checking what code points are supported by the reference fonts? Because I noticed a fairly close correspondence between those and the characters supported by the Arial font (I didn't check the others).
> Is the choice of character sets that are enumerated, and the characters listed for those character sets, a result of checking what code points are supported by the reference fonts?
The choice of character sets that are enumerated is intended to reflect characters commonly used for subtitling in a particular language.
This choice is not related to characters supported by the reference fonts.
Apologies for misunderstanding your question.
> I understand the importance of the box containing the text being the right size and in the right place. Are we talking here about how to determine where a line of text will break given a bounding box, or about how to generate the bounding box?
Oh. The recommended character sets are not intended to improve the rendering fidelity in terms of placement, line breaks, etc...
The recommended character sets are instead intended to ensure that an implementation that claims to support a particular language implements support for characters likely to be used for subtitling in that language, i.e. "rendering fidelity" here means "the character being rendered at all".
Meeting 2017-06-22: @r12a ping? The group thinks all your questions have been answered - please could you confirm, or if you still think that changes to the spec are needed, let us know what they are? Would it be possible to respond by Monday 26th June so we can close this off by Thursday 29th please, allowing time for discussion, proposals etc?
Another issue seems to have gotten lost in a parallel discussion and I've been asked to mention it again in this thread.
The "recommended characters" lists are problematic in at least two ways. First, while it is easy to justify a requirement for the characters needed to write a given language plus ASCII (because ASCII is needed to write, e.g., URIs), there does not seem to be any justification for requiring most of the rest of the Latin-based characters. Doing so is a step backward toward a time in which the web was assumed to be adequately internationalized if it supported the characters used in western European languages. Second, the list of languages is not complete, with languages and writing systems used by millions of users not listed. One fix would be to tell people where to look to find the characters needed to write a given language that is not listed, but, if that alternate source exists and is reliable, the appendix B table is probably unnecessary.
thanks.
@klensin See https://github.com/w3c/imsc/issues/243, the character sets of Annex B are a superset of the Unicode CLDR character sets, which address a wide range of languages. The long-term objective is in fact to defer to CLDR entirely, and remove Tables 1 and 2 -- see issue 8915.
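A hedged sketch of how an authoring check against such per-language sets might look: the recommended set is modelled as printable ASCII plus the language's exemplar characters. The exemplar sets below are tiny hand-picked samples for illustration only; a real tool would read the actual Unicode CLDR exemplar data:

```python
# Sketch: flag characters that fall outside a per-language
# "widely supported" set, modelled here as printable ASCII plus the
# language's exemplar characters. The EXEMPLARS entries are tiny
# illustrative samples, not real CLDR data.
import string

EXEMPLARS = {
    "fr": set("àâæçéèêëîïôœùûüÿ"),
    "de": set("äöüß"),
}

def outside_recommended(text, lang):
    """Characters in text not covered by ASCII plus lang's exemplars."""
    allowed = set(string.printable) | EXEMPLARS.get(lang, set())
    return sorted({ch for ch in text if ch not in allowed})
```

So `outside_recommended("café", "fr")` is empty, while the same string checked against "de" flags the `é` — the kind of advisory signal an authoring tool could surface without forbidding anything.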
Thanks to @klensin for bringing me back to this issue, and making me think about it again. I'm not at all happy with the way it is currently worded, since even though it is aimed at authors it creates a cycle which makes it appear that only the named characters should be supported by applications. (If authors are not going to use them, why would applications support more than the basic set of characters?)
Here's something closer to what i would have preferred to see, if this section remains. The uppercasing is purely to show where the changes are.
PROCESSORS ARE REQUIRED TO SUPPORT UNICODE, AND SHOULD THEREFORE SUPPORT ALL NEEDED CHARACTERS FOR ANY LANGUAGE, BUT SOME LEGACY APPLICATIONS HAVE RESTRICTED SUPPORT, AND SO WHEN authoring textual content FOR LEGACY ENVIRONMENTS, authors MAY WANT to select CHARACTERS from THE sets BELOW based on the language indicated using xml:lang. The idea is to increase the confidence that the text will be presented correctly by LEGACY implementations targeting specific locales. ... Table 1 captures the set of characters EXPECTED to be available to authors across all languages. The terms used in the table are defined in [UNICODE]. ... Table 2 LISTS supplementary character setS that have proven RELIABLE in LEGACY captioning and subtitling applications for a number of selected languages. Table 2 is non-exhaustive, and will be extended as INFORMATION BECOMES AVAILABLE.
> Unicode CLDR character sets, which address a wide range of languages. The long-term objective is in fact to defer to CLDR entirely, and remove Tables 1 and 2
But defer to CLDR for what? If authors are to expect that CLDR is always safe ground, then there's a strong implication that applications will be expected to conform to CLDR in terms of which characters are to be supported. As a minimum that may sound useful, but there are some issues:
I don't have any problem with the suggestion at https://github.com/w3c/imsc/issues/236#issuecomment-367713408 but I do wonder if some of the (especially legacy) implementers are clear about the difference between a processor supporting unicode (not so hard, each code point is just a number, right? [ducks for cover]) and support for fonts that define glyphs for all the code points that are needed. This is one of those situations where both the processing code and the (font) data supplied to it have to support the functionality for it to work for the end user. We don't say anything about the requirement for processors to support combining characters, for example, but it's taken as obvious I think. Also in a memory and processor constrained system, such as a low cost TV-attached device, requiring support for arbitrary fonts might be considered hard, since the size and complexity of font files is unbounded. Just a bunch of thoughts, no particular thought of adding further text to the spec.
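The distinction drawn here between supporting a code point and having a glyph for it can be illustrated with a small sketch. The cmap dictionary below is a made-up stand-in for a font's character-to-glyph table (with a real font file, something like `fontTools.ttLib.TTFont(path)["cmap"].getBestCmap()` would supply it):

```python
# A made-up character-to-glyph map standing in for a font's cmap
# table. A processor can pass any Unicode code point through; actually
# drawing it additionally requires the font to map it to a glyph.
FAKE_CMAP = {ord(c): "glyph_" + c for c in "abcdefghijklmnopqrstuvwxyz "}

def missing_glyphs(text, cmap=FAKE_CMAP):
    """Characters the text contains but the font cannot draw."""
    return sorted({ch for ch in text if ord(ch) not in cmap})
```

Here the text is perfectly valid Unicode either way; what varies is whether the font data can render it, which is the code-vs-data point made above.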
The Working Group just discussed "Should the character sets be minimum *font* requirements?" (imsc#236).
Action recorded at #338
The important thing is to avoid creating bottlenecks - data formats or implementations of processors - that are only ever capable of passing through a subset of characters.
Any distributed system containing elements with such a bottleneck function sharply limits overall functionality and, worse, makes it impossible to incrementally improve performance.
This is different from having a limited font. You may not be able to display all possible data, but updating the font fixes that: without change to the rest of the system.
> This is different from having a limited font.
Yes. The objective of the IMSC specification is to recommend the sets of characters for which an implementation should provide glyphs, based on the language(s) that the implementation claims to support.
The IMSC specification does not allow an implementation to support only a subset of the Unicode character set, i.e. there are no provisions for rejecting a document based on the Unicode characters present within the document.
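A minimal sketch of that policy, assuming a hypothetical per-language recommended set: characters outside the set may be flagged as warnings, but the document is always accepted:

```python
# Sketch of the policy above: warn about characters outside a
# recommended set, but never reject the document on that basis.
# RECOMMENDED is a hypothetical per-language set (printable ASCII
# stands in for "en" here).
RECOMMENDED = {"en": set(map(chr, range(0x20, 0x7F)))}

def check_document(text, lang):
    """Return (accepted, warnings); accepted is always True."""
    allowed = RECOMMENDED.get(lang)
    warnings = sorted({ch for ch in text if allowed and ch not in allowed})
    return True, warnings
```

An unknown language simply produces no warnings; in no case does the checker refuse the document, matching the "no provisions for rejecting" statement.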
Useful discussion with i18n just now, @palemieux to draft text to address the concerns raised. Minutes: https://www.w3.org/2018/03/08-i18n-minutes.html#item02
I was going to poke @palemieux to let me know about the update, per I18N-ACTION-700, but I see the commit. Thanks!
Hello IMSC. In our teleconference last week, the I18N WG tasked me with writing back about this change. In the main we are pleased with these changes; however, we are still concerned about the characterization of these sets of characters as "safe". The WG would prefer that you used a phrase such as "widely supported" or "widely accepted".
Please let us know if you agree or would prefer to discuss (here or in teleconference)
@aphillips Personally happy to make the change from "safe" to "widely supported" or "widely accepted". Have you considered "common" instead? It would be less of a mouthful.
@aphillips See updated PR.
We discussed this in teleconference and are satisfied with this change. Thank you very much for your help!
7.2 Recommended Character Sets https://www.w3.org/TR/ttml-imsc1.0.1/#recommended-character-sets
Since UTF-8 is being used, those characters, and all the other characters in Unicode, are always available to authors.
Would it not be better to say: When a document is authored, fonts used should provide support, as a minimum, for the characters listed in the sets in B Recommended Character Sets, depending on the language of the text.