w3c / imsc

TTML Profiles for Internet Media Subtitles and Captions (IMSC)
https://w3c.github.io/imsc/

Should the character sets be minimum *font* requirements? #236

Closed r12a closed 6 years ago

r12a commented 7 years ago

7.2 Recommended Character Sets https://www.w3.org/TR/ttml-imsc1.0.1/#recommended-character-sets

A Document Instance SHOULD be authored using characters selected from the sets specified in B. Recommended Character Sets.

Since UTF-8 is being used, those characters, and all the other characters in Unicode, are always available to authors.

Would it not be better to say: When a document is authored, fonts used should provide support, as a minimum, for the characters listed in the sets in B Recommended Character Sets, depending on the language of the text.
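To make the proposed reading concrete, here is a minimal sketch (Python; the `cmap` is a stand-in dict mapping codepoints to glyph names rather than a table read from a real font file, which a tool such as fontTools could supply) of the kind of font-coverage check an authoring tool might perform:

```python
# Sketch only: check whether the chosen font provides a glyph for every
# character in the authored text. A real check would read the cmap from
# a font file; here it is a hypothetical in-memory dict.

def missing_glyphs(text: str, cmap: dict) -> set:
    """Return the characters of `text` for which the font has no glyph."""
    return {ch for ch in text if ord(ch) not in cmap}

# Hypothetical minimal Latin cmap, for illustration only.
latin_cmap = {ord(ch): f"glyph_{ch}" for ch in "abcdefghijklmnopqrstuvwxyz "}

print(missing_glyphs("hello world", latin_cmap))  # set()
print(missing_glyphs("héllo", latin_cmap))        # {'é'}
```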

palemieux commented 7 years ago

When a document is authored, fonts used should provide support, as a minimum, for the characters listed in the sets in B Recommended Character Sets, depending on the language of the text.

This clause is not intended to address fonts used by the authoring software, but rather the character sets supported by client processors.

Specifically, the intent of this clause, worded as a document requirement rather than a processor requirement, is to recommend the character set that a client processor is expected to support if it targets language X.

nigelmegitt commented 7 years ago

My reading of this statement has always been that it is advice to document authors about the code points that are safe to use, i.e. it is effectively saying "you shouldn't use characters outside this set unless you know it's safe to do so".

palemieux commented 7 years ago

My reading of this statement has always been that it is advice to document authors about the code points that are safe to use, i.e. it is effectively saying "you shouldn't use characters outside this set unless you know it's safe to do so".

The reason for an author to avoid a character outside the set is to minimize the risk of a processor not supporting the character.

In other words, in order to improve rendering fidelity, an author targeting 'ar' and a device claiming to support 'ar' need to agree on a common character set for 'ar'.

r12a commented 7 years ago

But in this context, 'support "ar"' comes down to which font is being used, no?

Let me back up a bit and ask some questions that may help me understand better:

  1. What is a 'processor'? I was assuming that it is the software that displays the text that is associated with a range of video. Is that correct?

  2. I thought we established that authors could use any font. Is that correct?

  3. I think we also established that processors should be able to flow text within a box, regardless of which font is used. Is that true?

  4. How is improving 'rendering fidelity' important in a scenario where any font could be used and line breaking is determined using UAX14? If it's not, where is it important?

  5. If we compare the set of characters specified in Recommended Character Sets and the set of codepoints for which glyphs are available in the fonts associated with the Reference Fonts, is there some correspondence between the two?

palemieux commented 7 years ago

What is a 'processor'? I was assuming that it is the software that displays the text that is associated with a range of video. Is that correct?

Yes. A processor generally manipulates IMSC1 documents; a presentation processor specifically renders them.

I thought we established that authors could use any font. Is that correct?

They can specify any font.

I think we also established that processors should be able to flow text within a box, regardless of which font is used. Is that true?

Yes.

How is improving 'rendering fidelity' important in a scenario where any font could be used and line breaking is determined using UAX14? If it's not, where is it important?

The position of the line breaks and the overall size of the rendered text are important, e.g. for readability and to avoid covering important parts of the scene.

If we compare the set of characters specified in Recommended Character Sets and the set of codepoints for which glyphs are available in the fonts associated with the Reference Fonts, is there some correspondence between the two?

Well, if the processor supports a character and the character is included in the reference font and the reference font is used (through tts:fontFamily), then the processor uses the metrics of the character of the reference font.
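For illustration, a minimal content fragment showing the mechanism described above (hedged: the `tts:fontFamily` attribute, the namespaces, and the generic family name `proportionalSansSerif` come from TTML/IMSC; the text content is invented for the example). When the author selects a generic family this way, a presentation processor uses the metrics of the corresponding reference font:

```xml
<!-- Illustrative fragment only: a paragraph whose font family resolves
     to a generic family, for which IMSC designates a reference font. -->
<p xmlns="http://www.w3.org/ns/ttml"
   xmlns:tts="http://www.w3.org/ns/ttml#styling"
   tts:fontFamily="proportionalSansSerif">
  Example subtitle text
</p>
```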

r12a commented 7 years ago

How is improving 'rendering fidelity' important in a scenario where any font could be used and line breaking is determined using UAX14? If it's not, where is it important?

The position of the line breaks and the overall size of the rendered text are important, e.g. for readability and to avoid covering important parts of the scene.

Does the processor attempt to determine the size and placement of the box the text is rendered in, or is that just down to the author?

I asked the question about 'rendering fidelity' a little out of context. Let me try to explain a little better what i meant. You said:

In other words, in order to improve rendering fidelity, an author targeting 'ar' and a device claiming to support 'ar' need to agree on a common character set for 'ar'.

I understand the importance of the box containing the text being the right size and in the right place. Are we talking here about how to determine where a line of text will break given a bounding box, or about how to generate the bounding box?

It's not clear to me how the existence of the reference fonts and character sets is of benefit when any font can be used, for example wide fonts like Verdana or Helvetica vs. narrow fonts like Arial for English. There is often much greater divergence in basic glyph sizes in fonts for other scripts: Khmer fonts can be of different types and very different sizes, Arabic fonts can use more or fewer ligated forms depending on font style, other complex fonts may or may not have the same contextual forms, etc. So the processor needs to be able to work out how to flow the text on the basis of the font and language of the text it is dealing with (with some help about suitable break points). But if the processor has to be capable of determining how to flow the text for fonts other than the reference fonts, what value does the reference font provide wrt 'rendering fidelity'? Why is it needed for 'rendering fidelity'? Don't know if that makes my question clearer.

r12a commented 7 years ago

If we compare the set of characters specified in Recommended Character Sets and the set of codepoints for which glyphs are available in the fonts associated with the Reference Fonts, is there some correspondence between the two?

Well, if the processor supports a character and the character is included in the reference font and the reference font is used (through tts:fontFamily), then the processor uses the metrics of the character of the reference font.

What i was asking was more along the lines of: is the choice of character sets that are enumerated, and the characters listed for those character sets, a result of checking which code points are supported by the reference fonts? I ask because i noticed a fairly close correspondence between those and the characters supported by the Arial font (i didn't check the others).

palemieux commented 7 years ago

Is the choice of character sets that are enumerated, and the characters listed for those character sets, a result of checking which code points are supported by the reference fonts?

The choice of character sets that are enumerated is intended to reflect characters commonly used for subtitling in a particular language.

This choice is not related to characters supported by the reference fonts.

Apologies for misunderstanding your question.

palemieux commented 7 years ago

I understand the importance of the box containing the text being the right size and in the right place. Are we talking here about how to determine where a line of text will break given a bounding box, or about how to generate the bounding box?

Oh. The recommended character sets are not intended to improve rendering fidelity in terms of placement, line breaks, etc.

The recommended character sets are instead intended to ensure that an implementation that claims to support a particular language implements support for the characters likely to be used for subtitling in that language, i.e. "rendering fidelity" means "the character being rendered at all".

nigelmegitt commented 7 years ago

Meeting 2017-06-22: @r12a ping? The group thinks all your questions have been answered - please could you confirm, or if you still think that changes to the spec are needed, let us know what they are? Would it be possible to respond by Monday 26th June so we can close this off by Thursday 29th please, allowing time for discussion, proposals etc?

klensin commented 6 years ago

Another issue seems to have gotten lost in a parallel discussion and I've been asked to mention it again in this thread.

The "recommended characters" lists are problematic in at least two ways. First, while it is easy to justify a requirement for the characters needed to write a given language plus ASCII (because ASCII is needed to write, e.g., URIs), there does not seem to be any justification for requiring most of the rest of the Latin-based characters. Doing so is a step backward toward a time in which the web was assumed to be adequately internationalized if it supported the characters used in western European languages. Second, the list of languages is not complete, with languages and writing systems used by millions of users not listed. One fix would be to tell people where to look to find the characters needed to write a given language that is not listed, but, if that alternate source exists and is reliable, the appendix B table is probably unnecessary.

thanks.

palemieux commented 6 years ago

@klensin See https://github.com/w3c/imsc/issues/243, the character sets of Annex B are a superset of the Unicode CLDR character sets, which address a wide range of languages. The long-term objective is in fact to defer to CLDR entirely, and remove Tables 1 and 2 -- see issue 8915.

r12a commented 6 years ago

Thanks to @klensin for bringing me back to this issue, and making me think about it again. I'm not at all happy with the way it is currently worded, since even though it is aimed at authors it creates a cycle which makes it appear that only the named characters should be supported by applications. (If authors are not going to use them, why would applications support more than the basic set of characters?)

Here's something closer to what i would have preferred to see, if this section remains. The uppercasing is purely to show where the changes are.

PROCESSORS ARE REQUIRED TO SUPPORT UNICODE, AND SHOULD THEREFORE SUPPORT ALL NEEDED CHARACTERS FOR ANY LANGUAGE, BUT SOME LEGACY APPLICATIONS HAVE RESTRICTED SUPPORT, AND SO WHEN authoring textual content FOR LEGACY ENVIRONMENTS, authors MAY WANT to select CHARACTERS from THE sets BELOW based on the language indicated using xml:lang. The idea is to increase the confidence that the text will be presented correctly by LEGACY implementations targeting specific locales. ... Table 1 captures the set of characters EXPECTED to be available to authors across all languages. The terms used in the table are defined in [UNICODE]. ... Table 2 LISTS supplementary character setS that have proven RELIABLE in LEGACY captioning and subtitling applications for a number of selected languages. Table 2 is non-exhaustive, and will be extended as INFORMATION BECOMES AVAILABLE.

r12a commented 6 years ago

Unicode CLDR character sets, which address a wide range of languages. The long-term objective is in fact to defer to CLDR entirely, and remove Tables 1 and 2

But defer to CLDR for what? If authors are to expect that CLDR is always safe ground, then there's a strong implication that applications will be expected to conform to CLDR in terms of which characters are to be supported. As a minimum that may sound useful, but there are some issues:

  1. applications shouldn't be or remain so focused on supporting specific sets of characters. They should support opentype fonts and rendering so that any set of characters can be presented appropriately if an author provides an appropriate font. Promoting the CLDR character sets in this way surely gives applications licence to continue in very old fashioned practices that are counter to good international support.
  2. while we should certainly hope that CLDR grows and grows, it is currently (after several years in the making) still very limited in coverage. I created my own character-by-language app recently, and it uses CLDR as one of its sources. CLDR provided information for only 195 out of the 446 languages covered by my app, and 446 is still a small number.
  3. the CLDR data, though perhaps the best we currently have, is not always complete or reliable - neither in terms of language coverage, nor in terms of characters specified for a given language. That's another reason to be very careful about sounding prescriptive in terms of what should and shouldn't be supported.
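To make the CLDR discussion above concrete: CLDR publishes per-language "exemplar" character sets in UnicodeSet notation (e.g. `[a b d-f]`). A toy Python sketch of reading the simple form of that notation (single characters and ranges only; escapes and multi-character strings in braces, which real CLDR data also uses, are deliberately ignored):

```python
# Toy parser for the simple form of CLDR exemplar-set notation,
# e.g. "[a b d-f]". Not a full UnicodeSet implementation.

def parse_exemplars(spec: str) -> set:
    items = spec.strip().lstrip("[").rstrip("]").split()
    chars = set()
    for item in items:
        if len(item) == 3 and item[1] == "-":  # range such as "d-f"
            chars.update(chr(c) for c in range(ord(item[0]), ord(item[2]) + 1))
        elif len(item) == 1:                   # single character
            chars.add(item)
        # braces and escapes are deliberately unhandled in this sketch
    return chars

print(sorted(parse_exemplars("[a b d-f]")))  # ['a', 'b', 'd', 'e', 'f']
```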

nigelmegitt commented 6 years ago

I don't have any problem with the suggestion at https://github.com/w3c/imsc/issues/236#issuecomment-367713408 but I do wonder if some of the (especially legacy) implementers are clear about the difference between a processor supporting Unicode (not so hard, each code point is just a number, right? [ducks for cover]) and support for fonts that define glyphs for all the code points that are needed. This is one of those situations where both the processing code and the (font) data supplied to it have to support the functionality for it to work for the end user.

We don't say anything about the requirement for processors to support combining characters, for example, but I think it's taken as obvious. Also, in a memory- and processor-constrained system, such as a low-cost TV-attached device, requiring support for arbitrary fonts might be considered hard, since the size and complexity of font files is unbounded. Just a bunch of thoughts, no particular thought of adding further text to the spec.

css-meeting-bot commented 6 years ago

The Working Group just discussed Should the character sets be minimum *font* requirements? imsc#236.

The full IRC log of that discussion:
<nigel> Topic: Should the character sets be minimum *font* requirements? imsc#236
<nigel> github: https://github.com/w3c/imsc/issues/236
<nigel> Pierre: This is an issue that we seem to come back to regularly.
<nigel> Glenn: I predicted this would happen.
<nigel> Pierre: My challenge is I don't understand what the commenter wants.
<nigel> .. TTML1 and TTML2 already require support for Unicode characters, so an implementation
<nigel> .. cannot reject a document because it does not accept a unicode character.
<nigel> Glenn: That's correct. It need not have rendering support or fonts.
<nigel> Pierre: The spec is trying to be helpful to implementers who do want to support particular
<nigel> .. languages or scripts. My suggestion is to organise a call with i18n and try to get down
<nigel> .. to what they are trying to achieve. We seem to be talking past each other.
<nigel> Nigel: I can take an action to try to set that up.
<nigel> Glenn: My response would be "No. Font and unicode rendering capabilities are an implementation dependent property in TTML1."
<nigel> .. Of course in TTML2 with downloadable fonts that opens things up a bit, but generally
<nigel> .. you cannot support a complex script without adding code, so just adding a Mongolian
<nigel> .. font won't cut it.
<nigel> Glenn: I would suggest focusing on the characters aspect not the rendering aspect.
<nigel> Philippe: Can I try asking a few questions to see if I can channel Richard?
<nigel> .. Right now are you saying you don't expect implementations to support UTF-8 necessarily?
<nigel> Pierre: No, IMSC1 requires support for UTF-8 and TTML requires support for Unicode.
<nigel> Philippe: Are you saying you should not use a character outside the encoding you are using?
<nigel> Pierre: No, the purpose of the annex is to help an implementer select which particular
<nigel> .. character they should render. No one renders all Unicode characters. They make a decision
<nigel> .. on which set of characters they render based on requirements such as territory.
<nigel> .. This section is to help implementers.
<nigel> Philippe: Right now it is worded from the point of view of the author not the implementer.
<nigel> Pierre: Originally it was a Processor recommendation, and I think Dave Singer got really
<nigel> .. upset by that, so we changed it to an authoring requirement.
<nigel> .. In my mind this is really a processor requirement.
<nigel> Philippe: r12a's suggestion is also a suggestion for implementers.
<nigel> Pierre: Yes, but if we change to a processor requirement we might get the same objections
<nigel> .. as we had before. For all intents and purposes they are equivalent.
<nigel> Philippe: I think we should invite @r12a to this conversation.
<nigel> Pierre: We're definitely not saying that some scripts do not need to be supported.
<nigel> Glenn: "Supports unicode" is an overloaded phrase. It just means support for the character
<nigel> .. semantics irrelevant of their formatting.
<nigel> Philippe: In that case the treatment would be the same for all characters. Why only those?
<nigel> Glenn: Yes, that's why I suggested not having this section in the first place.
<nigel> Nigel: I'll invite him to a discussion.
nigelmegitt commented 6 years ago

Action recorded at #338

asmusf commented 6 years ago

The important thing is to avoid creating bottlenecks - data formats or implementations of processors - that are only ever capable of passing through a subset of characters.

Any distributed system that contains elements that have such a bottleneck function sharply limits overall functionality and, worse, makes it impossible to incrementally improve performance.

This is different from having a limited font. You may not be able to display all possible data, but updating the font fixes that, without changes to the rest of the system.

palemieux commented 6 years ago

This is different from having a limited font.

Yes. The objective of the IMSC specification is to recommend the sets of characters for which an implementation should provide glyphs, based on the language(s) that the implementation claims to support.

The IMSC specification does not allow an implementation to support only a subset of the Unicode character set, i.e. there are no provisions for rejecting a document based on the Unicode characters present within it.
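A sketch of the distinction being drawn here (all names are illustrative, not from any IMSC implementation): the processor never rejects a document over the characters it contains; at worst, a character without a glyph renders as a visible fallback, and updating the font fixes that without touching the rest of the system.

```python
# Illustrative only: a presentation step that never rejects its input.
# Characters outside the supported set render as a fallback ("tofu"),
# mirroring the rule that documents cannot be rejected for the Unicode
# characters they contain.

FALLBACK = "\u25AF"  # WHITE VERTICAL RECTANGLE, a common "tofu" stand-in

def render(text: str, supported: set) -> str:
    """Map each character to itself if supported, else to the fallback."""
    return "".join(ch if ch in supported else FALLBACK for ch in text)

supported = set("abcdefghijklmnopqrstuvwxyz ")
print(render("hola señor", supported))  # 'ñ' becomes the fallback glyph
```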

nigelmegitt commented 6 years ago

Useful discussion with i18n just now, @palemieux to draft text to address the concerns raised. Minutes: https://www.w3.org/2018/03/08-i18n-minutes.html#item02

aphillips commented 6 years ago

I was going to poke @palemieux to let me know about the update, per I18N-ACTION-700, but I see the commit. Thanks!

aphillips commented 6 years ago

Hello IMSC. In our teleconference last week, the I18N WG tasked me with writing back about this change. In the main we are pleased with these changes; however, we are still concerned about the characterization of these sets of characters as "safe". The WG would prefer if you used a phrase such as "widely supported" or "widely accepted".

Please let us know if you agree or would prefer to discuss (here or in a teleconference).

palemieux commented 6 years ago

Rendered at https://rawgit.com/w3c/imsc/issue-236-clarify-recommended-character-set-objectives/imsc1/spec/ttml-ww-profiles.html

palemieux commented 6 years ago

@aphillips Personally happy to make the change from "safe" to "widely supported" or "widely accepted". Have you considered "common" instead? It would be less of a mouthful.

palemieux commented 6 years ago

@aphillips See updated PR.

aphillips commented 6 years ago

We discussed this in teleconference and are satisfied with this change. Thank you very much for your help!