w3c / bp-i18n-specdev

Internationalization Best Practices for Spec Developers
https://w3c.github.io/bp-i18n-specdev/

legacy grapheme clusters vs extended grapheme clusters #1

Closed: frivoal closed this issue 2 weeks ago

frivoal commented 9 years ago

"Grapheme cluster" is often the appropriate way to define "character" in specifications (such as CSS) that care about what readers visually identify as a character.

Maybe the spec should point that out, with a link to the relevant part of Unicode (http://unicode.org/reports/tr29/, I presume). There is already a mention of that in the "Indexing strings" section, but not in the "Choosing a definition of 'character'" section, where it would be particularly relevant.

Also, providing a specific definition requires picking between "legacy grapheme clusters" and "extended grapheme clusters", and I am not sure how to do that. Guidance on this topic would be appreciated.

r12a commented 9 years ago

Good points, Florian. I'll look at adding that information.

We usually recommend extended grapheme clusters only.

frivoal commented 9 years ago

That's typically been what I've guessed should be the correct answer, but without really knowing why. And this specification looks like a great place to enlighten people in my situation.

aphillips commented 2 years ago

Is this addressed by the introduction to section 4?

xfq commented 11 months ago

There is no mention of legacy grapheme clusters in specdev at the moment, and I think this paragraph in UAX #29 answers Florian's question:

An extended grapheme cluster is the same as a legacy grapheme cluster, with the addition of some other characters. The continuing characters are extended to include all spacing combining marks, such as the spacing (but dependent) vowel signs in Indic scripts. For example, this includes U+093F ( ि ) DEVANAGARI VOWEL SIGN I. The extended grapheme clusters should be used in implementations in preference to legacy grapheme clusters, because they provide better results for Indic scripts such as Tamil or Devanagari in which editing by orthographic syllable is typically preferred. For scripts such as Thai, Lao, and certain other Southeast Asian scripts, editing by visual unit is typically preferred, so for those scripts the behavior of extended grapheme clusters is similar to (but not identical to) the behavior of legacy grapheme clusters.
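To make the difference concrete, here is a rough sketch of how the two definitions treat the U+093F example from the quoted paragraph. It is not an implementation of the UAX #29 rules: the function name `clusters` is made up for illustration, and it approximates Grapheme_Extend with the general categories Mn and Me, and SpacingMark with Mc. It ignores everything else in UAX #29 (Hangul syllables, emoji ZWJ sequences, regional indicators, Prepend, CR LF, and the exceptions to those category approximations).

```python
import unicodedata


def clusters(text, extended=True):
    """Split text into approximate grapheme clusters.

    Assumption for illustration only: Mn/Me stand in for Grapheme_Extend
    and Mc stands in for SpacingMark. Real UAX #29 segmentation has more
    rules and uses the Grapheme_Cluster_Break property instead.
    """
    continuing = {"Mn", "Me"}
    if extended:
        # Extended clusters also absorb spacing combining marks.
        continuing.add("Mc")

    out = []
    for ch in text:
        if out and unicodedata.category(ch) in continuing:
            out[-1] += ch  # attach the mark to the preceding cluster
        else:
            out.append(ch)  # start a new cluster
    return out


# U+0915 DEVANAGARI LETTER KA followed by U+093F DEVANAGARI VOWEL SIGN I
ki = "\u0915\u093F"
print(len(clusters(ki, extended=False)))  # 2: the vowel sign is split off
print(len(clusters(ki, extended=True)))   # 1: one orthographic syllable
```

With the legacy-style behavior, a caret or delete operation would land between the consonant and its dependent vowel sign; with the extended-style behavior, the pair is treated as one unit, which matches the "editing by orthographic syllable" preference described above.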

IMHO this kind of detail should be mentioned by charmod, not in specdev.

aphillips commented 1 month ago

This doesn't belong in charmod-norm. There is some material about graphemes in charmod proper, but that document pre-dates extended grapheme clusters and there is a zero percent chance that we'll revise it 😉. I'm adding another small note to explicitly mention EDGCs.