w3c/font-text-cg

Segmenting Brahmi-derived scripts involving conjuncts #53

Open r12a opened 3 years ago

r12a commented 3 years ago

One of the key issues for typographic handling of Indian scripts on the Web is how to handle conjuncts, since they don't map to grapheme clusters. But the situation is complicated by the fact that the exact same underlying sequence of code points may need to be handled differently, depending on what the font does with it.

I was discussing this privately with @kojiishi but figured it would be useful to open the discussion to this group. I'll include some things here from that conversation. The discussion was mostly related to initial-letter selection and letter-spacing.

I'll include some explanations in the next comment, but try to capture the issue in a nutshell here.

Currently for Devanagari and Bengali, ::first-letter in Blink and Webkit selects the whole consonant cluster (plus combining characters) as a unit, whereas Gecko instead selects the initial grapheme cluster in most conjuncts (in particular, half forms). The Blink/Webkit approach is great for handling conjuncts, but doesn't allow first-letter or letter-spacing to separate the consonants in a cluster when they don't form a conjunct. The opposite applies to Gecko's approach.
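To make the distinction concrete, here is a minimal sketch, in Python with the third-party regex module, of what default grapheme cluster segmentation yields for a Devanagari conjunct. (The result depends on which Unicode version the library implements; under the rules current when this thread was written, the conjunct splits after the virama.)

```python
# A minimal sketch of default (UAX #29) grapheme cluster segmentation,
# using the third-party "regex" module, whose \X matches an extended
# grapheme cluster.
import regex

word = "\u0915\u094D\u092F\u093E"  # क्या: KA, VIRAMA, YA, VOWEL SIGN AA
clusters = regex.findall(r"\X", word)
print(clusters)
# Under older UAX #29 rules this prints ['क्', 'या'], so "the initial
# grapheme cluster" is क् (KA + virama), not the whole conjunct क्या.
```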

Tamil, however, is treated slightly differently. (Modern Tamil has only two or three conjuncts but plenty of clusters, and uses a visible virama as the default.) Although Blink/Webkit do just select the initial grapheme cluster for Tamil, this time they break the few conjuncts that should be kept together. Gecko's initial grapheme cluster selection works well for Tamil, and this time it also manages to recognise and keep the Tamil conjuncts together.

I can't see a way of resolving this problem by focusing on code points. It could only be resolved by interrogating what the font is doing.

However, until we have a clever fix, the code points are all we have. It seems to me that, in the interim, less harm is done by preventing visible-virama clusters from splitting in certain scripts than by allowing conjuncts to split. So the Blink/Webkit approach for Devanagari/Bengali seems a useful interim default, despite its side-effects. I'd be interested to hear whether others agree.

Yesterday I created some gap-analysis content related to initial-letter selection, which links to tests and describes the results:

First-letter:
Devanagari: https://www.w3.org/TR/deva-gap/#issue94_initials
Bengali: https://www.w3.org/TR/beng-gap/#issue115_initials
Tamil: https://www.w3.org/TR/taml-gap/#issue116_initials

Letter-spacing:
Devanagari: https://www.w3.org/TR/deva-gap/#issue117_spacing
Bengali: https://www.w3.org/TR/beng-gap/#issue117_spacing
(Tamil works fine, except for a bug with one form of the shri conjunct.)

It's clear from the results that the Blink/Webkit engine uses different algorithms for first-letter and letter-spacing.

r12a commented 3 years ago

Here's a little more background on the situation.

Consonant clusters in Indic scripts like Devanagari, Bengali, and Tamil kill the inherent vowel with a virama. A consonant cluster (ie. more than one consonant without intervening vowels) can be rendered in one of two ways:

a. as a conjunct, which means that the glyphs are munged together to some degree. In this case the virama is not shown (that's important).
b. as a sequence of consonant characters with a visible virama.

The former is very common for Devanagari and Bengali. The latter is the default for modern Tamil, apart from a small number of exceptions.

If there's no visible virama, the consonant cluster must not be broken. For example, a vowel-sign that is placed before the base, and which is pronounced after all the consonants, must appear before the first consonant in the whole cluster.

If there is a visible virama, the consonant cluster can be broken, and a pre-base vowel-sign will appear between the consonants, just before the last.

There are also 3-consonant and occasionally 4-consonant clusters, and the same rules apply to them.
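To see that the two renderings share a single underlying encoding, here is a small Python sketch (standard library only; the example clusters are my own choices) listing the code points of a Devanagari cluster that most fonts render as a conjunct and a Tamil cluster that is rendered with a visible virama (pulli). Structurally they are the same consonant + virama + consonant sequence.

```python
import unicodedata

# Same structure, different default rendering: the Devanagari cluster
# usually appears as the conjunct क्त, while the Tamil cluster க்க
# keeps its pulli (virama) visible.
for label, cluster in [("Devanagari", "\u0915\u094D\u0924"),
                       ("Tamil", "\u0B95\u0BCD\u0B95")]:
    codes = " ".join(f"U+{ord(c):04X} {unicodedata.name(c)}" for c in cluster)
    print(f"{label}: {codes}")
```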

So Chrome is treating the conjuncts as a single unit, which is great.

But it's also treating sequences with a visible virama as a single unit, which isn't as great.

But then this is a difficult problem to solve, because (crucially) the sequence of code points for clusters that are conjuncts and those that aren't (ie. have a visible virama) is identical. The difference only arises as the font does its rendering and decides whether or not to hide the virama. And some fonts have more conjunct glyphs than others, so it varies by font (see an example at https://r12a.github.io/scripts/devanagari/#visiblevirama). So unless we can tell, for the particular font being used in that instance, how it renders the cluster, we can't decide whether to keep everything as a single unit or allow it to break after the virama.
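One conceivable way of interrogating the font is to shape the cluster with HarfBuzz and check whether the virama's nominal glyph survives in the output. Below is a rough sketch using the uharfbuzz Python bindings; the font path and helper name are placeholders of mine, and (as discussed further down this thread) a font can ligate a visible virama into another glyph, in which case this heuristic would misreport it.

```python
import uharfbuzz as hb

def virama_survives(font_path: str, cluster: str, virama: str) -> bool:
    """Heuristic: does the virama's nominal glyph appear after shaping?

    If the font absorbed the virama into a conjunct ligature, it won't;
    but a font could also substitute a *visible* virama ligature, which
    this check would miss, so treat the result as a hint, not a fact.
    """
    blob = hb.Blob.from_file_path(font_path)
    face = hb.Face(blob)
    font = hb.Font(face)
    virama_gid = font.get_nominal_glyph(ord(virama))

    buf = hb.Buffer()
    buf.add_str(cluster)
    buf.guess_segment_properties()
    hb.shape(font, buf)
    # After shaping, info.codepoint holds the glyph ID, not a character.
    return any(info.codepoint == virama_gid for info in buf.glyph_infos)

# e.g. virama_survives("SomeDevanagariFont.ttf", "\u0915\u094D\u0924", "\u094D")
```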

For a brief introduction with pictures see Typographic character units in complex scripts.

r12a commented 3 years ago

And finally, here are some thoughts from @kojiishi:

I can suggest two possible approaches. One: since you said "Chrome's approach is probably more useful", if that's true and (while incorrect) acceptable for Indic, I think it's best to change UAX #29 so that we give up breaking at the virama. I understand it's not desirable, but I have no idea how bad it would be or how readily it could be accepted; I hope you and Indic experts can determine that.

If that's not acceptable, the other approach is a bit long.

I think this should be spec'ed at the OpenType level. Thanks to your support, as far as I understand at this moment, the OpenType cluster defined in the Devanagari spec does not provide the information needed for this purpose, so we need new logic. The OpenType script development specs are, I think, the best place to add that logic. It may involve adding new data/features to OpenType, such as a way to identify a non-explicit virama if that isn't possible with current OpenType; I hope developing the spec will surface the necessary data/features.

I can't speak for CoreText/DirectWrite, but HarfBuzz is a pure implementation of OpenType specs. Once the algorithm is defined in the specs, I hope the HarfBuzz community can consider implementations.

Once the necessary information is produced by shaping engines (HarfBuzz/CoreText/DirectWrite), browsers can start thinking about how to switch from Unicode code-point/ICU-based cluster analysis to glyph/OpenType-based cluster analysis, in both specs and implementations. This won't be easy either, but I think it is technically doable.
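As a rough illustration of what glyph-based cluster analysis could look like: HarfBuzz already records, for each output glyph, the index of the input character cluster it came from, so grouping on those cluster values yields font-aware units. A sketch with uharfbuzz, assuming a left-to-right run and a placeholder font path:

```python
import uharfbuzz as hb

def typographic_units(font_path: str, text: str) -> list[str]:
    """Split text at the boundaries of HarfBuzz cluster values.

    Glyphs that share a cluster value were merged by the font (e.g.
    into a conjunct), so the corresponding characters stay together.
    Assumes a left-to-right run, where cluster values are ascending.
    """
    blob = hb.Blob.from_file_path(font_path)
    font = hb.Font(hb.Face(blob))
    buf = hb.Buffer()
    buf.add_str(text)
    buf.guess_segment_properties()
    hb.shape(font, buf)
    starts = sorted({info.cluster for info in buf.glyph_infos})
    return [text[a:b] for a, b in zip(starts, starts[1:] + [len(text)])]
```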

tiroj commented 3 years ago

I second the suggestion that, in the absence of cleverer methods, the segmentation used by Chrome is a better default behaviour.

I can also suggest that if the codepoint sequence contains a virama character followed by a ZWNJ, that would indicate an explicit virama that is not an accident of a particular font's behaviour, and hence a safe place to segment. If, for example, I wanted to apply CSS styling in a way that would only affect the first consonant+virama in a Tamil sequence, I could add ZWNJ to make clear that the cluster can be segmented at that point.
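Here is that heuristic as a small Python sketch; the VIRAMAS set is an illustrative subset (Devanagari, Bengali, Tamil), not the full InSC=Virama list, and the function name is mine:

```python
ZWNJ = "\u200C"
VIRAMAS = {"\u094D", "\u09CD", "\u0BCD"}  # Devanagari, Bengali, Tamil

def safe_segmentation_points(text: str) -> list[int]:
    """Offsets just after a virama+ZWNJ pair, where splitting is safe
    regardless of what the font would otherwise do with the cluster."""
    return [i + 2 for i in range(len(text) - 1)
            if text[i] in VIRAMAS and text[i + 1] == ZWNJ]

# e.g. "க்\u200Cஷ" (Tamil KA + pulli + ZWNJ + SSA) -> [3]
```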

[As an aside: I have been leaning for some time towards the opinion that all the Brahmi derived scripts in Unicode should be given explicit virama characters that are independent of the graphical conjunct forming characters. I know there are all sorts of reasons why this is unlikely to happen.]

OpenType Layout shaping engines track outcomes of some glyph substitution features and, in particular, the presence of explicit virama in a string after the orthographic unit shaping feature block has been processed, because this information is necessary to the reordering of ikar and reph (in modern convention). So the logic for determining if a semi-processed glyph string contains an explicit virama exists and is well defined. The question, I suppose, is whether browsers could apply or access that logic during segmentation.

kojiishi commented 3 years ago

Thank you @tiroj for the comment, this is very much appreciated. I work on a browser layout engine (Blink), but I must admit that I'm not very familiar with Indic scripts or with the internals of shaping engines.

First, probably a novice question, but is "explicit virama" the same as "visible virama"? @r12a told me that we want to split the cluster only when there's a "visible virama", and you said shaping engines know whether an "explicit virama" exists or not.

If these two terms mean the same thing, I think you're right: it's not a matter of the spec but of shaping engines exposing that information to clients. I don't think browsers can access that logic today, but we can probably move this discussion to HarfBuzz to enable it. /cc @behdad

tiroj commented 3 years ago

Explicit virama and visible virama are often the same thing, but not always. There is probably a better term than either explicit or visible, but I am not sure yet what it is.

By explicit virama, I mean a unique glyph in the run that singularly represents the virama character. This is what is tracked by the shaping engine, and hence plays a role in the reordering stage of Indic layout. The reason I do not like the term visible virama is that it is possible to have a visible virama sign appear in text without it being a unique glyph singularly representing that character.

So, for example, when we make fonts that support the older deva shaping model instead of the typical implementation of dev2 shaping, we include half-form glyphs of retroflex letters that include a visible virama, e.g. ड् ढ्. In GSUB terms, these are ligatures of the letter and the virama character. These may then be further ligated into conjunct ligatures with other letters, in which case the visible virama disappears; but in some combinations, or when followed by ZWNJ, these glyphs would present a visible virama in the text but not an explicit (unique, singular) virama glyph.

Now, in a deva implementation this doesn’t matter to reordering, because that shaping model ignored the presence of a virama in the run when reordering reph and ikar. However, the dev2 model was designed to enable font makers to choose how they wanted reordering to be applied, recognising that the convention not to reorder reph and ikar past a visible virama is a modern convention, and that fonts might be made that implement older conventions in which this is not the case. In other words, it is acceptable to apply the deva approach of having nominal half-forms of retroflex consonants (and other letters that do not have half-form proper representations) as ligatures with the virama sign, and because that virama would hence be visible but not explicit, it would not block reordering of reph and ikar.

So things get complicated for you, because at the font level this is all about what happens at the reordering stage of Indic layout, which might be helpful to you in determining how and where to apply some kinds of CSS text display, but won't necessarily be. The shaping engine will be tracking the presence of an explicit, unique glyph that singularly represents the virama character, but there may legitimately be situations in which a virama is visible but not tracked, because it isn't explicit, unique, etc.

r12a commented 3 years ago

Btw, just some extra background (talking about codepoints rather than glyphs now)...

There are three types of vowel-killer, distinguished by Unicode properties and linked to specific characters in particular scripts by https://www.unicode.org/Public/13.0.0/ucd/IndicSyllabicCategory.txt. These are:

Pure killers, which are always visible and kill inherent vowels with no conjunct behavior. We have no issue with these characters. See a list of them at https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[:InSC=Pure_Killer:]

Invisible Stackers, which produce conjuncts, but which are always invisible. We have no beef with these characters either. See a list at https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[:InSC=Invisible_Stacker:]

Viramas, which may sometimes be invisible and other times visible. These are the problematic characters. For a list of such characters, and the scripts they occur in, see https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[:InSC=Virama:]
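For completeness, here is a Python sketch that derives those three lists directly from the UCD data file cited above (it fetches over the network, does no error handling, and the helper name is mine):

```python
import urllib.request
from collections import defaultdict

UCD = "https://www.unicode.org/Public/13.0.0/ucd/IndicSyllabicCategory.txt"
KILLERS = ("Pure_Killer", "Invisible_Stacker", "Virama")

def vowel_killers() -> dict[str, list[int]]:
    """Map each vowel-killer category to its list of code points."""
    by_category = defaultdict(list)
    with urllib.request.urlopen(UCD) as f:
        for raw in f:
            # Each data line is "codepoint(s) ; category # comment".
            line = raw.decode("utf-8").split("#", 1)[0].strip()
            if not line:
                continue
            rng, insc = (part.strip() for part in line.split(";"))
            if insc in KILLERS:
                lo, _, hi = rng.partition("..")
                by_category[insc].extend(range(int(lo, 16), int(hi or lo, 16) + 1))
    return by_category
```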

It's useful to have a list of the 27 scripts where we need to invoke special behaviours. However, some of those scripts (like Tamil) will rarely use the virama to generate conjuncts in modern text, and need a different default approach to others (like Newa) that will use it to make conjuncts for most consonant clusters.

tiroj commented 3 years ago

> Invisible Stackers, which produce conjuncts, but which are always invisible. We have no beef with these characters either. See a list at https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[:InSC=Invisible_Stacker:]

‘Always invisible’ is an interesting intention. In my experience, this is not always the case in practice.

I have been looking at the traditional Meetei Mayek orthography recently, which is encoded in Unicode with an invisible stacker:

AAF6  MEETEI MAYEK VIRAMA
• used to form conjuncts in historical orthographies

In fonts, such characters tend to be represented as shown in the Unicode glyph charts, with a small subscript + sign. Having a visible representation of the character can aid in both making the font—especially in a graphical GSUB interface such as VOLT—and in editing text: it is nice to have visual feedback on what you are typing, even if the glyph is subsequently swallowed in some form of conjunct display (whether ligature formation, below-base form substitution, or some combination of methods in the font).

But what if it isn't swallowed? What if the font does not contain conjunct shaping for a particular sequence of consonant + invisible stacker virama + consonant?

Some scripts form conjuncts in systematic ways, e.g. using subscribed below-base or post-base forms for secondary conjuncts, and these generally are handled okay with an invisible stacker virama, as the name suggests. But Meetei Mayek is an example of a script in which conjunct formation is not systematic; instead it used an evolving set of conventional ligature forms for conjuncts, which differed across time and locale. These conventions are not well documented, and I suspect that making a traditional Meetei Mayek font is currently impossible: the necessary information about conjunct ligature sets is not available, and will require significant research involving original manuscripts in Manipur and collections elsewhere. But even if one were to document and make a font that represented some standardised form of the traditional orthography, Unicode text may still present U+AAF6 in a sequence that the font cannot represent graphically as a conjunct.

This is, fundamentally, an issue of the script: the traditional Meetei Mayek orthography had no secondary method to represent conjuncts, no visible virama option (the modern, reformed orthography introduced a distinctive visible virama mechanism, which is now used exclusively to indicate conjuncts, but this cannot be incorporated into the traditional orthography).

All of which is a long and not especially helpful way to say that ‘always invisible’ may be the intention, but the reality is that these characters can easily show up as visible entities in text.