w3c / alreq

Documenting gaps and requirements for support of Arabic Script languages on the Web and in eBooks.

U+FDF2 'ARABIC LIGATURE ALLAH ISOLATED FORM' not always rendered correctly #125

Open Manishearth opened 7 years ago

Manishearth commented 7 years ago

U+FDF2 'ARABIC LIGATURE ALLAH ISOLATED FORM' (ﷲ) is supposed to render as alef-lam-lam-heh (with diacritics), but in some fonts, including Courier New, the alef is missing.

http://www.fileformat.info/info/unicode/char/fdf2/fontsupport.htm

The code point could conceivably mean "the main l-l-h ligature in 'allah'". However, the spec decomposes it as a-l-l-h, so all fonts should render the leading alef.
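The decomposition mentioned here can be inspected with Python's standard `unicodedata` module. This is a minimal sketch; the variable names are only illustrative, and the exact output depends on the Unicode data version bundled with the interpreter.

```python
import unicodedata

ALLAH_LIGATURE = "\uFDF2"

# Raw compatibility decomposition field from the Unicode Character Database.
decomposition = unicodedata.decomposition(ALLAH_LIGATURE)

# NFKC applies the compatibility mapping, yielding the plain letters.
normalized = unicodedata.normalize("NFKC", ALLAH_LIGATURE)
letter_names = [unicodedata.name(ch) for ch in normalized]

print(decomposition)
for name in letter_names:
    print(name)
```

The decomposition starts with alef, which is why a glyph for U+FDF2 that omits the alef does not match the character's definition.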

behnam commented 7 years ago

Screenshot on my system, with buggy fonts highlighted in red:

[Two screenshots, 2017-06-13]

Ligatures of this kind, especially RIAL and ALLAH, are very common in fonts.

The bug here seems to be the font assigning U+FDF2 to a ligature glyph for the second joining segment of the word ALLAH (which is LLAH), instead of creating a composed glyph for U+FDF2 using the ligature.

CLDR data, which is our primary source for character support, has no information about ligatures (or their possible code points). Since this bug seems common, especially in the more open-source fonts, I think we can cover the topic in ALReq and maybe even provide an annex with details about the important ligatures and how fonts should implement them (like the detail here that the ligature glyph doesn't get the U+FDF2 code point, but the U+FDF2 glyph uses the ligature).

What do you think?

khaledhosny commented 7 years ago

Since U+FDF2 is a presentation form character, I think we shouldn’t say much more than discouraging the use of presentation forms in text input. As for the fonts, though they indeed break the glyph for U+FDF2, the ligatures for الله and لله still work correctly.
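One way to act on this advice is to flag presentation-form characters in input text. The following is a minimal sketch, not an established tool: the function name `find_presentation_forms` is hypothetical, and the ranges are an assumption based on the Arabic Presentation Forms-A block (U+FB50–U+FDFF) and the Forms-B block, stopping at U+FEFE so that the byte order mark U+FEFF is not flagged.

```python
import unicodedata

# Assumed ranges: Arabic Presentation Forms-A, and Forms-B without U+FEFF.
PRESENTATION_FORM_RANGES = [(0xFB50, 0xFDFF), (0xFE70, 0xFEFE)]


def find_presentation_forms(text):
    """Return (index, character, name) for each presentation-form character."""
    hits = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in PRESENTATION_FORM_RANGES):
            # Some code points in these ranges are unassigned, so give a default.
            hits.append((index, ch, unicodedata.name(ch, "<unnamed>")))
    return hits


if __name__ == "__main__":
    for hit in find_presentation_forms("\uFDF2 \u0627\u0644\u0644\u0647"):
        print(hit)
```

Text typed with the ordinary letters U+0627 U+0644 U+0644 U+0647 produces no hits, so only the presentation form is reported.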

behnam commented 7 years ago

Right, @khaledhosny. True that we want to discourage them in text. So, the question is, do we want to cover the issue for the sake of improving font development processes and font products for the script?

Since the topic is not exactly text layout, I think it could be a separate (wiki) document, or maybe an annex on font development.

khaledhosny commented 7 years ago

I agree this does not belong in the main document; an annex on Arabic font development best practices might be a good idea.

ntounsi commented 7 years ago

My thinking is :

HTML code to test your fonts: <p>&nbsp;&#xFDF2; &#x627;&#x644;&#x644;&#x64E;&#x651;&#x647; &#x627;&#x644;&#x644;&#x647; </p>
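To see exactly which code points that test paragraph contains, the character references can be decoded with Python's standard `html` and `unicodedata` modules. This is a sketch only; `describe` and `TEST_MARKUP` are hypothetical names, and the string is the body of the paragraph above with the `<p>` tags dropped.

```python
import html
import unicodedata

TEST_MARKUP = (
    "&nbsp;&#xFDF2; &#x627;&#x644;&#x644;&#x64E;&#x651;&#x647; "
    "&#x627;&#x644;&#x644;&#x647; "
)


def describe(markup):
    """Decode character references and list each word's code points by name."""
    text = html.unescape(markup).replace("\u00a0", " ")
    return [
        [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in word]
        for word in text.split()
    ]


if __name__ == "__main__":
    for word in describe(TEST_MARKUP):
        print(word)
```

The three words are the presentation form itself, a spelling with fatha and shadda, and the bare letters, so the page compares how a font handles each case.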

@behnam and @khaled, +1 to cover font development best practices.

moyogo commented 6 years ago

The Unicode Standard 11.0.0 says the following in section 9.2 Arabic Presentation Forms-A: U+FB50–U+FDFF, Word Ligatures (this was added in Unicode 7.0.0):

U+FDF2 ARABIC LIGATURE ALLAH ISOLATED FORM is a very common ligature, used to display the name of God. When the formation of the allah ligature is desired, the recommended way to represent the word would be <alef, lam, lam, shadda, superscript alef, heh> <0627, 0644, 0644, 0651, 0670, 0647>. In non-Arabic languages, other forms of heh, such as heh goal (U+06C1), may also form the ligature. Extra care should be taken not to form the ligature in the absence of the shadda and the superscript alef, as the sequence <alef, lam, lam, heh> and <alef, lam, lam, shadda, heh> exist in Persian and other languages with different meanings or pronunciations, where the formation of the ligature would be incorrect and inappropriate.
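The rule in the quoted passage can be sketched as a simple check. This illustrates only what the Unicode text recommends; it is not how any particular font or shaping engine decides, and `should_ligate` and `RECOMMENDED` are hypothetical names. The input is normalized first so that the two combining marks are in canonical order (shadda before superscript alef).

```python
import re
import unicodedata

# alef, lam, lam, shadda, superscript alef, then heh or heh goal.
RECOMMENDED = re.compile("\u0627\u0644\u0644\u0651\u0670[\u0647\u06C1]")


def should_ligate(word):
    """True if the word contains the sequence Unicode recommends for the ligature."""
    # NFC puts shadda (ccc 33) before superscript alef (ccc 35).
    canonical = unicodedata.normalize("NFC", word)
    return RECOMMENDED.search(canonical) is not None


if __name__ == "__main__":
    samples = {
        "recommended": "\u0627\u0644\u0644\u0651\u0670\u0647",
        "bare letters": "\u0627\u0644\u0644\u0647",
        "shadda only": "\u0627\u0644\u0644\u0651\u0647",
    }
    for label, word in samples.items():
        print(label, should_ligate(word))
```

Under this reading, the bare <alef, lam, lam, heh> sequence and the shadda-only sequence are left unligated, which are the two cases the passage warns about.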

r12a commented 6 years ago

I decided it was time for me to explore this a little more deeply. Here are some other results. I created a test page at: https://w3c.github.io/alreq/gap-analysis/tests/ligation/ligation_000.html

Here are some results i screen-captured on my Mac. Grey backgrounds from a v quick scan indicate things i think are probably incorrect.

[Screenshot, 2018-06-21]

Essentially, this whole thing is quite broken, it seems. (Which is surprising given the content involved.)

Manishearth commented 6 years ago

Arial overcompensating by adding a double shadda/alif is very surprising (and somewhat hilarious) to me given how commonly that font is used.

Then again, I guess very little about non-latin text not working on computers should surprise me anymore 😩

khaledhosny commented 6 years ago

My perception is that, contrary to what Unicode suggests, Arabic users expect bare [alef] lam lam heh to ligate, and that is what almost all Arabic fonts do. Arabic words other than the name of God that match the same sequence of letters are so uncommon that I never encountered any of them until I was researching this very issue. In Amiri I approached this from the other end: actively matching sequences that are unlikely to be the name of God and unligating them, e.g. خالله does not ligate, but فالله ligates while فالَله does not.

Manishearth commented 6 years ago

When I discussed this issue with @roozbehp he had some examples of Persian words that do this, IIRC.

Just to lay it out, there are multiple issues here, of varying severity:

moyogo commented 6 years ago

As @r12a notes in https://r12a.github.io/scripts/arabic/block#charFDF2 the compatibility decomposition for FDF2 is <alif, lam, lam, heh> (“≈ [isolated] 0627 0644 0644 0647”).

While the (non-normative) reference glyph is a ligature <alif, lam, lam, shadda, superscript alif, heh>, this hasn’t always been the case. In Appendix H, New Characters, of the Unicode Standard 1.1, the reference glyph used is a ligature <alif, lam, lam, heh> without shadda or superscript alif. This may explain where the compatibility decomposition of FDF2 comes from.

[Screenshot, 2018-06-22]

asmusf commented 6 years ago

The production process changed between Unicode 2.x and 3.0. From that point on, different custom software was used with an entirely new collection of TrueType fonts. With many upgrades, both to the software and the font collection, that process is still very much in place today.

Every update of the font collection bears the risk of unintentional changes, and not all of them are caught by reviewers. Therefore, it would take some digging to find out whether the change from a glyph matching the decomposition to a glyph adding shadda and alif was indeed intentional at the time.

moyogo commented 6 years ago

I was curious to see if any fonts have FDF2 as alif, lam, lam, heh without shadda and superscript alif.

I managed to find a handful:

There are most probably more.

Beyond these, there are also typefaces that do not ligate <lam, lam, heh> (regardless of what FDF2 glyph they have). Some of these do have an optional discretionary ligature feature that forms the ligature.

There may also be fonts that do FDF2 with shadda but no superscript alif, like https://www.linotype.com/1079191/hasan-alquds-unicode-regular-product.html?site=webfonts&format=ot-ttf&branding=std, or fonts that do FDF2 with shadda and fatha, like https://fonts.google.com/specimen/Harmattan.