opentypejs / opentype.js

Read and write OpenType fonts using JavaScript.
https://opentype.js.org/
MIT License

Arabic text is not rendered correctly #364

Open teleranek opened 5 years ago

teleranek commented 5 years ago

Expected Behavior

Let's take the following Arabic sentence: بِسْمِ الله الرَّحْمٰنِ الرَّحِيمِ. Natively it renders in the following form: reference

Current Behavior

With the current version of opentype.js (and even with PRs #361 and #362 from @solomancode merged), it looks like this: test_scheh

Steps to Reproduce

Create the following JS file in the opentype.js directory:

var opentype = require('./dist/opentype');
var fs = require("fs");
var Canvas = require("canvas").Canvas;

opentype.load('fonts/Scheherazade-Bold.ttf', function(err, font) {
    if (err) {
        // alert() is a browser API and does not exist in Node.js
        console.error('Font could not be loaded: ' + err);
    } else {
        var canvas = new Canvas(1000, 500, "png");
        var ctx = canvas.getContext('2d');
        var path = font.getPath('بِسْمِ الله الرَّحْمٰنِ الرَّحِيمِ', 0, 200, 100);
        path.draw(ctx);
        var buf = canvas.toBuffer();
        fs.writeFileSync("test.png", buf);
    }
});

(Additionally, run npm install canvas to be able to render the font to PNG.)

Your Environment

solomancode commented 5 years ago

Hi @teleranek ,

Thanks for reporting that issue,

First, you need to know that there are two types of characters in Arabic script; see the image below.

issue-tele

1. Arabic Tashkeel/diacritics. ( marked in red )

These marks are supplementary, but they provide information about the correct pronunciation of an Arabic word. In most cases a native Arabic speaker can read Arabic text without confusing the meanings of the words, and it is unlikely that someone would use the marks when writing by hand. They are important, however, for preventing misinterpretation of texts such as the Holy Quran. For example, if you follow the link in the previous paragraph, you'll find that the second verse says that Byzantium has been vanquished/defeated, but if we remove the diacritics, as you can see in the image below, the meaning of the verse becomes ambiguous: it could mean that they have been defeated or that they are victorious, depending on the pronunciation.

roman

2. Arabic Abjad ( marked in Black )

The basic Arabic alphabet (28 letters). In terms of lettering, the three lines of text have the same letters (no issue here). For a native Arabic speaker there will be no confusion in reading and understanding the meaning conveyed: even without diacritics, these three lines of text convey the same meaning:

بسم الله الرحمن الرحيم In the Name of Allah, the Most Beneficent, the Most Merciful

even this form of Arabic Calligraphy

has the same meaning.
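
The split between base letters and Tashkeel can be sketched in code. This is a minimal illustration, assuming the marks of interest are U+064B to U+0652 plus U+0670 (superscript alef); that range is a subset chosen for the example, and `stripTashkeel` is a hypothetical helper, not part of opentype.js.

```javascript
// Illustrative subset of Arabic diacritics: fathatan..sukun (U+064B..U+0652)
// plus superscript alef (U+0670). Not an exhaustive list of combining marks.
const TASHKEEL = /[\u064B-\u0652\u0670]/g;

// Hypothetical helper: remove the diacritics and keep only the base letters.
function stripTashkeel(text) {
    return text.replace(TASHKEEL, '');
}

// beh + kasra, seen + sukun, meem + kasra (the first word of the test sentence)
const marked = '\u0628\u0650\u0633\u0652\u0645\u0650';
const bare = stripTashkeel(marked);

console.log(marked.length, bare.length); // 6 code units in, 3 base letters out
```

The base letters are what the joining (GSUB) logic operates on; the removed marks are the glyphs whose placement is discussed in the rest of this thread.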


To conclude there are two issues we haven't been able to solve yet!

1. Rendering of special word ( Allah : الله )

I found that the font includes a composed glyph for the Arabic word Allah, as seen in the image below, but the rule for applying this glyph is missing from the font. I have already contacted the developers of Scheherazade to learn more about this issue and I hope to hear from them soon. Until this is fixed, you can insert the composed character manually: on Linux, press ctrl + shift + u and then enter fdf2 to get ﷲ; on Windows, hold alt and enter the same sequence, fdf2. Do this, of course, after selecting Scheherazade as your writing font.

lafdzualjalalah
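
When the text is built in code rather than typed, the same character can be written with an escape instead of a keyboard shortcut. A small sketch, assuming the string is then handed to font.getPath as in the reproduction script; the variable names are illustrative only.

```javascript
// U+FDF2 ARABIC LIGATURE ALLAH ISOLATED FORM, written as a JS escape.
const ALLAH_LIGATURE = '\uFDF2';

// Equivalent construction from the numeric code point.
const fromCodePoint = String.fromCodePoint(0xFDF2);

// "bsm" (beh, seen, meem) followed by a space and the precomposed character.
// This string could be passed to font.getPath(...) in place of typed text.
const text = '\u0628\u0633\u0645 ' + ALLAH_LIGATURE;

console.log(ALLAH_LIGATURE === fromCodePoint); // true
```

This only helps with fonts that actually contain a glyph for U+FDF2, as Scheherazade does according to the comment above.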

2. Incorrect positioning of Arabic diacritics

I already discussed this issue with @Jolg42 in my last PR.

With diacritics composition supported, we are not missing any of the essential features for properly rendering Arabic text except correct positioning of the diacritics, and I found that we are missing the required GPOS lookups to fix this issue.
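
For readers who want to check which mark-positioning lookups a font carries, here is a rough sketch. The lookup type numbers (4, 5, 6) come from the OpenType GPOS specification; the input shape, an array of objects with a `lookupType` field as might be found under `font.tables.gpos.lookups`, is an assumption about the parsed table, and `listMarkLookups` is a hypothetical helper.

```javascript
// GPOS lookup types that attach marks, per the OpenType specification.
const MARK_LOOKUP_TYPES = {
    4: 'MarkToBase',
    5: 'MarkToLigature',
    6: 'MarkToMark'
};

// Hypothetical helper: name the mark-attachment lookups in a lookup list.
// Assumes each entry exposes a numeric `lookupType`. Lookups wrapped in an
// Extension lookup (type 9) are not unwrapped here.
function listMarkLookups(lookups) {
    return (lookups || [])
        .filter((lookup) => MARK_LOOKUP_TYPES[lookup.lookupType] !== undefined)
        .map((lookup) => MARK_LOOKUP_TYPES[lookup.lookupType]);
}

// Example with hand-written data standing in for a parsed GPOS table.
const sample = [{ lookupType: 2 }, { lookupType: 4 }, { lookupType: 6 }];
console.log(listMarkLookups(sample)); // [ 'MarkToBase', 'MarkToMark' ]
```

Finding such lookups in the font is only half of the problem: the renderer also has to apply their anchor data when laying out glyphs, which is the part described as missing above.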

I hope you find my response useful. I would also like to note that Arabic text rendering is not well supported in many libraries and platforms. I'm not saying that as an excuse, but computer fonts (OpenType in our case) were originally made to support basic writing scripts like Latin, and they offer limited support for complex writing scripts like Arabic. I hope one day we'll be able to change that. Until then, stay tuned for my awesome future contributions :blush:

For more information, please check my PR #359. Also, Scheherazade is a cursive-type font, so its vertical lines were designed to be a little round, unlike Arial, which has sharp edges. It's not a fault in any way; it's just the designer's stylistic decision.

moyogo commented 5 years ago

Regarding the Allah word, Scheherazade is following the Unicode specification. Fonts should only form the allah ligature when the shadda and the superscript alef are present in the character sequence. See the note about FDF2 in https://www.unicode.org/versions/Unicode11.0.0/ch09.pdf#page=32:

U+FDF2 ARABIC LIGATURE ALLAH ISOLATED FORM is a very common ligature, used to display the name of God. When the formation of the allah ligature is desired, the recommended way to represent the word would be <alef, lam, lam, shadda, superscript alef, heh><0627, 0644, 0644, 0651, 0670, 0647>. In non-Arabic languages, other forms of heh, such as heh goal (U+06C1), may also form the ligature. Extra care should be taken not to form the ligature in the absence of the shadda and the superscript alef, as the sequences <alef, lam, lam, heh> and <alef, lam, lam, shadda, heh> exist in Persian and other languages with different meanings or pronunciations, where the formation of the ligature would be incorrect and inappropriate.

However many fonts, including widely used system fonts, do form the ligature without the shadda and superscript alef being present in the character sequence. So it is understandable that there is such a user expectation but it is problematic for the reasons given in the Unicode standard.
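
The rule in the quoted note can be written as a small predicate. This is a sketch of the Unicode recommendation only, using the code points listed in the quote; `shouldFormAllahLigature` is a hypothetical name, and the heh goal variants mentioned in the note are left out for brevity.

```javascript
// Code points named in the Unicode note quoted above.
const ALEF = '\u0627';
const LAM = '\u0644';
const SHADDA = '\u0651';
const SUPERSCRIPT_ALEF = '\u0670';
const HEH = '\u0647';

// The one sequence Unicode recommends for forming the ligature.
const RECOMMENDED = ALEF + LAM + LAM + SHADDA + SUPERSCRIPT_ALEF + HEH;

// Hypothetical predicate: true only for the fully marked sequence.
function shouldFormAllahLigature(word) {
    return word === RECOMMENDED;
}

console.log(shouldFormAllahLigature(RECOMMENDED));                     // true
console.log(shouldFormAllahLigature(ALEF + LAM + LAM + HEH));          // false
console.log(shouldFormAllahLigature(ALEF + LAM + LAM + SHADDA + HEH)); // false
```

The two `false` cases are the sequences the note says exist in Persian and other languages, which is why a font following the specification leaves them unligated.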

solomancode commented 5 years ago

Hi @moyogo ,

Thanks for your feedback.

Technically Speaking

The Scheherazade font developers have provided rules for composing one or more glyphs in certain contexts, and I used these rules as part of my implementation for fixing Arabic text rendering in my PR #359.

For example, the rule for composing a Shadda with a superscript alef (alif khanjariyya) is present in the GSUB table, and if you test it now you'll get the composed glyph rather than two separate glyphs.

shadda-dagger Shadda composed with superscript alef rendered by opentype.js

Also, if you enter the sequence <alef, lam, lam, shadda, superscript alef, heh>, opentype.js renders it as expected, but what you get is <alef, lam, lam, (shadda and superscript alef in a single glyph), heh>:

allah_opentype

If you inspect the Scheherazade font file, you'll find that there is a single glyph that has the composition of <alef, lam, lam, shadda, superscript alef, heh>: Unicode \ufdf2.

lafdzualjalalah

I searched the GSUB table for a rule that applies this glyph, but I couldn't find any rule that describes this relation. I can't manually create a mapping between the sequence <alef, lam, lam, shadda, superscript alef, heh> and the glyph \ufdf2; if I did, I might break the rendering of this glyph in other fonts.
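
A caller who knows which font is in use could still do this substitution on the input string before handing it to the library, which avoids hard-coding the mapping for every font. The sketch below is one possible workaround, not something opentype.js does; `substituteAllahLigature` is a hypothetical helper, and the `fontHasLigature` flag is assumed to come from the caller (for example from a check such as font.hasChar('\uFDF2'), if the opentype.js version in use provides it).

```javascript
// Sequence recommended by Unicode for the ligature:
// alef, lam, lam, shadda, superscript alef, heh.
const ALLAH_SEQUENCE = '\u0627\u0644\u0644\u0651\u0670\u0647';
const ALLAH_LIGATURE = '\uFDF2';

// Hypothetical helper: replace the full sequence with the precomposed
// character, but only when the caller says the font has a glyph for it.
function substituteAllahLigature(text, fontHasLigature) {
    if (!fontHasLigature) {
        return text; // leave the text untouched for fonts without the glyph
    }
    return text.split(ALLAH_SEQUENCE).join(ALLAH_LIGATURE);
}

const input = '\u0628\u0633\u0645 ' + ALLAH_SEQUENCE;
console.log(substituteAllahLigature(input, true).endsWith(ALLAH_LIGATURE)); // true
console.log(substituteAllahLigature(input, false) === input);               // true
```

Because only the fully marked sequence is matched, the unmarked spellings discussed in this thread pass through unchanged.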

Practically speaking

I don't agree with what's mentioned in the spec.

Extra care should be taken not to form the ligature in the absence of the shadda and the superscript alef, as the sequences <alef, lam, lam, heh> and <alef, lam, lam, shadda, heh> exist in Persian and other languages with different meanings or pronunciations, where the formation of the ligature would be incorrect and inappropriate.

The Shadda is almost always present when writing Allah الله. Even written without Tashkeel, the word conveys the same meaning for native Arabic speakers, Muslims and non-Muslims alike, which is the one and only God; the word refers to the same deity for Muslims, Christians, Jews, etc.

I haven't once encountered a situation where someone needed to write the word without its diacritics. I don't even know what it would mean without diacritics :smile: So it's reasonable and more practical to form the ligature even when the Shadda is not present.

solomancode commented 5 years ago

Hi @teleranek

I just received this email from one of the developers of Scheherazade font.

reply

We're sorry, there is nothing we can do for now. Please don't be discouraged by our response; supporting Arabic features is a big undertaking that requires a lot of time and effort.

Thank you, and stay tuned.

StevenEWright commented 1 year ago

I don't agree with what's mentioned in the spec.

[...]

The Shadda is almost always present when writing Allah الله. Even written without Tashkeel, the word conveys the same meaning for native Arabic speakers, Muslims and non-Muslims alike, which is the one and only God; the word refers to the same deity for Muslims, Christians, Jews, etc.

FWIW, I was trying to write واللـهِ the other day and was very frustrated that unless I added the tatweel, there was nothing I could do to stop it from becoming والله
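
The tatweel workaround mentioned here can be expressed as a string transform. This is only a sketch of that workaround, assuming a tatweel (U+0640) between the second lam and the heh is enough to stop a font from forming the ligature; `breakLamHeh` is a hypothetical helper, and it will also affect any other word containing lam, lam, heh.

```javascript
const TATWEEL = '\u0640';

// Hypothetical helper: insert a tatweel after "lam lam" whenever a heh
// follows, so the letters are no longer adjacent for ligature matching.
function breakLamHeh(text) {
    return text.replace(/\u0644\u0644(?=\u0647)/g, '\u0644\u0644' + TATWEEL);
}

// waw, alef, lam, lam, heh, kasra
const word = '\u0648\u0627\u0644\u0644\u0647\u0650';
console.log(breakLamHeh(word).includes(TATWEEL)); // true
```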

solomancode commented 1 year ago

I don't understand your question. واللـهِ has a 'Kasra', not a Shadda, and it's not related to the spec.