justinpenner / TalkingLeaves

A GlyphsApp plugin to help you explore the world’s languages and writing systems
MIT License

Improve Arabic support (include positional forms and ligatures) #16

Open · justinpenner opened 2 months ago

justinpenner commented 2 months ago

There are a few tricky issues to overcome for TalkingLeaves to fully support Arabic. I do not have much background knowledge of the Arabic script, so any help/comments/feedback/corrections will be greatly appreciated.

Hyperglot defines character sets, not glyph sets

Hyperglot is the core dataset used by TalkingLeaves to define required Unicode characters for any given language. It does not currently define unencoded "alternate" glyphs that are required by many languages, especially in complex scripts such as Arabic.

But, as I discovered recently, Unicode includes many Arabic positional forms and ligatures as "compatibility characters". These are in fact defined in Hyperglot's Arabic language definitions, but I have no idea how complete that coverage is, since there must be many ligatures and positional forms that some languages need but that aren't encoded in Unicode. When these characters are added to a font in Glyphs, Glyphs strips the codepoint and gives them a "nice name" that follows the naming conventions, which lets Glyphs auto-generate the necessary feature code.
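As a quick illustration using Python's standard unicodedata module (not TalkingLeaves code), the encoded presentation forms carry compatibility decompositions that point back to the base character:

import unicodedata

ch = "\uFE8B"  # one of the encoded Arabic presentation forms
print(unicodedata.name(ch))           # ARABIC LETTER YEH WITH HAMZA ABOVE INITIAL FORM
print(unicodedata.decomposition(ch))  # <initial> 0626, i.e. U+0626 in initial position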

Are there minimum glyph sets for Arabic languages?

Is it even possible to define a minimum glyph set (including alternates) for any Arabic language? Or does it vary entirely based on the type designer's preference and the project they're working on? Since Arabic relies heavily on shaping and OpenType programming, there must be many ways to define an Arabic glyph set. But I think it might still be possible to define a "recommended" glyph set for an Arabic language that covers all the minimum requirements and also allows Glyphs to auto-generate the OpenType feature code.

What would a glyph set definition for an Arabic language look like?

Hopefully, just a list of glyph names following Glyphs' naming conventions for alternates, with underscore (_) separators indicating ligatures and the .init, .medi, .fina, and .isol suffixes indicating positional forms.
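For example, a handful of hypothetical entries could look like this:

alef-ar
alef-ar.fina
behDotless-ar.init
lam_alef-ar
lam_alef-ar.fina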

Where will the glyph set definitions come from?

I don't know of any data sources that would have clearly defined required glyphs for Arabic languages. But, I haven't done a lot of searching yet, as I'm only just beginning to wrap my head around how the Arabic script works. I'll try to search for more info on possible data sources that would be helpful for defining Arabic glyph sets.

Shaperglot has some shaping checks, which essentially feed an input string and the font through a shaping engine (HarfBuzz) and look at what changes in the output string. So, if Shaperglot requires certain substitutions for certain Arabic languages, then it might be possible to infer which positional forms or ligatures are needed in order to pass those checks. I'm in the process of digging into this.
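Conceptually, a check like that can be sketched with uharfbuzz (this is not Shaperglot's actual code; the font path is a placeholder, and surrounding the letter with zero-width joiners is just one way to force a joining context):

import uharfbuzz as hb

def shape_ids(font, text):
    # Shape a string and return the resulting glyph IDs
    buf = hb.Buffer()
    buf.add_str(text)
    buf.guess_segment_properties()
    hb.shape(font, buf)
    return [info.codepoint for info in buf.glyph_infos]

blob = hb.Blob.from_file_path("GF_Arabic_Core.ttf")
font = hb.Font(hb.Face(blob))

# U+0626 on its own vs. between zero-width joiners (U+200D); if the
# letter's glyph doesn't change, the font has no medial behaviour for it
isolated = shape_ids(font, "\u0626")[0]
medial = shape_ids(font, "\u200D\u0626\u200D")[1]
print("medial behaviour present:", isolated != medial)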

justinpenner commented 2 months ago

Can we get a minimum list of unencoded glyphs for an Arabic language from Shaperglot checks?

To understand this issue better, and whether Shaperglot can be useful in solving it, I performed the following test.

Open GF_Arabic_Core.glyphs in Glyphs, then immediately export a binary .ttf to test in Shaperglot. There's no need to draw any glyphs, as Shaperglot doesn't care about outlines.

Run shaperglot check GF_Arabic_Core.ttf ar_Arab to check whether it supports Standard Arabic, which seems like a good language to test on, as it's more or less the lingua franca of the Arab world. It fails, of course, since there are no OpenType features yet, and it outputs a long list of failed checks like this:

 * FAIL: .medi version of ARABIC LETTER YEH WITH HAMZA ABOVE; both buffers returned space=1+0|uni0626=1+600|space=0+0
 * FAIL: .init version of ARABIC LETTER YEH WITH HAMZA ABOVE; both buffers returned space=0+0|uni0626=0+600
 * FAIL: .fina version of ARABIC LETTER ALEF; both buffers returned uni0627=1+600|space=0+0
 * FAIL: .fina version of ARABIC LETTER ALEF WITH MADDA ABOVE; both buffers returned uni0622=1+600|space=0+0

That looks promising. Next, use some regex search-and-replace and a few manual edits to translate those errors into nice glyph names that Glyphs will recognize, like these (see the sketch after this list):

yehHamzaabove-ar.medi
yehHamzaabove-ar.init
alef-ar.fina
alefMaddaabove-ar.fina
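For reference, that translation step can also be scripted in the Glyphs Macro window. A rough sketch, tuned to the FAIL message format above (a few manual edits may still be needed):

import re
import unicodedata

fail_log = """
 * FAIL: .medi version of ARABIC LETTER YEH WITH HAMZA ABOVE; both buffers returned space=1+0|uni0626=1+600|space=0+0
 * FAIL: .fina version of ARABIC LETTER ALEF; both buffers returned uni0627=1+600|space=0+0
"""

names = []
for m in re.finditer(r"FAIL: (\.\w+) version of ([A-Z ]+);", fail_log):
    suffix, unicode_name = m.groups()
    char = unicodedata.lookup(unicode_name)       # Unicode name -> character
    info = Glyphs.glyphInfoForUnicode(ord(char))  # character -> GSGlyphInfo
    names.append(info.name + suffix)

print("\n".join(names))  # yehHamzaabove-ar.medi, alef-ar.fina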

Add them all to GF_Arabic_Core.glyphs, then Font Info > Features > Update to generate all the OpenType features. Export a new binary and run it through Shaperglot again:

Font supports language 'ar_Arab'

Success! This means Shaperglot can indeed be used to infer a minimum glyph set for an Arabic language.

simoncozens commented 2 months ago

yehHamzaabove-ar.medi

I'm glad you found this - it's a perfect example of why you should not use glyph names to detect language support, and why the shaperglot approach (looking at font behaviour) is needed instead.

Suppose I have a font which decomposes the diacritics such that yehHamzaabove-ar becomes behDotless-ar hamzaabove-ar. Then I can correctly shape the medial form of yehHamza using behDotless-ar.medi, without needing a precomposed yehHamzaabove-ar.medi glyph. If yehHamzaabove-ar.medi is part of your "minimal glyph set" then your approach will incorrectly report that my font is missing glyphs needed to support Arabic.

justinpenner commented 2 months ago

Thanks, that's also a perfect example of the type of problem I suspected I would run into, but hadn't been able to identify thus far due to my limited experience with Arabic and with shaping in general.

So, the distinction I should make is: it would be relatively easy for TalkingLeaves to tell the user, "here's a glyph set that will work for the Arabic language you selected." But it would be much harder to tell the user, "given your current glyph set, you could add these additional glyphs in order to support the Arabic language you selected." Do you think the latter is even possible?

If it's not possible, I'll look for another approach. The first purpose for TalkingLeaves is to tell the user which encoded characters are definitely needed for a given language. That was a relatively easy problem to solve. The second purpose is to inform the user when other things are needed (or might be needed) to support a given language. If that means leaving the advice open-ended and pointing them to a book or an online resource for Arabic rather than recommending specific glyphs, that's fine if it's the best solution.

simoncozens commented 2 months ago

But it would be much harder to tell the user, "given your current glyph set, you could add these additional glyphs in order to support the Arabic language you selected." Do you think the latter is even possible?

To be honest I think the entire concept is cursed.

Type designers, particularly those who have been spoiled by Glyphs, confuse "adding glyphs" with "adding support". For a lot of scripts, there are many things you need to do to a font to add language support; adding glyphs is just "level one" of them. (Even in Latin this is true. Thinking holistically about the font is important: is idotless all you need to add to a font to support Turkish? Not if you have a small caps set it isn't. And then there are combinations which are not encoded atomically in Unicode: if you've got edotbelow and gravecomb and all the codepoints you need for Yoruba covered, do you have Yoruba support? Well, that rather depends on your edotbelow anchors!)

This is why I am not a fan of tools which tell designers which glyphs or codepoints they need to add to their font to make languages work - at worst they encourage in-fill-ism ("10 more glyphs and I've made a Sindhi font!"), but even at best they can mislead people into thinking they have added language support when in fact they haven't.

In short a glyph-based approach isn't enough. You have to think about font behaviour.

justinpenner commented 2 months ago

You've given me a lot to think about. I think I need to start by changing the communication in the UI to make it more clear that character sets are only part of what you need to support many languages.

I don't expect I'll be able to build a magic tool that sorts out everything you need to support a given language, but I'd be extremely happy if I could get to the point where TalkingLeaves visually highlights any languages that have known requirements beyond codepoints, and wherever possible gives the user some general notes about what they need to look at. Hyperglot has notes on some languages, like explaining the two Eng forms for Sami and African languages, and Shaperglot's checks can be used to flag things like anchors that might be required in a given language.

kontur commented 1 month ago

For Arabic positional forms, you have most (?) of them encoded in Unicode anyway. You can also rely on Unicode joining types (Hyperglot includes the Unicode data and has a convenience method to access it), so you know which Arabic letters need init/medi/fina forms. In Hyperglot we check this in the font checks, where we test actual shaping akin to Shaperglot, e.g. from here on. In Glyphs you don't have a compiled font to check actual shaping for a string, but you can get all the encoded positional forms' code points, and you have access to GlyphData to get the glyph names for all positional forms and check whether they are in the font.

simoncozens commented 1 month ago

For Arabic positional forms you have most (?) of them encoded as unicodes anyways.

Nope.

Legacy presentation forms aren't always encoded in Arabic fonts. Noto Nastaliq Urdu, for example, doesn't bother, nor does SIL Scheherazade.

florianpircher commented 1 month ago

You don’t need to assign those Unicode code points in your font. But them being in Unicode can act as a repository of important glyphs (encoded as characters) and their joining behavior. I don’t know how complete or reliable that repository is, though.

kontur commented 1 month ago

Ah yes, indeed, Glyphs adds them with just a name but without a Unicode value, so most likely the code points are not set for those in the Glyphs file.

I also assumed GlyphInfo would have a better mapping back to the root of a positional form, but that too seems a bit tricky. It probably needs matching on the glyph name suffix, or on x.glyphInfo.desc (I think that matches Python's unicodedata.name(), minus the positional text, e.g. "ARABIC LETTER BEH" vs. "ARABIC LETTER BEH MEDIAL FORM").
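If desc does behave that way, mapping a positional form back to its base character could look roughly like this (an untested Macro window sketch, resting on that assumption):

import unicodedata

info = Glyphs.glyphInfoForName("beh-ar.medi")
# Assumption: desc holds the base letter's Unicode name without the
# positional text, e.g. "ARABIC LETTER BEH" for beh-ar.medi
base_char = unicodedata.lookup(info.desc)
print(info.name, "->", base_char)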

kontur commented 1 month ago

A naive example of how to get the Arabic positional forms (assuming the default GlyphData.xml with those specific suffixes):

from hyperglot.language import Language
from hyperglot.orthography import Orthography
from hyperglot.parse import get_joining_type

def get_required_glyphs(char):
    # Look up the Glyphs nice name and the Unicode joining type
    info = Glyphs.glyphInfoForUnicode(ord(char))
    joining = get_joining_type(char)

    required = [info.name]

    if joining == "D":
        # Dual-joining letters connect on both sides
        required.extend([info.name + ".init", info.name + ".medi", info.name + ".fina", info.name + ".isol"])
    elif joining == "R":
        # Right-joining letters (like alef) only connect to the preceding
        # letter, so they only take a final form
        required.append(info.name + ".fina")

    return required

arabic_base = Orthography(Language("arb").get_orthography()).base_chars
required = []

for char in arabic_base:
    required.extend(get_required_glyphs(char))

print("\n".join(required))

Of course, you could also just ignore any finessed directionality check and simply brute-force it: see whether Glyphs.glyphInfoForName("whatever.init") != Glyphs.glyphInfoForName("whatever") indicates the existence of a positional form by that name in the GlyphData (let's say, do this for Arabic chars only, for all positional variants). The != check is needed because Glyphs.glyphInfoForName is lenient and returns "the next best thing", e.g. Glyphs.glyphInfoForName("A.foobar") will return the GlyphInfo for "A".

simoncozens commented 1 month ago
        required.extend([info.name + ".init", info.name + ".medi", info.name + ".fina", info.name + ".isol"])

Now any font which doesn't follow the Glyphs naming convention (again, Noto Nastaliq Urdu) "doesn't support Arabic".

When I say the whole thing is cursed, I mean it. You should not try doing it this way.

kontur commented 1 month ago

Point taken. I suppose the only avenue to at least some kind of certainty would be to compile a font on the fly with the features and glyph names from the file, then check whether it actually shapes anything for given sequences.
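Something along those lines could shell out to fontmake and then Shaperglot, e.g. (a sketch; the file names and fontmake's default master_ttf/ output directory are assumptions):

import subprocess

# Compile the .glyphs source into a binary TTF
subprocess.run(["fontmake", "-g", "MyFont.glyphs", "-o", "ttf"], check=True)

# Run the shaping-based checks against the compiled binary
subprocess.run(["shaperglot", "check", "master_ttf/MyFont-Regular.ttf", "ar_Arab"], check=True)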

justinpenner commented 4 weeks ago

@simoncozens makes an extremely important point that checking language support is really independent of any glyph naming conventions. But TalkingLeaves is not meant to be a tool for checking language support, and I'd like to make that clearer to the user somehow. I think a small improvement would be to change these checkbox labels to "Show complete" and "Show incomplete".

[Screenshot: the "Show complete" / "Show incomplete" checkbox labels]

Then, as @kontur suggested, I might eventually add a feature that allows the user to run a language support check via Hyperglot or Shaperglot, on their last exported font file. This would help solidify the intended workflow of using TalkingLeaves to build and expand your glyph set, and then running a check to confirm that your font indeed supports your target languages.

@kontur I couldn't get your example code using joining types to produce a font that passes Hyperglot or Shaperglot checks, but your second idea of brute-forcing works beautifully with some minor modifications. Comparing the two GSGlyphInfo objects directly didn't work for me, but comparing their index property does the trick.

from hyperglot.language import Language
from hyperglot.orthography import Orthography

def get_required_glyphs(char):
  # Look up the Glyphs nice name for this character
  info = Glyphs.glyphInfoForUnicode(ord(char))
  required = [info.name]

  positions = ['.init', '.medi', '.fina', '.isol']

  for pos in positions:
    # glyphInfoForName is lenient and falls back to the base glyph, so a
    # differing index means the positional form really exists in GlyphData
    if Glyphs.glyphInfoForName(info.name).index != Glyphs.glyphInfoForName(info.name + pos).index:
      required.append(info.name + pos)

  return required

arabic_ort = Orthography(Language("arb").get_orthography())
arabic_chars = arabic_ort.base_chars + arabic_ort.base_marks
required = []

for char in arabic_chars:
  required.extend(get_required_glyphs(char))

# Add every required glyph to the current font
for name in required:
  Font.glyphs.append(GSGlyph(name))

The above script adds all the glyphs that are needed to pass Hyperglot's check for Standard Arabic (arb). Shaperglot's charset for Standard Arabic is a little different, so if I add those missing codepoints and run the script on them, the resulting font passes Shaperglot's check for Standard Arabic.

TL;DR: we now have a way to generate a set of glyph names, to be used in GlyphsApp, that will result in an Arabic font that passes Hyperglot/Shaperglot checks.

justinpenner commented 3 weeks ago

Re: @simoncozens’ earlier comments, c681157 adds a small section ("What does it mean to 'support' a language?") near the top of README.md, along with some other small changes, to help inform users that TalkingLeaves is not a tool for checking language support and that it currently deals only with character sets, which are just one piece of adding language support.