gravitystorm / openstreetmap-carto

A general-purpose OpenStreetMap mapnik style, in CartoCSS

Recommendation to include additional Arabic Naskh fallback font #4644

Open bgo-eiu opened 2 years ago

bgo-eiu commented 2 years ago

Carto currently uses the Noto set of font families, which typically provide quite comprehensive coverage of Unicode characters used by a variety of languages. However, there are a handful of recently added Unicode characters which are not covered by the Noto Naskh Arabic fonts. An additional fallback font would be able to fill in these characters wherever they may occur.

For example, ࣇ is "lam with small tah above," a character added in 2020 which is not covered by Noto Naskh. It is described here https://en.wikipedia.org/wiki/Lam_with_tah_above
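To check what a given character is, a minimal Python sketch using the standard `unicodedata` module follows. The helper name `describe` is made up for illustration. Whether U+08C7 resolves to a name depends on the Unicode version bundled with the Python build in use, so the sketch falls back to a placeholder rather than assuming the character is known.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for one character (hypothetical helper)."""
    # The second argument is returned when this Python build's Unicode
    # data does not know the character, instead of raising ValueError.
    name = unicodedata.name(ch, "<not in this Python's Unicode data>")
    return f"U+{ord(ch):04X} {name}"


# Unicode version shipped with this interpreter
print(unicodedata.unidata_version)

print(describe("\u08c7"))  # the character discussed above
print(describe("\u0644"))  # plain lam, for comparison
```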

I have identified a few open source fonts with more comprehensive character coverage that could fill in the handful of characters Carto currently does not render. Each is designed to fit the model of a "Naskh" font style, and examples can be seen on their respective web pages. They also seem to be relatively popular and are used in a number of existing applications, so their use in Carto would likely be uncontroversial. In order of preference (based on coverage and style):

  1. PakType Naskh Basic, GPL 2.0 license, available at http://paktype.sourceforge.net/index.html This one is very well documented and is clearly informed by in-depth knowledge of various Arabic-based scripts used in South Asia, which may include characters that are not always supported by other fonts, while also conforming to stylistic conventions that fit with those used in different parts of the world. I find it to be quite readable and its proportions to be more balanced than Noto's. Note that the developers have a similarly named font called PakType Naqsh, but I think that one would look more out of place alongside the Noto letters or when considered against the larger Carto style. Looking at the samples here, the "Wide" version of Naskh Basic may also be viable, particularly if readability at smaller font sizes is a concern: https://svn.code.sf.net/p/paktype/code/Fonts/Deployment/Sura-Fatiha.html

  2. Scheherazade New, Open Font License, available at https://software.sil.org/scheherazade/ The Unicode character coverage here is impressive and Google apparently includes this one internally in their font set if that says anything about its comparability to the Noto fonts. Some of the character proportions are decidedly awkward though, and I am admittedly a bit more skeptical of this one knowing that its designers seem to be largely unfamiliar with any language which uses an Arabic-based script.

Honorable mention: Amiri Regular, Open Font License, available at https://www.amirifont.org/ I will be opinionated here and say I found this to be one of the most attractive options I came across, but unfortunately it has at least a few outstanding bugs or missing characters that disqualify it at the moment. It's a bit more ornate than the Noto fonts, but clearly very carefully considered and the sample renderings on the website look quite clear and readable. It was also funded in part by the Google Fonts project, so it has that in common with Noto. The developer(s) seem responsive enough on the GitHub page that it seems like the current problems could be fixed in the near future.

It is possible that the use of a fallback from a different family of fonts may end up making strings of text involving combinations of fonts look awkward. If that is the case, it may be worth considering just switching to one of these fonts for anything currently rendered in the Arabic Naskh style.

Please note that this is different from the issue I made a few months ago at #4547 or the older issue #2208 which concern the application of region or country specific fonts. That is an issue which is much more technically challenging than I had realized when I made it (and one I realized later is further complicated by the fact that some may prefer that Sindhi and languages using related variants of the Arabic script). With this new issue, I am just proposing what to do about representing characters which may occur anywhere on the map, within the Naskh style which Carto currently uses and which represents a reasonable compromise between the various script styles used in different parts of the world.

sommerluk commented 2 years ago

Thanks for your suggestion.

We are always interested in having good font support. However, I'm not sure this is the direction we should take:

pnorman commented 2 years ago

Wikipedia reports that Noto Urdu Nastaleeq supports U+08C7.

sommerluk commented 2 years ago

Looking at the proof sheet, it seems so. However, I could not test this locally: the font ships with Ubuntu 22.04, Ubuntu's version does not have this character, and I cannot install the newer version in parallel to test it.

Noto Sans Arabic does not have it. There is an issue that covers the missing characters: https://github.com/notofonts/arabic/issues/30

pnorman commented 2 years ago

We could add Noto Urdu Nastaleeq to the list fairly easily since we're no longer dependent on system fonts, but I would want some examples of names with the relevant characters in them so we can test. Also, are there characters other than lam with small tah above that Nastaleeq would add, and any that would still be missing?

Implementation-wise, it would probably belong below Noto Sans Arabic UI.
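To make the idea of an ordered fallback list concrete, here is a rough Python sketch. It assumes a simple per-character, first-match lookup; the real font selection and text shaping in the rendering stack are more involved and are not modelled here. The function name `pick_fonts`, the font labels, and the coverage sets are all invented toy values, not the coverage of any real font.

```python
from typing import List, Optional, Set, Tuple

# A font is modelled as (label, set of covered code points).
Font = Tuple[str, Set[int]]


def pick_fonts(text: str, fontset: List[Font]) -> List[Tuple[str, Optional[str]]]:
    """For each character, return the first font in the list that covers it."""
    result = []
    for ch in text:
        chosen = None
        for label, coverage in fontset:
            if ord(ch) in coverage:
                chosen = label
                break  # earlier entries win, later ones are fallbacks
        result.append((ch, chosen))
    return result


# Toy coverage, purely for illustration (NOT real font data):
primary = ("Primary (hypothetical)", set(range(0x0600, 0x0700)))
fallback = ("Fallback (hypothetical)", set(range(0x08A0, 0x0900)))
fontset = [primary, fallback]

# lam, lam with small tah above, lam
sample = "\u0644\u08c7\u0644"
for ch, font in pick_fonts(sample, fontset):
    print(f"U+{ord(ch):04X} -> {font}")
```

In this toy model the middle letter comes from a different font than its neighbours, which is the mixed-font situation raised earlier in the thread as a possible source of awkward-looking strings.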

bgo-eiu commented 2 years ago

Yes, I am aware that Carto uses Noto Sans Arabic but my wording could have been better. All I meant by Naskh was more broadly the set of Arabic script fonts that use a more "regular" character positioning as opposed to Nastaliq type settings in which the letters stack on to each other slightly vertically. Nastaliq fonts tend to do this:

image

People outside of Pakistan find this quite hard to read. For some context: even Iranians were opposed to the idea of using Nastaliq on the Persian Wikipedia the way the Pakistani Wikipedias do, despite the fact that Nastaliq originates in Persia. Letters in these fonts also would not really work well mixed into a word which is not fully rendered in Nastaliq. That is all I meant by Naskh; Noto Sans is more similar to that in how it works, even if it is not the same font.

U+08C7 is just one specific character I happen to be familiar with because I add Punjabi to OSM often and this letter is usable in Punjabi. There are zero current uses of it on the map though, because I intentionally do not include characters which do not have font support. I would say the number of uses on the map is not really relevant here, because people know which characters are and aren't available and tend to use a substitute character where it doesn't render. I use U+08C7 all the time on Wikidata because I can use a custom font there. It represents a common sound that historically lacked a representation; it's not particularly obscure but not essential, more of a "would be nice to have." Some languages have very recently developed writing systems and have started adopting newer characters from others; for example, Shina now uses some letters mixed in from Punjabi and Pashto's Perso-Arabic alphabets, so to a certain extent it's hard to say what a character could become useful for. Anyway, there is no need to focus so much on that specific letter; I just used it as an example I find myself wanting to use often. The more general point is that there are several extended Arabic script characters that seem like they would become viable to use if a fallback font including them (such as the ones above) were made available. You can see a full list of the Noto character coverage here: https://notofonts.github.io/overview/

In that link, you can see that the squares in red are not covered by any Noto font, whereas the squares in light green are only covered in one font. If you look at the "Arabic Extended-A" Unicode block, 13 characters in it have no font support in Noto, and ten are Nastaliq only. The "Arabic Extended-B" block is entirely red and has 41 characters unsupported by Noto. Then there are several unsupported characters in Arabic Presentation Forms-A, and possibly some number scattered in other blocks which are used in some Arabic-based scripts but are not exclusive to them. The "original" Arabic block has a single missing character which appears to be some kind of specific spacing or punctuation character. (Unfortunately, it is impossible to tell where and how people might be using this. Search engines and query services typically can't do anything with Arabic spacing and punctuation, and often the inclusion of a character like this can make an entire word unsearchable. Kind of an interesting tangent: if you look at this Google search result for example, you will see that there are zero results and a box that says "did you like this album?" with a musician but no album title: https://www.google.com/search?kgmid=/g/1z2v7wq5_ Google generates these "ghost" search results from Persian search strings which have the ZWNJ spacing character, which it is somehow able to match to certain data some of the time but isn't able to produce actual search results for.)
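The per-block tallies described above can be reproduced mechanically once the set of code points a font covers is known. The following is a minimal sketch, not the method used by the Noto overview page: `assigned`, `missing`, and the empty `covered` set are placeholders invented here. A real check would fill `covered` from the font's character map (for example via fontTools' `TTFont(...).getBestCmap()`, which is a third-party dependency and not used below). The number of assigned characters in a block also depends on the Unicode version bundled with Python.

```python
import unicodedata
from typing import Dict, List, Set, Tuple

# Block ranges as defined by the Unicode standard (inclusive).
BLOCKS: Dict[str, Tuple[int, int]] = {
    "Arabic Supplement": (0x0750, 0x077F),
    "Arabic Extended-B": (0x0870, 0x089F),
    "Arabic Extended-A": (0x08A0, 0x08FF),
}


def assigned(start: int, end: int) -> List[int]:
    """Code points in [start, end] that this Python's Unicode data has assigned."""
    # Category 'Cn' means "unassigned" in the Unicode general categories.
    return [cp for cp in range(start, end + 1)
            if unicodedata.category(chr(cp)) != "Cn"]


def missing(block: Tuple[int, int], covered: Set[int]) -> List[int]:
    """Assigned code points of a block that the font does not cover."""
    return [cp for cp in assigned(*block) if cp not in covered]


# Placeholder: an empty set stands in for a real font's coverage.
covered: Set[int] = set()

for name, block in BLOCKS.items():
    gaps = missing(block, covered)
    print(f"{name}: {len(gaps)} of {len(assigned(*block))} assigned characters missing")
```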

You could use the website above to match the blocks against the block sheets provided for the fonts I linked to and see what they look like. Part of why I'm hesitant to say anything in too much detail about these individual characters is that I only know a language and a half (English and Punjabi), and I don't want to misrepresent how any of these might be used in languages I don't know as much about, like Sundanese or Javanese, which have characters in Arabic Extended-B according to Wikipedia. I do think just having more options to use characters like this would be good though, as there's no way to know who might want to use them if they aren't available.

It is also not necessarily a given that other fonts will include every Arabic script character in the future - the requirements for Carto are actually quite specific in this regard. To explain:

(To be clear, if you ever rendered Nastaliq in actual Arabic-speaking countries you would likely start receiving complaints about this immediately, so that is something to be mindful of as well.)

bgo-eiu commented 2 years ago

Also, as a side note, I added the "font support" section of the lam with tah above article. It's possible that there are other fonts which support it that I am not aware of, but I made that list just to remember which fonts to look at when writing this issue, so it's the same information that's here rather than secondary information.

bgo-eiu commented 2 years ago

Actually here is an example using more widely supported characters that can hopefully be illustrative in showing why the minutiae of extended Arabic characters can be more important than they seem at first glance.

The Sindhi alphabet has the letter ڻ, which was added to Unicode in 1993 and represents a sound in a similar family of sounds to what ن is for. Most languages in Pakistan, including Urdu, Punjabi, and Saraiki, use a different letter, ٹ. This letter is very common, but I think it is not used in any languages outside of Pakistan. It's in the same family of sounds as ت and other letters shaped like this, where you move your tongue to the front of your mouth, but with this one it goes on the roof of your mouth. It's really a sound that's typical of Indic languages but isn't easy to pronounce for people who don't use it (it's this, sorry for the loud audio https://en.wiktionary.org/wiki/File:Pa-%E0%A8%9F.ogg). The problem is that ڻ and ٹ look exactly the same in the middle of a word but don't represent similar sounds at all:

تٹت تڻت

Sindhi saw this coming before Unicode was even a thing and ٹ was removed from the script and they use a different letter for this sound. Other languages can't use this replacement letter because they are too used to ٹ. However, other languages which use both sounds were stuck in this confusing situation. Around 2005 or so Unicode added a Saraiki extension which included ݨ with a dot in the middle there so you can now tell it's a different letter when connected to others. Before then, people could have used ڻ or just not represented this sound separately in writing by using the same as ن. Neither would have been ideal though unless you were writing in Sindhi. It's slightly more complicated than that, the sound ڻ/ݨ is supposed to be is extremely common in most Indic languages, equally so between Sindhi, Saraiki, Punjabi, and Urdu, but Sindhi and Saraiki writers were more annoyed about the confusing situation than Punjabi writers who still don't often use this, and Urdu writers would get annoyed at the suggestion that ݨ or ڻ should be represented as an independent sound in writing at all even though anybody who writes in Urdu makes this sound all the time when they talk. There's not really any logical explanation for why that is, it just is what it is.
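When two strings render identically, listing their code points shows where they differ. A small Python sketch follows; the escape sequences are my reading of the two sample strings above (teh + tteh + teh, and teh + rnoon + teh), and `codepoints` is a helper name invented for this example.

```python
import unicodedata
from typing import List


def codepoints(s: str) -> List[str]:
    """List each character of s as 'U+XXXX NAME' (hypothetical helper)."""
    return [f"U+{ord(c):04X} {unicodedata.name(c, '?')}" for c in s]


with_tteh = "\u062a\u0679\u062a"   # teh + tteh + teh
with_rnoon = "\u062a\u06bb\u062a"  # teh + rnoon + teh

# The strings are different data even where they look alike on screen.
print(with_tteh == with_rnoon)

for label, s in (("tteh", with_tteh), ("rnoon", with_rnoon)):
    print(label)
    for line in codepoints(s):
        print("  ", line)

# The later Saraiki addition mentioned above, noon with small tah:
print(codepoints("\u0768"))
```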

You should be able to see all these characters now, but just knowing that kind of thing can be a problem makes me inclined to think broad character support is generally important. On a related note, I'm reminded of the fact that the fonts used on GitHub actually make certain very common Arabic script letters look identical, so these issues didn't really go away in the early 2000s. If you look at 'ہ' and 'ه' you might think they are the same letter. Now if you add a letter attached somewhere, 'ہٹ' 'هٹ', it turns out that secretly these were different letters. I did not notice this until it was causing problems in a Punjabi regular expression I was working on. This may explain better what I mean when I say I can't say too much about how much other languages use a given character: ه is never used at the beginning of a word in Punjabi but is used all the time in that position in Arabic, while ہ is not really used in Arabic but is used in any position in Punjabi and can act as a vowel or a consonant. Then Punjabi uses ھ, which can go after certain letters but is not the same as ه despite looking just like that character looks when it's attached to another letter. So not only is it the case that these letters looking the same can cause confusion, the type of confusion it causes is a bit different for every language. It follows that lots of extra slightly different characters keep getting added to the extended Arabic Unicode set, because these types of issues are very common but occur in such a way that every language requires a slightly different solution to them.
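The regular expression pitfall can be shown in a few lines of Python. This is only an illustration, not the actual expression referred to above: the variable names are invented, and the escapes assume the look-alike letters are U+0647 (heh), U+06C1 (heh goal), and U+06BE (heh doachashmee).

```python
import re

HEH = "\u0647"              # ARABIC LETTER HEH
HEH_GOAL = "\u06c1"         # ARABIC LETTER HEH GOAL
HEH_DOACHASHMEE = "\u06be"  # ARABIC LETTER HEH DOACHASHMEE
TTEH = "\u0679"             # ARABIC LETTER TTEH

# Text typed with heh goal followed by tteh.
word = HEH_GOAL + TTEH

# A pattern typed with plain heh can look the same on screen
# yet never match, because the code points differ.
naive = re.compile(HEH + TTEH)
print(naive.search(word) is None)

# One option: list every intended variant explicitly in a character class.
explicit = re.compile(f"[{HEH}{HEH_GOAL}]{TTEH}")
print(explicit.search(word) is not None)
```

Whether it is right to treat these letters as interchangeable in a pattern depends on the language being matched, which is the point made above.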