gravitystorm / openstreetmap-carto

A general-purpose OpenStreetMap mapnik style, in CartoCSS

Enable Noto Nastaliq Urdu as the default font for locations in Pakistan #4547

Closed bgo-eiu closed 2 years ago

bgo-eiu commented 2 years ago

Expected behavior

For locations in Pakistan which have the default 'name' in an Arabic-based script (could be Urdu, Pashto, Punjabi, etc.) to show in Nastaliq script. In most countries which use an Arabic-based script, people are familiar with other type settings and this is less of an issue, even in Afghanistan and Iran. However, Pakistanis almost exclusively read and write using the Nastaliq style and generally find it to be much more readable. Any site or media source which specifically has a Pakistani audience in mind will show the script this way.

Actual behavior

Currently, the fonts.mss file includes this comment:

  1. Noto provides various variants of Arabic: Noto Kufi Arabic, Noto Naskh Arabic, Noto Nastaliq Urdu and Noto Sans Arabic. Kufi and Urdu styles are not widespread in use. Noto Sans Arabic (a Naskh-style low-contrast “Sans” font) and Noto Naskh Arabic are the fonts with the greatest coverage and provide an UI variant. This style uses Noto Sans Arabic UI because it’s consistent with the other Sans fonts and legible. The Arabic fonts are placed behind Sans fonts because they might re-define some commonly used signs like parenthesis or quotation marks, and the arabic design should not overwrite the standard design. The list still includes Noto Naskh Arabic UI for compatibility on systems without Noto Sans Arabic UI.

These statements may have been true at some point in the past, but the status of these fonts and expectations have definitely shifted to a widespread support of Noto Nastaliq Urdu as the most accessible and legible font for Pakistani readers. While it is true that Noto Naskh and Noto Sans Arabic are considered the most legible in a larger number of countries, these are definitely not considered the most legible options in Pakistan. See the links/screenshots section below for specific documentation regarding the use of Noto Nastaliq in Pakistan. The shift to Noto Nastaliq as an expected default generally became clear in 2017, when a number of larger tech companies switched to using it in response to feedback from Pakistani users.

Links and screenshots illustrating the problem

** Note that Microsoft Urdu Typesetting was added to Windows in the first place for Pakistani users, it just happens that in this case their version of the font was not good enough.

** This new wiki template places Noto Nastaliq and the Microsoft Urdu Typesetting at the top above Jameel Noori and asks that the first two fonts be at the top of the priority list no matter what. The reason for this being that the vast majority of Pakistani users on an operating system from 2017 or more recently has at least one of these two fonts on their system. Noto is more legible, Urdu Typesetting is a compromise.

image This is a screenshot of what Karachi looks like at the moment for reference

image This is an Urdu map of Karachi and its surrounding subdivisions in Sindh province by Pakistani Wikimedia Commons uploader Tahir mq.

imagico commented 2 years ago

This is technically a part of #2208 (the title is somewhat misleading there - this is generally about scripts where the same unicode characters are used for different variants of the same script or different closely related scripts, CJK as well as Cyrillic and Arabic).

Thanks for pointing out that this is an issue with languages using Arabic script as well.

There are different problems that make this an issue that is hard to solve, primarily the fact that there is no agreement in OSM to record the language of the name tag (or to agree to abolish use of the name tag in favor of separately recording language specific names and the local naming conventions - see http://blog.imagico.de/you-name-it-on-representing-geographic-diversity-in-names/).

What is a different issue is that the font used on https://ur.wikipedia.org/wiki/%D8%B5%D9%81%D8%AD%DB%82_%D8%A7%D9%88%D9%84 seems unsuitable for being used at small font sizes - like those we use for POI or road labels, see for example https://www.openstreetmap.org/#map=18/24.85637/66.99087. Is something like this:

https://upload.wikimedia.org/wikipedia/commons/thumb/0/03/%D9%81%D9%88%D9%86_%DA%A9%D8%B1%DB%8C%DA%BA.svg/200px-%D9%81%D9%88%D9%86_%DA%A9%D8%B1%DB%8C%DA%BA.svg.png

actually readable to an Urdu speaker? The font used on

https://ur.wikipedia.org/static/images/project-logos/urwiki.png

seems better - but i don't know which it is.

bgo-eiu commented 2 years ago

I did see the Han unification / CJK thread, though this particular situation is definitely simpler and potentially more actionable. There would be no requirement to know anything about the language of the tagged name, as the Romanized script text and Arabic script text do not have any overlapping characters like different CJK characters can have, and people expect to see any Arabic-based script written the same way as they expect to see Urdu no matter what the original etymology is. If a place name is identical in Urdu and Sindhi, it does not really matter if the reader speaks Sindhi, the place name would look just the same either way. A closer analogy for how place names are seen in Pakistan would be Canada, where English speakers and French speakers expect to see Romanized script presented the same way regardless of language, and whether or not your native tongue is English or French does not really matter when it comes to reading the place names in the other language.

The way to implement this would be the same regardless of how name tagging is approached, it would be just to prioritize the Nastaliq font over Naskh within Pakistan - just that country, no need anywhere else. I checked regarding Afghanistan and Iran, and users on their respective Wiki sites for example actually rejected the idea of making it default there. They may use Nastaliq in some specific situations, but Pakistan is really the only country where this typesetting is used as the default for everything; on signs, in the media, at school, etc. Nothing with a Romanized script name would be affected because it doesn't overlap, and essentially everything else would be legible in Nastaliq.

That first example of the small wikimedia upload with the phone icon does look crowded, but I think that is just an artefact of the source file, which is an SVG that has the actual paths and nodes of the lettering explicitly defined rather than referring to a font, so it does not scale down like a font is supposed to. It is also not clear what font was used as the basis for the SVG. Noto Nastaliq is a bit magic in the way it does font scaling, which is probably why it has quickly become the preferred font for this purpose. If you look at a one-to-one example of Noto Nastaliq vs. Noto Naskh at the same font size, I think it looks at least as legible. It is arbitrary at the end of the day but note that the "tails" which make the Nastaliq script taller help to keep the letters looking distinct from each other at smaller sizes.

Noto Naskh 10px image

Noto Nastaliq 10px image

Noto Naskh 8px image

Noto Nastaliq 8px image

I think the 8px is even smaller than the size I am seeing on the roads of the OSM.org link you provided, so I think Nastaliq could work at that scale.

I am not familiar with the technical details of how font styling works in Carto - if the Noto fonts are in the project itself, even better.

Edit: I realize these scaled lossy raster screenshots are not the ideal way to view this either. Github also doesn't retain the scale of the images. It would probably be more informative to look directly at the links where you can try the font at different sizes: https://fonts.google.com/noto/specimen/Noto+Nastaliq+Urdu https://fonts.google.com/noto/specimen/Noto+Naskh+Arabic?subset=arabic

imagico commented 2 years ago

> it would be just to prioritize the Nastaliq font over Naskh within Pakistan

Which is exactly the same approach as the most frequently suggested solution to the Han unification problem. This however makes the assumption that languages are strictly separated by administrative unit - which is not the case, neither in China/Korea/Japan nor in Pakistan and Middle East countries.

Note we will most likely not want to federate the style based on design preferences in different countries. If we come up with a solution to #2208 we will still want to render names in a certain language in a unified form globally and not differently based on location. So all Urdu names would be rendered in one font globally and not differently depending on the country in which the feature with that name is located. If the Pakistan OSM community would like to use a certain font for any Arabic script names independent of language the way to go would be to set up a distinct map for that (which could obviously be a derivative of OSM-Carto).

Regarding your samples - note that nominal text size is often not that meaningful. Your Noto Naskh 10px example has a total height of 18px, the Noto Nastaliq 10px example has a total height of 25px.
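
To make the comparison concrete, here is a minimal Python sketch of the arithmetic. It assumes total rendered height scales linearly with nominal size, which is only an approximation for real fonts, and the function name is made up for illustration.

```python
def equal_height_nominal_size(nominal_px, total_height_px, target_height_px):
    """Nominal size that would yield target_height_px, assuming linear scaling."""
    return nominal_px * target_height_px / total_height_px


# Figures from the samples above: at a nominal 10px, the Noto Naskh sample is
# 18px tall in total and the Noto Nastaliq sample is 25px tall.
nastaliq_matched = equal_height_nominal_size(10, 25, 18)
print(nastaliq_matched)  # 7.2
```

Under that assumption, Noto Nastaliq would need a nominal size of roughly 7px to occupy the same vertical space as Noto Naskh at 10px.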

bgo-eiu commented 2 years ago

Well, it doesn't really make that assumption about languages themselves, just about typesetting. Keep in mind most Pakistanis don't speak Urdu natively - it's a lingua franca. They learn Urdu writing a certain way through the education system, media etc. and then that has resulted in people writing their mother tongues the same way. Urdu is also mutually intelligible with Hindi, so you are already seeing many of the same words and names printed in a different font on either side of Pakistan and India. If you were to actually enforce rendering the same language the same way in every country, you would have to alter the labels in large parts of India as they appear currently, which would not be popular. For example, the majority of Punjabi speakers live in Pakistan, but you will never see Punjabi in the same form in India even though it is the same language. Part of how this has happened without Carto accounting for it is that the ISO language codes tend to duplicate the same language for different countries in South Asia, because those codes were made by people who were ignorant of languages in the region.

The point here though is that as far as the style of writing is concerned, that does begin and end at the country borders - carto wouldn't be seen as an arbiter of that, because it has to do with how people learn to write, and school systems are government run and end and begin at those borders. India is actually the only other country where you will see Urdu names and while it's unlikely you will see them written in the Arabic script, if you do, the Nastaliq script would be more appropriate there too. The main language that would look different around the border that doesn't already is actually Pashto - that is shared with Afghanistan. If you would rather it be tied to language than border, you could have Nastaliq on Urdu, Punjabi, Sindhi, and Baloch, and leave Pashto the same.

I would also say this isn't quite the same as design preferences, as this has a significant impact on legibility, and within the current policy to show things in the language of the area it makes sense. Even though these are languages with Arabic based scripts, none of them are particularly close to Arabic, and there are different sounds referred to and even letters or modifications that are never used in Arabic. (Unicode/font developers have tended to group these with Arabic, kind of like how dotless ı is part of Romanized scripts but only used for Turkic languages.)

Imagine if, for example, the font used for Romanized scripts did not include support for the range of diacritics used in French, so names in France did not include them, as if they were written in an English-speaking country. This would be seen as inaccurate even by English speakers, who expect to see French written that way, even when French-origin names in English-speaking countries are written without them. The same goes for all of the languages spoken in Pakistan. Any person familiar with any of the languages spoken in Pakistan would expect to see them written in the Nastaliq form there, in the same way anybody in America or England would expect Romanized script in France or Turkey to include features they don't use themselves.

It may also help to know that the convention in Arabic scripts is to omit "short" vowels, so someone from an actual Arabic-speaking country isn't necessarily going to have an easy time reading Urdu correctly even if it's in a familiar typeface. It requires some working familiarity with the words/names themselves to know the pronunciations as these are essentially cursive strings of consonants. So this wouldn't be seen as making the map less legible to people using Arabic scripts in different languages, because they're not necessarily that interchangeable to begin with. The only language where you would see the script change around the border, as I mentioned, is Pashto. Even with that though, the official lingua franca in Afghanistan is Persian (called Dari there), not Urdu, so people there are familiar with non-Nastaliq script through that in a way Pakistanis are not.

bgo-eiu commented 2 years ago

https://en.m.wikipedia.org/wiki/Lam_with_tah_above

This is a letter that is only used in Pakistan. Noto Nastaliq is one of the only fonts that includes it. If rendered in the Nastaliq font, people who speak other languages would at least be able to see what it is, rather than an empty character box as it shows in most fonts.

The two languages mentioned in the article are Punjabi and Kalasha. Kalasha is only used in Pakistan, and Punjabi is shared between Pakistan and India, but tagged differently as name:pnb and name:pa respectively because people in the two countries don't understand each other's scripts. You could tie the font to the languages, or you could tie it to the country; the outcome would in practice be exactly the same because of the reality on the ground.

bgo-eiu commented 2 years ago

As far as sizing goes, if there's a reference for how the smaller sizes are selected in the renderer code that would be helpful. The point still stands in any case that the font does look a lot better at smaller scales than the rasterized vector path image; if you make the two fonts the same pixel height, it still seems fine.

Oh and also in case it wasn't implied - Pakistanis don't expect to see Arabic scripts written in Nastaliq in Arabic-speaking countries either. So there would not be much reason to make a map that substitutes the font on any Arabic script.

1ec5 commented 2 years ago

> The font used on https://ur.wikipedia.org/static/images/project-logos/urwiki.png seems better - but i don't know which it is.

An earlier version of the Urdu Wikipedia logo used Jameel Noori Nastaleeq. A later revision uses Nafees Web Naskh based on a community consensus.

bgo-eiu commented 2 years ago

I want to point out the obvious here as well for the sake of clarity - the name of Wikipedia transliterated to Arabic script is not a great reference point for legibility or recognizability because it's not a recognizable word.

bgo-eiu commented 2 years ago

Might be helpful to link some reference points for this from government primary sources as a supplement, especially since those are the kind of sources mappers (from anywhere) may come across, and can show some more variation of what legible/recognizable writing might look like in different contexts.

Due to the historical lack of support for the Nastaliq typesetting and Pakistani languages more broadly, using static images in conjunction with or instead of textual content is commonplace.

A "retweet" from the official Azad Kashmir account, where a tweet from the territory's head minister has been shared as an image file to display it in the script for readers rather than using the actual retweet function: https://twitter.com/GovtofAJK/status/1524671152022564866/photo/1

Most local government websites follow a template which involves a combination of English text content for site navigation/descriptions, and embeds or links to image files of public notices in Urdu or another language. You can see there are a number of stylized script variants that may get used for headings and titles, but the text in the main content/body of these pages is typically presented the same way for legibility.

List of wards and their contact numbers on the site for the city of Jaranwala: http://www.mcjaranwala.lgpunjab.org.pk/Wards.html

Public notice regarding the installation of LED street lights in Sialkot: http://tmalayyah.lgpunjab.org.pk/tmasialkot.com/wwwroot/Download-files/expressionofinterestforledlights.jpg

Contact information for a Balochistan government office: https://allpaknotifications.com/office-of-the-accountant-general-balochistan-complaint-contacts-details/ (This is a site which hosts mirrors of these kinds of government notices with textual captions, making them easier to find through a search engine.)

pnorman commented 2 years ago

> it would be just to prioritize the Nastaliq font over Naskh within Pakistan - just that country, no need anywhere else.

This is not possible. The fonts used are worldwide, and we do not have the ability to use different fontstacks per country.

imagico commented 2 years ago

> This is not possible. The fonts used are worldwide, and we do not have the ability to use different fontstacks per country.

Well - theoretically we could filter the data in SQL by location and style it differently (using different font lists) depending on if the feature is within a certain admin boundary or not. But this would be technically rather complex (and would make the whole style quite a bit more complex) and we have always had in the past a clear consensus that we do not want to federate the style.
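
As a rough illustration of the idea only, the following Python sketch picks a font list depending on whether a feature's coordinates fall inside a boundary polygon. It is not the SQL or CartoCSS the style would actually need. The exact font face names, the function names, and the toy rectangle standing in for an admin boundary are all assumptions made for this example.

```python
# Font families are the ones named in this thread; the exact face names
# ("... Regular") are assumptions for illustration.
DEFAULT_ARABIC_FONTS = ["Noto Sans Arabic UI Regular", "Noto Naskh Arabic UI Regular"]
NASTALIQ_FIRST_FONTS = ["Noto Nastaliq Urdu Regular"] + DEFAULT_ARABIC_FONTS


def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside polygon, a list of (lon, lat)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which this edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def font_stack_for(lon, lat, region):
    """Choose the Nastaliq-first list inside the region, the default list elsewhere."""
    if point_in_polygon(lon, lat, region):
        return NASTALIQ_FIRST_FONTS
    return DEFAULT_ARABIC_FONTS


# Hypothetical rectangle standing in for a real admin boundary polygon.
TOY_REGION = [(61.0, 24.0), (77.0, 24.0), (77.0, 37.0), (61.0, 37.0)]

# Coordinates rounded from the Karachi map link earlier in this thread.
karachi = font_stack_for(66.99, 24.86, TOY_REGION)
# An arbitrary point outside the toy rectangle.
elsewhere = font_stack_for(39.2, 21.5, TOY_REGION)
```

Even in this toy form, every label query would need a boundary lookup and every text style would need two font lists, which gives a sense of the added complexity.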

There are a number of things i think that would be good to know that are not clear to me at the moment. Regarding real world use of writing styles - since the information i have seems to be conflicting on that:

And regarding legibility at small font sizes:

bgo-eiu commented 2 years ago

Yes, I have realized now that scoping a font to a geographic region in the existing Carto infrastructure is a much more challenging task than I had initially assumed. I still think it is worth considering, especially since this would not be the only use case for it, but it is likely I will also spend some time looking into how to achieve this through a vector tiles library that has some existing features in place for geographic queries.

> What is common as a writing style for Urdu using Arabic script in various countries where Urdu is used significantly? In general everyday life but specifically of course also in maps? The English Wikipedia page on Urdu shows two examples from India and the UAE

Pakistan is the only country where Urdu written in Arabic script is in widespread use. Urdu is also spoken widely in India, but because it is mutually intelligible with Hindi, and most Indians use the Devanagari script, most Urdu speaking Indians would be fine to use the Devanagari script for practical purposes because that is what others locally understand. Historically, when India and Pakistan were one country, this was something that differed depending on whether you were Muslim or Hindu, but in present day situation, Indian Muslims are increasingly less likely to have much context or reason to use the Arabic script. Some references which discuss this:

If you do see Arabic script written in India, it will be in Nastaliq, since writing that way is something people have been doing since before Pakistan and India split apart.

The reason the UAE has Urdu signage is because in large parts of the UAE, the majority of the population is comprised of Pakistani immigrants. Only about 1 in 10 people living in the UAE is Arab, and there are more Pakistanis there than Arabs, so even though the government is operated by Arab nationals they have to provide documents, signage, etc. in Urdu for all the Pakistanis that live and work there. If you see Urdu in the UAE, it is for that purpose specifically as people of other nationalities there would not understand it. There are parts of Bahrain, Oman, Kuwait, Saudi Arabia, and Qatar additionally where Pakistani immigrants are numerous enough that you may encounter examples like this.

> What styling is typically used for Arabic language text in Pakistan? That is practically probably not that common so it might be difficult to assess reliably.

Most Pakistanis are actually exposed to some Arabic language text through religion, as Muslims are expected to read and memorize religious texts in their original language even if they do not understand the meaning of the words. However, readability is not necessarily an objective of religious texts, and they are most typically styled in elaborate calligraphic forms that are considered more suited for their significance. Arabic in a Quran would look like this: image

In some older Qurans, the calligraphy may be complex in such a way that no two instances of a letter look quite the same. Often the way this is taught is not even by breaking down the letters and sounds that make up words, but just how to say the corresponding passage or phrase as a whole and the general form it takes.

There is actually a set of Quranic ligature characters in unicode that comes from this context. ﷽‎ is a whole phrase condensed to a single character that is practically impossible to read if you try to look at the original words. Someone familiar with the religion though would recognize that general shape/form as being associated with the Islamic phrase it references just from looking at it and would not need to actually read the individual letters. None of that is really helpful or relevant for a map, and does not really make it easy for Pakistanis to read non-religious Arabic (which also does not resemble Arabic for every day use that closely), but that would be the literal answer to this question.

> it would be very useful if someone would do some tests with Noto Nastaliq - both for horizontal point labels (populated places, POIs) and for line labels (e.g. roads) to see if at a size that is legible this works with the label offsets and the road rendering widths we use so far.

I do want to try this, I will share if I have time for it at some point soon.

imagico commented 2 years ago

If i understand you right you essentially seem to indicate that Urdu is - when written in Arabic script - overwhelmingly written in Nastaliq style independent of the country. That would mean if we knew what the language in the name tag is we could select the font based on language (which would still be technically challenging - but that would be a general problem of solving #2208).

I understand that this issue is not exclusively about Urdu - but solving it for Urdu would already be an improvement obviously.

bgo-eiu commented 2 years ago

Yes, if that is easier to do it would definitely be an improvement. If it were implemented based on language it would have to follow some kind of rule system to avoid conflating languages.

Some scenarios from the map...

Jeddah, Saudi Arabia has a three-letter name that does not have any sounds which are incompatible between Arabic and Urdu. The most reliable way I can think of to discern that this is an Arabic name is the Wikidata item tagged on the feature, which contains the "native label" property that states the origin language of the name. For any name which is tagged with multiple identical language labels, some of which are not written with Nastaliq, the renderer would have to consult an index based on the wikidata in the absence of geographic context. Tagging the "native label" in OSM itself has historically been unpopular, and at this point would likely be seen as redundant to the wikidata tag if proposed again.

Peshawar, Pakistan starts with the letter پ which is not used in Arabic, so the Arabic name does not match. Persian (name:fa) does have the letter پ, but the Persian localization also omits the vowel ا, so it does not match. The only matches to the name tag here are the Urdu and Sindhi (name:sd) localizations - Sindhi is a language which is only written in an Arabic-based script in Pakistan, so the presence of an identical name:ur and name:sd can be used to infer that this name is safe to display in Nastaliq without referencing the wikidata tag.

This has no other name tags so we can safely infer this is Urdu. Road names do not often have Wikidata tags, but it is also uncommon for them to be translated outside of a local language.
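
The rule suggested by these scenarios could be sketched roughly as follows in Python. This is only one possible reading of the scenarios, not an agreed method. The set of language codes, the function names, and the tag values in the examples are assumptions for illustration and are not copied from OSM data. The sketch takes no location input, so the third scenario (no name:<lang> tags at all) is reported as "unknown" rather than inferred to be Urdu.

```python
# Language codes assumed, for this sketch only, to be conventionally set in Nastaliq.
NASTALIQ_LANGS = {"ur", "sd", "pnb"}


def matching_languages(tags):
    """Language codes whose name:<lang> value is identical to the plain name tag."""
    name = tags.get("name")
    return {
        key.split(":", 1)[1]
        for key, value in tags.items()
        if key.startswith("name:") and value == name
    }


def classify(tags):
    """Return 'nastaliq', 'ambiguous', or 'unknown' for one feature's tags."""
    langs = matching_languages(tags)
    if not langs:
        return "unknown"    # nothing to compare against; would need local context
    if langs <= NASTALIQ_LANGS:
        return "nastaliq"   # every matching language is in the Nastaliq set
    return "ambiguous"      # some matching language is not; needs e.g. Wikidata


# Illustrative tag sets (hypothetical values).
peshawar_like = {"name": "پشاور", "name:ur": "پشاور", "name:sd": "پشاور", "name:ar": "بيشاور"}
jeddah_like = {"name": "جدة", "name:ar": "جدة", "name:ur": "جدة"}
road_like = {"name": "پشاور"}

results = [classify(t) for t in (peshawar_like, jeddah_like, road_like)]
```

Here the Peshawar-like case comes out as "nastaliq", the Jeddah-like case as "ambiguous", and the untranslated road as "unknown", which shows how much still depends on information outside the name tags.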

imagico commented 2 years ago

As your examples show determining the language of the name tag is highly non-trivial and very often not possible without additional information. I would be very reluctant to use external data sources for that unless there is clear consensus in the OSM community that this information is not to be recorded (which there quite clearly is not). I also doubt it would be good if we require mappers to redundantly record name=* and name:<lang>=* to ensure proper label rendering - that is inefficient mapping practice and most local communities with a single dominant language do not see this favorably. And as you illustrated this is not without ambiguities either. A real solution would be if the OSM community could either agree on a method to record the language of the name tag or to abolish the generic name tag and record the locally used language(s) instead in some way.

I would suggest moving any general discussion of how to practically determine the language of the name tag to #2208 because it is a generic issue for many languages that make shared use of the same unicode characters.

pnorman commented 2 years ago

Duplicate of #2208.

Both issues are a case of a unicode codepoint having different renderings depending on language and/or location. If we ever figure out a solution to Han unification, it will also apply here.

sommerluk commented 2 years ago

I've updated the description of #2208 to cover this issue.