Language localization - Githubissues

wipfli commented 1 month ago

Currently, the basemap does not have any language localization capabilities. Country, state, and place names are taken from the OSM name tag and contain information in the local language or languages. For example, the country label of Germany is "Deutschland" whereas the country label for Italy is "Italia". In this issue I would like to propose a scheme for displaying names localized to a specific user language, which should make the basemap accessible to a wider audience.

Assumptions

Let us make the following assumptions:

A typical user speaks and reads primarily one language, their first language.
A typical user expects to see map labels in their first language.
We have a database with labels where
- name contains the local name(s) ~~using a single script~~.
- name:<language-code> contains the name in a specific language.
- For each <language-code> we know the script(s).
- There is a defined list of supported <language-code>s.

Definitions

Language localization: display labels in the first or preferred language of a user.
Language fallback chain: If a label is not available in the target language, try another language which is similar to the first language. All languages in the fallback chain use the same script.
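
A minimal sketch of how such a fallback chain could be resolved, in Python. The chain contents in `FALLBACK_CHAINS` and the function name `localized_name` are hypothetical and only illustrate the lookup; this issue does not define the actual chains.

```python
from typing import Dict, List, Optional

# Hypothetical fallback chains. Every language in a chain uses the same script.
FALLBACK_CHAINS: Dict[str, List[str]] = {
    "nn": ["nn", "no", "da"],  # Latin script only
    "be": ["be", "ru", "uk"],  # Cyrillic script only
}


def localized_name(tags: Dict[str, str], target: str) -> Optional[str]:
    """Return the first available name:<language-code> along the chain, else None."""
    # Languages without a configured chain only try their own name:<code> tag.
    for code in FALLBACK_CHAINS.get(target, [target]):
        value = tags.get(f"name:{code}")
        if value:
            return value
    return None
```

For a feature tagged only with `name:no`, a request for `nn` would return the `name:no` value under these assumed chains.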

Proposed Supported Languages

Below is a proposed list of roughly 80 supported languages. The languages are grouped by script and some languages may use more than one script. Note that for some scripts such as Telugu or Khmer we need to create a positioned glyph font first.

The structure is:

Language: <language-code>, number of nodes/ways/relations in name:<language-code> in OSM's taginfo

Latin

AFRIKAANS: af, 10k
ALBANIAN: sq, 10k
AZERBAIJANI: az, 10k
AZERBAIJANI (Arabic script): az-Arab, 2k
BASQUE: eu, 70k
BOSNIAN: bs, 6k
CATALAN: ca, 600k
CROATIAN: hr, 20k
CZECH: cs, 50k
DANISH: da, 10k
DUTCH: nl, 80k
ENGLISH: en, 6M
ESTONIAN: et, 10k
FINNISH: fi, 400k
FILIPINO: fil, 600
FRENCH: fr, 600k
GALICIAN: gl, 10k
GERMAN: de, 500k
HUNGARIAN: hu, 60k
ICELANDIC: is, 3k
INDONESIAN: id, 10k
ITALIAN: it, 100k
LATVIAN: lv, 10k
LITHUANIAN: lt, 50k
MALAY (Latin script): ms, 70k
MALAY (Arabic script): ms-Arab, 3k
NORWEGIAN: no, 10k
NORWEGIAN NYNORSK: nn, 4k
POLISH: pl, 300k
PORTUGUESE: pt, 50k
ROMANIAN: ro, 50k
SLOVAK: sk, 20k
SLOVENIAN: sl, 10k
SPANISH: es, 100k
SWAHILI: sw, 20k
SWEDISH: sv, 100k
TURKISH: tr, 30k
UZBEK: uz, 10k
UZBEK (Latin script): uz-Latn, 1k
UZBEK (Cyrillic script): uz-Cyrl, 1k
UZBEK (Arabic script): uz-Arab, 900
VIETNAMESE: vi, 30k
ZULU: zu, 1k

Arabic

ARABIC: ar, 1M
FARSI: fa, 50k
URDU: ur, 80k

Cyrillic

BELARUSIAN: be, 400k
BULGARIAN: bg, 30k
KAZAKH: kk, 40k
KAZAKH (Latin script): kk-Latn, 1k
KAZAKH (Arabic script): kk-Arab, 8k
KAZAKH (Cyrillic script): kk-Cyrl, 1k
KYRGYZ: ky, 5k
MACEDONIAN: mk, 40k
RUSSIAN: ru, 1M
SERBIAN (Cyrillic script): sr, 300k
SERBIAN (Latin script): sr-Latn, 200k
UKRAINIAN: uk, 1M

Han

CHINESE: zh, 1M
CHINESE (SIMPLIFIED): zh-Hans, 100k
CHINESE (TRADITIONAL): zh-Hant, 300k

Devanagari

GUJARATI: gu, 4k
HINDI: hi, 60k
MARATHI: mr, 10k
NEPALI: ne, 10k

One Language Per Script

AMHARIC: am, 8k
ARMENIAN: hy, 40k
KOREAN: ko, 700k
KOREAN (Latin script): ko-Latn, 100k
KOREAN (Hanja script): ko-Hani, 50k
JAPANESE: ja, 1M
JAPANESE (Hiragana script): ja-Hira and ja_kana, 200k
JAPANESE (Latin script): ja_rm and ja-Latn, 100k
GEORGIAN: ka, 60k
GREEK: el, 100k
MONGOLIAN: mn, 10k
MONGOLIAN (Traditional script): mn-Mong, 1k
MONGOLIAN (Cyrillic script): mn-Cyrl, 1k
HEBREW: he, 100k
KANNADA: kn, 90k
BENGALI: bn, 10k
BURMESE: my, 40k
KHMER: km, 8k
LAO: lo, 3k
MALAYALAM: ml, 30k
PUNJABI: pa, 30k
SINHALESE: si, 2k
TAMIL: ta, 20k
TELUGU: te, 20k
THAI: th, 100k

Proposed Rules

If the target language is not available, follow a language fallback chain. End in the name tag only if the script of the target language and the script of the name tag are the same.
Display country labels only in the target language.
Display state labels only in the target language.
Display place labels in one or two lines.
1. One line: The target language uses the same script as the name tag. In this case only show the label in the target language in a single line label.
2. Two lines: The target language uses a different script than the name tag. In this case show two lines. First the target language, second the name.
Street labels follow the same logic as place labels.
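
The one-line/two-line rule for place labels could be sketched in Python as follows. The function name and parameters are hypothetical. The case where no localized name exists and the scripts differ is not spelled out by the rules above, so showing only the name in that case is an assumption of this sketch.

```python
from typing import List, Optional


def place_label_lines(
    localized: Optional[str],  # result of the fallback chain, None if nothing found
    name: str,                 # value of the name tag
    target_script: str,        # script of the target language, e.g. "Latin"
    name_script: str,          # script of the name tag, e.g. "Greek"
) -> List[str]:
    """Return the lines of a place label, top line first."""
    if target_script == name_script:
        # One line: the chain may end in the name tag because the scripts match.
        return [localized or name]
    if localized is None:
        # Assumption: with no localized name available, show the local name only.
        return [name]
    # Two lines: target language first, local name second.
    return [localized, name]
```

Country and state labels would skip the second line and use only the localized value.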

Examples

Localized to English

Country example 1:

Switzerland

City Example 1:

Geneva

Country Example 2:

Greece

City Example 2:

Athens
Αθήνα

Localized to Greek

Country example 1:

Ελβετία

City Example 1:

Γενεύη
Genève

Country example 2:

Ελλάδα

City example 2:

Αθήνα

bdon commented 1 month ago

All looks like good assumptions

There may be some complication in name:<language-code> with zh-Hans and zh-Hant. I believe Tilezen had some special logic for this related to one or the other missing, to fill in the zh slot. It seems out of scope to perform any automated conversion between them. @nvkelso any lessons learned from Tilezen here?

wipfli commented 1 month ago

Chinese is actually an interesting case, because there are quite a lot of entries in OSM:

1 520 666, name:zh
372 153, name:zh-Hant
179 410, name:zh-Hans
80 436, name:zh-Latn-pinyin
17 783, name:zh_pinyin
9 800, name:zh_zhuyin

wipfli commented 1 month ago

Here is a list of all OSM name tag values that use more than one script: LINK (15 MB). It has something like 500k entries.

wipfli commented 1 month ago

Update: The assumption that we have a database where the name tag always contains only one script is wrong. For example names in Morocco come often in 3 scripts: Latin, Arabic, and Tifinagh.

wipfli commented 1 month ago

The rules proposed here are roughly implemented in this demo:

https://wipfli.github.io/maplibre-feature-properties-transform-example/language?script=Latin&language=en#map=8/38/24

bdon commented 1 month ago

> Update: The assumption that we have a database where the name tag always contains only one script is wrong. For example names in Morocco come often in 3 scripts: Latin, Arabic, and Tifinagh.

This should also be the case in Hong Kong and a few other places due to mapping conventions.

In these situations we could ignore name completely because the parse is unreliable (unless it very consistently breaks on / etc?)

- If I look at Hong Kong in English -> the 2nd label should be Chinese
- If I look at Hong Kong in Chinese -> the 2nd label should be English

What happens if I look at it in Russian though? Should it show 3 labels?

wipfli commented 1 month ago

Hong Kong is an interesting example because there are two languages (English, Chinese) and two scripts (Latin, Han).

Let us assume for a moment that we have a database where the local names of a city can be stored separately:

name_1 = Hong Kong
name_2 = 香港

The ordering of the names has a meaning, maybe number of people speaking the language or administrative/cultural use.

If we had this dataset of listed names, we could do the following rule:

Display place labels in one, two, or three lines.
- One line: The target language uses the same script as the name_1 and name_2 tag. In this case only show the label in the target language in a single line label.
- Two lines: The target language uses a different script than the name_1 or the name_2 tag. In this case show two lines. First the target language, second the name_1 if the script is different, else name_2.
- Three lines: The target language uses a different script than the name_1 and the name_2 tag, and name_1 and name_2 use different scripts. In this case show three lines. First the target language, second the name_1, the third name_2.

With this rule we would get the following for Hong Kong:

English:

Hong Kong
香港

Chinese:

香港
Hong Kong

Russian:

Гонконг
Hong Kong
香港
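
The one/two/three-line rule above could be sketched like this in Python. The `Name` tuple and the function `stacked_label` are hypothetical names for illustration; the sketch assumes the script of each name is already known.

```python
from typing import List, Tuple

Name = Tuple[str, str]  # (text, script)


def stacked_label(target: Name, name_1: Name, name_2: Name) -> List[str]:
    """Target language first, then each local name whose script is not shown yet."""
    lines = [target[0]]
    shown_scripts = {target[1]}
    for text, script in (name_1, name_2):
        if script not in shown_scripts:
            lines.append(text)
            shown_scripts.add(script)
    return lines


hong_kong = (("Hong Kong", "Latin"), ("香港", "Han"))
print(stacked_label(("Hong Kong", "Latin"), *hong_kong))
print(stacked_label(("香港", "Han"), *hong_kong))
print(stacked_label(("Гонконг", "Cyrillic"), *hong_kong))
```

Tracking the scripts already shown keeps the label to one line per script, which is what produces the three Hong Kong variants listed above.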

nvkelso commented 1 month ago

Special cases for country and region (state) labels:

> Display country labels only in the target language
> Display state labels only in the target language

Do you mean to say the country and state labels would not be "stacked" by default, and the value of the "single line" label would follow the fallback chain in rule 1, like (modified):

> If the target language is not available, follow a language fallback chain. End in the name tag only if the localized values in fallback chain are unavailable. (Possible variation to only take the 1st element in / separated multi-value?)

Or do you mean to say no name would be displayed at all if the localization data is unavailable?

Street labels:

For street labels, are you proposing the labels be stacked or delimited (concatenated)?

Multiple alphabet option languages:

Might be worth adding more details around the languages which can be represented in multiple alphabets. Does each writing system belong in a different fallback chain?

Chinese

For Chinese, Tilezen v1.9 added basic detection of Chinese simplified versus traditional for each name, and localized name key-value pairs and modified the tile output to better annotate that. The raw input data in OSM is sometimes quite messy.

Overall:

On the display side, our recommendations and best practices match and are exceeded by what @wipfli is already suggesting here, nice! The demo matches my expectations :)

wipfli commented 1 month ago

Thanks for your questions @nvkelso.

Country labels should appear only in the target language. For example, if the target language is French, then the country labels should be "Allemagne", "Suisse", or "Autriche". So here no stacking is needed. From my experience, country labels have great language coverage, so we should not need a fallback chain at all. My preference would be to define a set of supported languages and then make sure that for each supported language we have a name:<language-code> for the country labels.

State labels are currently used too much in the Protomaps basemap in my opinion. In the US it might make sense to have them on the map because the country is huge and most states are huge too, but in smaller countries the state labels are not needed. I will open a separate issue at some point to propose to only have state/province labels for these countries: US, Canada, Mexico, Brazil, China, India, Australia. For now I propose to ignore the problem of state labels and treat them either like city labels or like country labels; we can see which one works better.

Street labels I honestly have not thought much about yet.

Some languages use multiple scripts, for example Kazakh or Uzbek. However, there is very limited coverage in OSM for the different scripts (only around 1k names), so I don't see the value at the moment of adding special logic for these languages. Japanese might have larger coverage and also uses multiple scripts, but so far my impression is that the sample code works quite well in Japan. I want to reach out to some Japanese friends and ask them for input.

nvkelso commented 1 month ago

@wipfli For the tiles, is any more work needed besides what was already merged in https://github.com/protomaps/basemaps/pull/254 (already in a tagged release)? It seems like the remaining work in this issue is mostly about display business logic, is that right?

wipfli commented 1 month ago

Thanks for asking! With the current tiles we have information in the pmap:script tag about the script used in the name tag. Here are some examples:

Athens:
- name = Αθήνα
- pmap:script = Greek
Zürich:
- name = Zürich
- pmap:script is absent and that implies the script is Latin
Hong Kong:
- name = Hong Kong 香港
- pmap:script = Mixed

So you see that in the case of Zürich and Athens, where only one script is used in the name tag, we can build everything we want with MapLibre style expressions.
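
As a rough illustration, a `text-field` style expression for a target language written in Latin script might look like the one generated below. This is a sketch and not the style used in the demo: the helper `text_field_expression` is hypothetical, and it only uses the standard expression operators `case`, `has`, `get`, `coalesce`, `concat`, `!` and `==`. Python is used here just to build and print the JSON.

```python
import json


def text_field_expression(lang: str) -> list:
    """Build a text-field expression for a target language written in Latin script."""
    # Localized name if present, otherwise the name tag.
    localized = ["coalesce", ["get", f"name:{lang}"], ["get", "name"]]
    return [
        "case",
        # pmap:script absent: name is Latin, same script as the target -> one line.
        ["!", ["has", "pmap:script"]],
        localized,
        # Mixed-script name: show only the localized name (one possible choice).
        ["==", ["get", "pmap:script"], "Mixed"],
        localized,
        # Single non-Latin script: localized name on top, local name below.
        ["concat", ["get", f"name:{lang}"], "\n", ["get", "name"]],
    ]


print(json.dumps(text_field_expression("en"), ensure_ascii=False, indent=2))
```

With this expression, a feature that has no `name:en` and a non-Latin name would get an empty first line, since `concat` converts a missing value to an empty string.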

However, if the name tag contains more than one script, like for example in Hong Kong, then we are a bit in a tricky situation.

One option when we have pmap:script = Mixed is to display only the name in the target language. For example, if the target language is English we just show name:en. It would look like this:

Hong Kong

Another option when we have pmap:script = Mixed is to display the target language and the name value. However, it can then happen that the name is duplicated in a label which looks bad. For example {name:en}\n{name} would look like this:

Hong Kong
Hong Kong 香港

Yet another option when we have pmap:script = Mixed would be to ignore the target language altogether and just show the name value. This is what the basemap currently does. For Hong Kong, it would look like this:

Hong Kong 香港

Tiles modification

Question to @nvkelso and @bdon: Do you think any of the above 3 options is good enough for now?

If yes, I can start implementing the frontend styles.

If no, I suggest we do a bit more thinking around how mixed-script name tags can be broken up at tile generation time.

I am leaning towards the second, i.e., breaking up Hong Kong 香港 to Hong Kong and 香港 when we generate the tiles because the map just looks so much better. What do you think?

wipfli commented 1 month ago

I did some Java prototyping for splitting the name tag into segments with different scripts.

Here is the result (1.6 MB): https://github.com/wipfli/multi-script-names/blob/main/list.txt

Overall I am quite happy with this segmentation. The data is place=* from OSM with name containing more than one script, e.g., Latin and Greek would be included, but a Latin-only name would not be included.

We have to deal with some typos coming from confusion between similar looking letters in Cyrillic, Latin, and Greek. Also, sometimes Latin letters are used for numbering purposes so there we should not segment.

Some numbers:

number of segments=count
2=13620
3=2222
4=1
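
For readers who want to experiment, here is a minimal Python sketch of script-based segmentation. It is not the Java prototype mentioned above: it guesses the script from the first word of each character's Unicode name, which is only a heuristic, and it does not handle the look-alike typos or the Latin numbering case. The function names are hypothetical.

```python
import unicodedata
from typing import List, Optional, Tuple


def script_of(ch: str) -> Optional[str]:
    """Heuristic script of a character; None for spaces, digits, punctuation, marks."""
    if not ch.isalpha():
        return None
    words = unicodedata.name(ch, "").split(" ")
    if not words[0]:
        return None
    # Han ideographs are named "CJK UNIFIED IDEOGRAPH-XXXX".
    return "HAN" if words[0] == "CJK" else words[0]


def segment_by_script(name: str) -> List[Tuple[str, str]]:
    """Split a name into (script, text) runs of consecutive same-script letters."""
    segments: List[List[str]] = []
    for ch in name:
        script = script_of(ch)
        if script is None:
            # Neutral characters stay with the current segment; leading ones are dropped.
            if segments:
                segments[-1][1] += ch
        elif segments and segments[-1][0] == script:
            segments[-1][1] += ch
        else:
            segments.append([script, ch])
    # Trim whitespace and "/" separators left at segment boundaries.
    return [(script, text.strip(" /")) for script, text in segments]


print(segment_by_script("Hong Kong 香港"))
```

A name written in a single script comes back as one segment, so the same function can also tell whether a name is mixed-script.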

wipfli commented 1 month ago

Regarding the entries that use 3 scripts, we have

TIFINAGH: 2029
MONGOLIAN: 98
ETHIOPIC: 84

Now do we want to support the languages that use these scripts? Because if we don't, then we can get away with 2 segments for the name tag, otherwise we will need 3.

wipfli commented 1 month ago

I made some tiles with segmented name tags. Here is a demo using MapLibre GL JS v4.5.0 with a style localized to Arabic:

Morocco

https://pub-cf7f11e26ace447db8f7215b61ac0eae.r2.dev/segment.html#map=8.87/33.7469/-7.1911

Note how Arabic is in the top line because it is localized to Arabic. In OSM, I think Arabic is mostly the last entry in Morocco.

Hong Kong

https://pub-cf7f11e26ace447db8f7215b61ac0eae.r2.dev/segment.html#map=9.22/22.3113/114.2289

Athens

https://pub-cf7f11e26ace447db8f7215b61ac0eae.r2.dev/segment.html#map=10.35/37.9577/23.7035

Note how it falls back to name:en if no Arabic name is available.

Cairo

https://pub-cf7f11e26ace447db8f7215b61ac0eae.r2.dev/segment.html#map=10.4/30.0417/31.2211

If the name is Arabic, only show the name.

wipfli commented 4 weeks ago

More demo maps now with a language switch that includes 7 languages are available here: https://pub-cf7f11e26ace447db8f7215b61ac0eae.r2.dev/lang-style/lang.html?script=Devanagari&language=hi#map=6.84/50.777/1.48, code see #275

nvkelso commented 4 weeks ago

Demo looks great, we seem to be really close!

The name segmentation seems to be effective when browsing... though @wipfli could you point to specific examples with screenshots to verify?

> Now do we want to support the languages that use these scripts? Because if we don't then, we can get away with 2 segments for the name tag, otherwise we will need 3.

Given that those triplets are so rare, I think it's OK to ignore them for now?

Some other comments:

I'm curious to see the same technique applied to street labels, but at higher zooms like zoom 15+? (where there is more room for text layout, else many labels would just collide out / not fit). At earlier zooms it'd just be the default name?
At the NE and OSM zoom transition it looks like some NE places have more name localizations than in raw OSM, because NE also has names from Wikidata. For example, https://pub-cf7f11e26ace447db8f7215b61ac0eae.r2.dev/lang-style/lang.html?script=Greek&language=el#map=6.62/38.715/-120.825 around Ukiah, Mendocino, Santa Rosa. Do we at least need to coalesce names across NE and OSM (and really prefer OSM but bring over the min_zoom and the names)? Do we want to do more to bring over even more Wikidata names?

Zoom 6

Zoom 7

wipfli commented 4 weeks ago

I am happy to provide more screenshots with descriptions in #275. Regarding bringing in more name translations, let me open a separate issue.

wipfli commented 1 week ago

The demo is now available at https://pub-cf7f11e26ace447db8f7215b61ac0eae.r2.dev/lang-style/index.html?script=Cyrillic&language=ru#map=9.84/33.5273/-7.4407

nvkelso commented 6 days ago

Demo is so rad, congrats!

Do you know why the 2nd language line is centered under the 1st line, even when the text is protected (anchored) to the left or right? Best practice would be to make sure 1st and 2nd lines are justified to the same edge instead.

wipfli commented 3 days ago

> Do you know why the 2nd language line is centered under the 1st line, even when the text is protected (anchored) to the left or right? Best practice would be to make sure 1st and 2nd lines are justified to the same edge instead.

The anchor position is the same as it currently is on main, i.e., the symbolizer is to the left of the text. There is one interesting situation if the label contains two lines, say first line Latin, second line Cyrillic, but then the place does not have a Latin label. In that case the first line becomes an empty string:

First Line
Second Line

(empty)
Second Line

The bounding box of the text label is the same in both cases. Example below is in Armenia.

wipfli commented 2 days ago

Edit: Removed the demo in favor of the built-in app.

protomaps / basemaps

Language localization #270
