Open wipfli opened 1 month ago
All looks like good assumptions
There may be some complication in name:<language-code> with zh-Hans and zh-Hant. I believe Tilezen had some special logic for this related to one or the other missing, to fill in the zh slot. It seems out of scope to perform any automated conversion between them. @nvkelso any lessons learned from Tilezen here?
Chinese is actually an interesting case, because there are quite a lot of entries in OSM:
Here is a list of all OSM name tag values that use more than one script: LINK (15 MB). It has something like 500k entries.
Update: The assumption that we have a database where the name tag always contains only one script is wrong. For example, names in Morocco often come in 3 scripts: Latin, Arabic, and Tifinagh.
The rules proposed here are roughly implemented in this demo:
> Update: The assumption that we have a database where the name tag always contains only one script is wrong. For example, names in Morocco often come in 3 scripts: Latin, Arabic, and Tifinagh.
This should also be the case in Hong Kong and a few other places due to mapping conventions.
In these situations we could ignore name completely because the parse is unreliable (unless it very consistently breaks on / etc.?)
If I look at Hong Kong in English -> the 2nd label should be Chinese.
If I look at Hong Kong in Chinese -> the 2nd label should be English.
What happens if I look at it in Russian though? Should it show 3 labels?
Hong Kong is an interesting example because there are two languages (English, Chinese) and two scripts (Latin, Han).
Let us assume for a moment that we have a database where up to local names of a city can be stored separately
name_1 = Hong Kong
name_2 = 香港
The ordering of the names has a meaning, maybe number of people speaking the language or administrative/cultural use.
If we had this dataset of listed names, we could do the following rule:
With this rule we would get the following for Hong Kong:
English:
Hong Kong
香港
Chinese:
香港
Hong Kong
Russian:
Гонконг
Hong Kong
香港
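The rule itself did not survive in the text above, so the following is only one reading that reproduces the three Hong Kong outputs: put the target-language name first, then append each stored local name that is not already shown. The function name `stacked_names` and its arguments are hypothetical, and the sketch is in Python purely for illustration.

```python
def stacked_names(target_name, local_names):
    """Return label lines: target-language name first, then local names not yet shown.

    This is an inferred rule (an assumption), chosen because it reproduces
    the English, Chinese, and Russian outputs listed for Hong Kong.
    """
    lines = []
    for candidate in [target_name, *local_names]:
        # Skip missing names and names that would duplicate an earlier line.
        if candidate and candidate not in lines:
            lines.append(candidate)
    return lines


hong_kong = ["Hong Kong", "香港"]  # name_1, name_2 in their stored order

print(stacked_names("Hong Kong", hong_kong))  # localized to English
print(stacked_names("香港", hong_kong))        # localized to Chinese
print(stacked_names("Гонконг", hong_kong))    # localized to Russian
```

Because duplicates are dropped, English and Chinese end up with two lines while Russian gets three, which matches the listing above.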
Special cases for country and region (state) labels:
Display country labels only in the target language
Display state labels only in the target language
Do you mean to say the country and state labels would not be "stacked" by default, and the value of the "single line" label would follow the fallback chain in rule 1, like (modified):
If the target language is not available, follow a language fallback chain. End in the name tag only if the localized values in the fallback chain are unavailable. (Possible variation: only take the 1st element of a /-separated multi-value?)
Or do you mean to say no name would be displayed at all if the localization data is unavailable?
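To make the first interpretation concrete, here is a minimal sketch of a single-line label that walks a language fallback chain and ends in the name tag. The function `resolve_label`, the `first_of_multi_value` flag, and the sample tags are illustrative assumptions, not part of the basemap.

```python
def resolve_label(tags, fallback_chain, first_of_multi_value=False):
    """Return the first available name:<code> from the chain, else the name tag.

    Returns None when nothing is available, which corresponds to the
    "no name displayed at all" reading of the question.
    """
    for code in fallback_chain:
        value = tags.get(f"name:{code}")
        if value:
            return value
    name = tags.get("name")
    if name and first_of_multi_value:
        # Variation: keep only the 1st element of a /-separated multi-value.
        name = name.split("/")[0].strip()
    return name


# Illustrative tags only; real OSM data may differ.
tags = {"name": "Biel/Bienne", "name:de": "Biel"}

print(resolve_label(tags, ["ru", "de"]))  # falls through "ru", finds name:de
print(resolve_label(tags, ["ru", "en"]))  # chain exhausted, ends in name
print(resolve_label(tags, ["ru", "en"], first_of_multi_value=True))
```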
Street labels:
For street labels, are you proposing the labels be stacked or delimited (concatenated)?
Multiple alphabet option languages:
Might be worth adding more details around the languages which can be represented in multiple alphabets. Does each writing system belong in a different fallback chain?
Chinese
For Chinese, Tilezen v1.9 added basic detection of Chinese simplified versus traditional for each name, and localized name key-value pairs and modified the tile output to better annotate that. The raw input data in OSM is sometimes quite messy.
Overall:
On the display side, our recommendations and best practices match and are exceeded by what @wipfli is already suggesting here, nice! The demo matches my expectations :)
Thanks for your questions @nvkelso.
Country labels should appear only in the target language. For example, if the target language is French, then the country labels should be "Allemagne", "Suisse", or "Autriche". So here no stacking is needed. From my experience, country labels have great language coverage, so we should not need a fallback chain at all. My preference would be to define a set of supported languages and then make sure that for each supported language we have a name:<language-code> for the country labels.
State labels are currently used too much in the Protomaps basemap in my opinion. In the US it might make sense to have them on the map because the country is huge and most states are huge too, but in smaller countries the state labels are not needed. I will open a separate issue at some point to propose to only have state/province labels for these countries: US, Canada, Mexico, Brazil, China, India, Australia. For now I propose to ignore the problem of state labels and treat them like city labels or like country labels, we can see which one works better.
Street labels I honestly have not thought much about yet.
Some languages use multiple scripts, for example Kazakh or Uzbek. However, there is very limited coverage in OSM for the different scripts (only around 1k names), so I don't see the value at the moment of adding special logic for these languages. Japanese might have larger coverage and also uses multiple scripts, but so far my impression is that the sample code works quite well in Japan. I want to reach out to some Japanese friends and ask them for input.
@wipfli For the tiles, is any more work needed besides what was already merged in https://github.com/protomaps/basemaps/pull/254 (already in a tagged release)? It seems like the remaining work in this issue is mostly about display business logic, is that right?
Thanks for asking! With the current tiles we have information in the pmap:script tag about the script used in the name tag. Here are some examples:
name = Αθήνα, pmap:script = Greek
name = Zürich, pmap:script is absent and that implies the script is Latin
name = Hong Kong 香港, pmap:script = Mixed
So you see that in the case of Zürich and Athens, where only one script is used in the name tag, we can build everything we want with MapLibre style expressions.
However, if the name tag contains more than one script, as in Hong Kong, then we are in a bit of a tricky situation.
One option when we have pmap:script = Mixed is to display only the name in the target language. For example, if the target language is English we just show name:en. It would look like this:
Hong Kong
Another option when we have pmap:script = Mixed is to display the target language and the name value. However, it can then happen that the name is duplicated in a label, which looks bad. For example, {name:en}\n{name} would look like this:
Hong Kong
Hong Kong 香港
Yet another option when we have pmap:script = Mixed would be to ignore the target language altogether and just show the name value. This is what the basemap currently does. For Hong Kong, it would look like this:
Hong Kong 香港
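The three options for pmap:script = Mixed can be compared side by side with a small sketch. The function `mixed_label` and the option numbering are hypothetical; falling back to name when the target-language name is missing is an added assumption that the thread does not discuss.

```python
def mixed_label(tags, target_code, option):
    """Build the label text for a feature whose name tag mixes scripts.

    option 1: target language only
    option 2: target language, then the raw name value on a second line
    option 3: raw name value only (described above as the current behaviour)
    """
    target = tags.get(f"name:{target_code}")
    name = tags.get("name", "")
    if option == 1:
        return target or name  # assumption: fall back to name if untranslated
    if option == 2:
        return f"{target}\n{name}" if target else name
    if option == 3:
        return name
    raise ValueError(f"unknown option: {option}")


hong_kong = {"name": "Hong Kong 香港", "name:en": "Hong Kong", "pmap:script": "Mixed"}

for option in (1, 2, 3):
    print(f"option {option}:")
    print(mixed_label(hong_kong, "en", option))
```

Option 2 makes the duplication problem visible: "Hong Kong" appears on both lines.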
Question to @nvkelso and @bdon: Do you think any of the above 3 options is good enough for now?
If yes, I can start implementing the frontend styles.
If no, I suggest we do a bit more thinking around how mixed-script name tags can be broken up at tile generation time.
I am leaning towards the second, i.e., breaking up Hong Kong 香港 into Hong Kong and 香港 when we generate the tiles, because the map just looks so much better. What do you think?
I did some Java prototyping for splitting the name tag into segments with different scripts.
Here is the result (1.6 MB): https://github.com/wipfli/multi-script-names/blob/main/list.txt
Overall I am quite happy with this segmentation. The data is place=* from OSM with name containing more than one script, e.g., a name with both Latin and Greek would be included, but a Latin-only name would not be included.
We have to deal with some typos coming from confusion between similar looking letters in Cyrillic, Latin, and Greek. Also, sometimes Latin letters are used for numbering purposes so there we should not segment.
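As a rough illustration of the idea (not the Java prototype itself), the sketch below splits a name wherever the script of consecutive letters changes. It guesses the script from the first word of each character's Unicode name, which is a heuristic assumption; it does not handle the look-alike typos or the Latin-letters-as-numbering cases mentioned above. The helper names `script_of` and `segment_by_script` are hypothetical.

```python
import unicodedata


def script_of(ch):
    """Guess a script from the Unicode character name; None for neutral characters."""
    if not ch.isalpha():
        return None  # spaces, digits, punctuation, combining marks
    name = unicodedata.name(ch, "")
    if not name:
        return None
    first_word = name.split(" ")[0]
    # Han ideographs are named "CJK UNIFIED IDEOGRAPH-...".
    return "Han" if first_word == "CJK" else first_word.capitalize()


def segment_by_script(name):
    """Split a name into (script, text) segments in order of appearance."""
    segments = []  # list of [script, text]
    for ch in name:
        script = script_of(ch)
        if not segments:
            segments.append([script, ch])
        elif script is None or script == segments[-1][0]:
            segments[-1][1] += ch  # neutral characters stay with the current segment
        elif segments[-1][0] is None:
            segments[-1][0] = script  # only neutral characters so far: adopt this script
            segments[-1][1] += ch
        else:
            segments.append([script, ch])
    # Trim separators left over at segment boundaries.
    return [(script, text.strip(" /-")) for script, text in segments]


print(segment_by_script("Hong Kong 香港"))
print(segment_by_script("Zürich"))
```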
Some numbers:
Regarding the entries that use 3 scripts, we have
Now do we want to support the languages that use these scripts? Because if we don't, then we can get away with 2 segments for the name tag; otherwise we will need 3.
I made some tiles with segmented name tags. Here is a demo using MapLibre GL JS v4.5.0 with a style localized to Arabic:
https://pub-cf7f11e26ace447db8f7215b61ac0eae.r2.dev/segment.html#map=8.87/33.7469/-7.1911
Note how Arabic is in the top line because it is localized to Arabic. In OSM, I think Arabic is mostly the last entry in Morocco.
https://pub-cf7f11e26ace447db8f7215b61ac0eae.r2.dev/segment.html#map=9.22/22.3113/114.2289
https://pub-cf7f11e26ace447db8f7215b61ac0eae.r2.dev/segment.html#map=10.35/37.9577/23.7035
Note how it falls back to name:en if no Arabic name is available.
https://pub-cf7f11e26ace447db8f7215b61ac0eae.r2.dev/segment.html#map=10.4/30.0417/31.2211
If the name is Arabic, only show the name.
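The behaviour noted for the Arabic demo can be summarized in a few lines. This is only a sketch of the observations above (target language on top, fallback to name:en, name alone when it is already in the target script); it treats name as a single segment, and the function `localized_label`, the sample tags, and the duplicate check are assumptions rather than the demo's actual style logic.

```python
def localized_label(tags, target_code, target_script):
    """Return label text for one place, localized to target_code."""
    name = tags.get("name", "")
    # An absent pmap:script implies Latin.
    name_script = tags.get("pmap:script", "Latin")
    if name_script == target_script:
        return name  # the name is already in the target script: show it alone
    first = tags.get(f"name:{target_code}") or tags.get("name:en")
    if not first or first == name:
        return name  # assumption: avoid an empty or duplicated first line
    return f"{first}\n{name}"


# Illustrative tags only; real tile attributes may differ.
athens = {"name": "Αθήνα", "pmap:script": "Greek", "name:en": "Athens"}
cairo = {"name": "القاهرة", "pmap:script": "Arabic", "name:en": "Cairo"}

print(localized_label(athens, "ar", "Arabic"))  # no name:ar, so name:en on top
print(localized_label(cairo, "ar", "Arabic"))   # Arabic name shown alone
```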
More demo maps, now with a language switch that includes 7 languages, are available here: https://pub-cf7f11e26ace447db8f7215b61ac0eae.r2.dev/lang-style/lang.html?script=Devanagari&language=hi#map=6.84/50.777/1.48. For the code, see #275.
Demo looks great, we seem to be really close!
The name segmentation seems to be effective when browsing... though @wipfli could you point to specific examples with screenshots to verify?
> Now do we want to support the languages that use these scripts? Because if we don't, then we can get away with 2 segments for the name tag; otherwise we will need 3.
Given that those triplets are so rare, I think it's OK to ignore them for now?
Some other comments:
Zoom 6
Zoom 7
I am happy to provide more screenshots with descriptions in #275. Regarding bringing in more name translations, let me open a separate issue.
Demo is so rad, congrats!
Do you know why the 2nd language line is centered under the 1st line, even when the text is protected (anchored) to the left or right? Best practice would be to make sure 1st and 2nd lines are justified to the same edge instead.
> Do you know why the 2nd language line is centered under the 1st line, even when the text is protected (anchored) to the left or right? Best practice would be to make sure 1st and 2nd lines are justified to the same edge instead.
The anchor position is the same as it currently is on main, i.e., the symbolizer is to the left of the text. There is one interesting situation if the label contains two lines, say first line Latin and second line Cyrillic, but the place does not have a Latin label. In that case the first line becomes an empty string:
First Line
Second Line
(empty)
Second Line
The bounding box of the text label is the same in both cases. Example below is in Armenia.
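The empty-first-line case is easy to reproduce with plain string formatting. The sketch below contrasts a fixed two-line template with a variant that drops empty lines before joining; the second variant is only a possible alternative for discussion, not something the basemap is stated to do, and both function names are hypothetical.

```python
def two_line_label(first, second):
    """Fixed two-line template: an empty first line still occupies a line."""
    return f"{first or ''}\n{second or ''}"


def compact_label(first, second):
    """Alternative: drop empty lines so a single name yields a single-line label."""
    return "\n".join(line for line in (first, second) if line)


print(repr(two_line_label("First Line", "Second Line")))
print(repr(two_line_label("", "Second Line")))  # leading newline is kept
print(repr(compact_label("", "Second Line")))   # no leading newline
```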
Edit: Removed the demo in favor of the built-in app.
Currently, the basemap does not have any language localization capabilities. Country, state, and place names are taken from the OSM name tag and contain information in the local language or languages. For example, the country label of Germany is "Deutschland" whereas the country label for Italy is "Italia". In this issue I would like to propose a scheme for displaying names localized to a specific user language, which should make the basemap more accessible to a wider audience.
Assumptions
Let us make the following assumptions:
name contains the local name(s) using a single script.
name:<language-code> contains the name in a specific language.
<language-code> we know the script(s).
<language-code>s.
Definitions
Proposed Supported Languages
Below is a proposed list of roughly 80 supported languages. The languages are grouped by script and some languages may use more than one script. Note that for some scripts such as Telugu or Khmer we need to create a positioned glyph font first.
The structure is:
Language: <language-code>, number of nodes/ways/relations in name:<language-code> in OSM's taginfo
Latin
Arabic
Cyrillic
Han
Devanagari
One Language Per Script
Proposed Rules
name tag only if the script of the target language and the script of the name tag are the same.
name tag. In this case only show the label in the target language in a single line label.
name tag. In this case show two lines. First the target language, second the name.
Examples
Localized to English
Country example 1:
City example 1:
Country example 2:
City example 2:
Localized to Greek
Country example 1:
City example 1:
Country example 2:
City example 2: