Support both traditional and simplified Chinese name localizations

nvkelso commented 3 years ago

In support of https://github.com/nvkelso/natural-earth-vector/issues/302 and https://github.com/tilezen/vector-datasource/issues/1955, Natural Earth needs to include name localization for both traditional and simplified Chinese. Currently we only have an ambiguous name_zh property.

In the case of Chinese (and some other languages), the "spoken" language has multiple "written" character sets (Traditional and Simplified), and it is spoken and written in multiple countries with different conventions (e.g. zh-CN implies zh-Hans).

When we harvest localized names from Wikidata we need to source Traditional Chinese separately from Simplified Chinese, and put them in two different properties like name_zh-hs or name_zhs (Chinese simplified irrespective of country) and name_zh-ht or name_zht (Chinese traditional irrespective of country). Normally these might be name_zh-hans (Chinese simplified) and name_zh-hant but shapefile's DBF has a 10 character limit on the column names.
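
To make the column-name constraint concrete, here is a minimal Python sketch that checks candidate names against the 10-character DBF limit. The helper dbf_column_name and the SHORT_SUFFIXES mapping are hypothetical, and the short suffixes are just two of the candidates floated above, not a decided naming scheme.

```python
DBF_MAX_FIELD_NAME = 10  # DBF field names are limited to 10 characters


def dbf_column_name(lang_code, overrides=None):
    """Return a 'name_<lang>' column name, or raise if it exceeds the DBF limit."""
    overrides = overrides or {}
    # Use an explicit short suffix when one is given, else swap '-' for '_'.
    suffix = overrides.get(lang_code, lang_code.replace("-", "_"))
    column = "name_" + suffix
    if len(column) > DBF_MAX_FIELD_NAME:
        raise ValueError(
            f"{column!r} is {len(column)} characters; DBF allows {DBF_MAX_FIELD_NAME}"
        )
    return column


# Assumption: short suffixes taken from the candidates in this issue.
SHORT_SUFFIXES = {"zh-hans": "zhs", "zh-hant": "zht"}

columns = [dbf_column_name(code, SHORT_SUFFIXES) for code in ("zh", "zh-hans", "zh-hant")]
print(columns)
```

Without an override, dbf_column_name("zh-hans") raises, because name_zh_hans is 12 characters long.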

There should also be some consideration and compatibility with the point-of-view / worldview being introduced in v5.

nvkelso commented 3 years ago

@ImreSamu can you give me any tips on how to modify the Wikidata names script to harvest separate Chinese simplified versus Chinese traditional characters, please? Usually a language code is simple 1:1 match but this one has multiple variants (more than just these 2!). Thanks :)

ImreSamu commented 3 years ago

IMHO:

complicated: https://en.wikipedia.org/wiki/Chinese_Wikipedia
probably a zh-hans, zh-hant should be added ( I see values in the FREQ report ) so just a simple 1:1

Based on this table: https://www.wikidata.org/wiki/Help:Languages

wikimedia language codes	Language
`zh-hans`	Simplified Chinese (Q13414913)
`zh-hant`	Traditional Chinese (Q18130932)

Simple FREQ stat - on wikidata "locality" related items :

--   wikidata "locality" related items 
--   count  zh* labels
whosonfirst=# 
WITH locality_wdlabels as 
(
 SELECT
   wd.wd_id 
  ,clean_wdlabel( wd.data->'labels'->'en'->>'value')           as name_en             
  ,clean_wdlabel( wd.data->'labels'->'zh'->>'value')           as name_zh          
  ,clean_wdlabel( wd.data->'labels'->'zh-classical'->>'value') as name_zh_classical
  ,clean_wdlabel( wd.data->'labels'->'zh-hans'->>'value')      as name_zh_hans     
  ,clean_wdlabel( wd.data->'labels'->'zh-hant'->>'value')      as name_zh_hant     
  ,clean_wdlabel( wd.data->'labels'->'zh-hk'->>'value')        as name_zh_hk        
  ,clean_wdlabel( wd.data->'labels'->'zh-min-nan'->>'value')   as name_zh_min_nan   
  ,clean_wdlabel( wd.data->'labels'->'zh-yue'->>'value')       as name_yue          
 FROM wd.wd_ok as wd 
 WHERE
  (a_wof_type  @> ARRAY['locality' ,'hasP625'] ) and not iscebuano
)
select 
  count(*) AS _cnt_base_wikidata
 ,count(*) FILTER (WHERE name_en           IS NOT NULL) AS _cnt_name_en          
 ,count(*) FILTER (WHERE name_zh           IS NOT NULL) AS _cnt_name_zh          
 ,count(*) FILTER (WHERE name_zh_classical IS NOT NULL) AS _cnt_name_zh_classical
 ,count(*) FILTER (WHERE name_zh_hans      IS NOT NULL) AS _cnt_name_zh_hans     
 ,count(*) FILTER (WHERE name_zh_hant      IS NOT NULL) AS _cnt_name_zh_hant     
 ,count(*) FILTER (WHERE name_zh_hk        IS NOT NULL) AS _cnt_name_zh_hk       
 ,count(*) FILTER (WHERE name_zh_min_nan   IS NOT NULL) AS _cnt_name_zh_min_nan  
 ,count(*) FILTER (WHERE name_yue          IS NOT NULL) AS _cnt_name_yue                
from locality_wdlabels
;
+-[ RECORD 1 ]-----------+--------+
| _cnt_base_wikidata     | 987949 |
| _cnt_name_en           | 788403 |
| _cnt_name_zh           | 197732 |
| _cnt_name_zh_classical | 0      |
| _cnt_name_zh_hans      | 91037  |
| _cnt_name_zh_hant      | 68299  |
| _cnt_name_zh_hk        | 51997  |
| _cnt_name_zh_min_nan   | 0      |
| _cnt_name_yue          | 0      |
+------------------------+--------+

Normally these might be name_zh-hans (Chinese simplified) and name_zh-hant but shapefile's DBF has a 10 character limit on the column names.

I don't have a good solution, just some brainstorming:

name_zhans ; name_zhant
namezhhans ; namezhhant
name_hans ; name_hant

nvkelso commented 3 years ago

Nice, thanks for the stats and Wikidata tips @ImreSamu!

The related Tilezen PR is https://github.com/tilezen/vector-datasource/pull/1956, which has some logic for how to detect and backfill against the various options. I'll apply similar changes to the Python script in this repo.
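
For illustration only, a backfill step could look something like the Python sketch below. This is not the logic from the Tilezen PR: the FALLBACKS table, the backfill helper, and the fallback order are all assumptions, and falling back to the generic zh label may return text in the other script.

```python
# Hypothetical fallback chains; the order is an assumption, not the PR's logic.
FALLBACKS = {
    "zh-hans": ["zh-hans", "zh-cn", "zh"],
    "zh-hant": ["zh-hant", "zh-tw", "zh-hk", "zh"],
}


def backfill(labels, target):
    """Return the first non-empty label along the target's fallback chain, else None."""
    for code in FALLBACKS[target]:
        value = labels.get(code)
        if value and value.strip():
            return value.strip()
    return None


# Example labels keyed by Wikidata language code.
labels = {"zh": "旧金山", "zh-hant": "舊金山"}
print(backfill(labels, "zh-hant"))
print(backfill(labels, "zh-hans"))
```

Here zh-hans has no label of its own, so the sketch falls through to the generic zh value.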

nvkelso commented 3 years ago

Hey @ImreSamu do you have any tips on how to extend the existing (well, branch) script to include the two new language variants you mentioned?

I tried the following, but it barfs in Python, and using https://query.wikidata.org/ it complains about the - in the new label variants name_zh-hans and name_zh-hant in the SELECT and OPTIONAL sections.

I also tried searching for https://www.wikidata.org/wiki/Q62 which is San Francisco since I know it has all 3 variants but no dice.

        SELECT
            ?e ?i ?r ?population
            ?name_ar
            ?name_bn
            ?name_de
            ?name_el
            ?name_en
            ?name_es
            ?name_fa
            ?name_fr
            ?name_he
            ?name_hi
            ?name_hu
            ?name_id
            ?name_it
            ?name_ja
            ?name_ko
            ?name_nl
            ?name_pl
            ?name_pt
            ?name_ru
            ?name_sv
            ?name_tr
            ?name_uk
            ?name_ur
            ?name_vi
            ?name_zh
            ?name_zh-hans
            ?name_zh-hant
        WHERE {
            {
                SELECT DISTINCT  ?e ?i ?r
                WHERE{
                    VALUES ?i { wd:Q2102493 wd:Q1781    }
                    OPTIONAL{ ?i owl:sameAs ?r. }
                    BIND(COALESCE(?r, ?i) AS ?e).
                }
            }
            SERVICE wikibase:label {bd:serviceParam wikibase:language "en".}
            OPTIONAL{?e wdt:P1082 ?population .}
            OPTIONAL{?e rdfs:label ?name_ar FILTER((LANG(?name_ar))="ar").}
            OPTIONAL{?e rdfs:label ?name_bn FILTER((LANG(?name_bn))="bn").}
            OPTIONAL{?e rdfs:label ?name_de FILTER((LANG(?name_de))="de").}
            OPTIONAL{?e rdfs:label ?name_el FILTER((LANG(?name_el))="el").}
            OPTIONAL{?e rdfs:label ?name_en FILTER((LANG(?name_en))="en").}
            OPTIONAL{?e rdfs:label ?name_es FILTER((LANG(?name_es))="es").}
            OPTIONAL{?e rdfs:label ?name_fa FILTER((LANG(?name_fr))="fa").}
            OPTIONAL{?e rdfs:label ?name_fr FILTER((LANG(?name_fr))="fr").}
            OPTIONAL{?e rdfs:label ?name_he FILTER((LANG(?name_he))="he").}
            OPTIONAL{?e rdfs:label ?name_hi FILTER((LANG(?name_hi))="hi").}
            OPTIONAL{?e rdfs:label ?name_hu FILTER((LANG(?name_hu))="hu").}
            OPTIONAL{?e rdfs:label ?name_id FILTER((LANG(?name_id))="id").}
            OPTIONAL{?e rdfs:label ?name_it FILTER((LANG(?name_it))="it").}
            OPTIONAL{?e rdfs:label ?name_ja FILTER((LANG(?name_ja))="ja").}
            OPTIONAL{?e rdfs:label ?name_ko FILTER((LANG(?name_ko))="ko").}
            OPTIONAL{?e rdfs:label ?name_nl FILTER((LANG(?name_nl))="nl").}
            OPTIONAL{?e rdfs:label ?name_pl FILTER((LANG(?name_pl))="pl").}
            OPTIONAL{?e rdfs:label ?name_pt FILTER((LANG(?name_pt))="pt").}
            OPTIONAL{?e rdfs:label ?name_ru FILTER((LANG(?name_ru))="ru").}
            OPTIONAL{?e rdfs:label ?name_sv FILTER((LANG(?name_sv))="sv").}
            OPTIONAL{?e rdfs:label ?name_tr FILTER((LANG(?name_tr))="tr").}
            OPTIONAL{?e rdfs:label ?name_uk FILTER((LANG(?name_uk))="uk").}
            OPTIONAL{?e rdfs:label ?name_ur FILTER((LANG(?name_ur))="ur").}
            OPTIONAL{?e rdfs:label ?name_vi FILTER((LANG(?name_vi))="vi").}
            OPTIONAL{?e rdfs:label ?name_zh FILTER((LANG(?name_zh))="zh").}
            OPTIONAL{?e rdfs:label ?name_zh-hans FILTER((LANG(?name_zh-hans))="zh-hans").}
            OPTIONAL{?e rdfs:label ?name_zh-hant FILTER((LANG(?name_zh-hant))="zh-hant").}

ImreSamu commented 3 years ago

it complains about the - in the new label variants name_zh-hans and name_zh-hant in the SELECT and OPTIONAL sections.

ouch ... https://stackoverflow.com/questions/11075261/special-characters-in-sparql-variables

IMHO:

try using name_zh_hans variable name in SPARQL
and if the '-' char is important then you should rename the variable name in python ( name_zh_hans -> name_zh-hs ? )
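
A rough Python sketch of that idea, assuming the query text is generated from a list of language codes: the SPARQL variable gets underscores while the FILTER keeps the real hyphenated language tag. The helper names and the OUTPUT_COLUMNS mapping are hypothetical and are not taken from the script in this repo.

```python
# Language codes to harvest; hyphens are not allowed in SPARQL variable names.
LANGS = ["en", "zh", "zh-hans", "zh-hant"]


def sparql_var(lang_code):
    """SPARQL-safe variable name, e.g. 'zh-hans' -> 'name_zh_hans'."""
    return "name_" + lang_code.replace("-", "_")


def optional_clause(lang_code):
    """One OPTIONAL label clause; the language tag itself keeps its hyphen."""
    var = sparql_var(lang_code)
    return f'OPTIONAL{{?e rdfs:label ?{var} FILTER((LANG(?{var}))="{lang_code}").}}'


select_vars = " ".join("?" + sparql_var(code) for code in LANGS)
optional_block = "\n".join(optional_clause(code) for code in LANGS)

# Hypothetical rename back to output column names after the query returns.
OUTPUT_COLUMNS = {"name_zh_hans": "name_zhs", "name_zh_hant": "name_zht"}


def output_column(var):
    """Map a SPARQL variable name to its output column (unchanged if unmapped)."""
    return OUTPUT_COLUMNS.get(var, var)


print(select_vars)
print(optional_block)
```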

minimal SPARQL example

https://w.wiki/3dBR ( Short URL of Wikidata Query Service - with the minimal sparql example )

backup of the minimal SPARQL example :

SELECT
?e ?i ?r ?population
?name_de
?name_en
?name_zh
?name_zh_hans
?name_zh_hant
WHERE {
{
    SELECT DISTINCT  ?e ?i ?r
    WHERE{
        VALUES ?i { wd:Q2102493 wd:Q1781 }
        OPTIONAL{ ?i owl:sameAs ?r. }
        BIND(COALESCE(?r, ?i) AS ?e).
    }
}
SERVICE wikibase:label {bd:serviceParam wikibase:language "en".}
OPTIONAL{?e wdt:P1082 ?population .}
OPTIONAL{?e rdfs:label ?name_de FILTER((LANG(?name_de))="de").}
OPTIONAL{?e rdfs:label ?name_en FILTER((LANG(?name_en))="en").}
OPTIONAL{?e rdfs:label ?name_zh FILTER((LANG(?name_zh))="zh").}
OPTIONAL{?e rdfs:label ?name_zh_hans FILTER((LANG(?name_zh_hans))="zh-hans").}
OPTIONAL{?e rdfs:label ?name_zh_hant FILTER((LANG(?name_zh_hant))="zh-hant").}
}

EDIT:

example with San Francisco / Q62: --> https://w.wiki/3dBu
VALUES ?i { wd:Q62 wd:Q2102493 wd:Q1781 }

nvkelso commented 3 years ago

That works, thanks!

Now to determine when someone has put English text into one of the Chinese values, oy vey.
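
One possible heuristic, sketched in Python: treat a label as Chinese only if it contains at least one Han character and no ASCII letters. The looks_chinese helper is hypothetical and deliberately crude; it ignores the rarer supplementary-plane ideographs and would reject legitimate names that mix Han characters with Latin letters.

```python
import re

# CJK Unified Ideographs, Extension A, and the CJK Compatibility Ideographs block.
HAN_RE = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]")
LATIN_RE = re.compile(r"[A-Za-z]")


def looks_chinese(label):
    """True if the label has at least one Han character and no ASCII letters."""
    if not label:
        return False
    return bool(HAN_RE.search(label)) and not LATIN_RE.search(label)


for candidate in ("旧金山", "舊金山", "San Francisco", "San Francisco 舊金山", ""):
    print(repr(candidate), looks_chinese(candidate))
```

Note that this cannot tell Simplified from Traditional; it only flags values that are plainly not Chinese.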

nvkelso commented 3 years ago

I got this working locally and will push a branch soon with support for Simplified and Traditional Chinese names, thanks @ImreSamu ! I also fixed Italian and unborked the new Farsi so 2x win.

nvkelso commented 3 years ago

This work is reflected in https://github.com/nvkelso/natural-earth-vector/pull/446 (which is too crazy big for a PR)

nvkelso commented 3 years ago

Fixed via https://github.com/nvkelso/natural-earth-vector/pull/446.

nvkelso / natural-earth-vector

Support both traditional and simplified Chinese name localizations #533
