Closed nvkelso closed 3 years ago
@ImreSamu can you give me any tips on how to modify the Wikidata names script to harvest separate Chinese simplified versus Chinese traditional characters, please? Usually a language code is simple 1:1 match but this one has multiple variants (more than just these 2!). Thanks :)
IMHO: zh-hans, zh-hant should be added (I see values in the FREQ report), so just a simple 1:1.
Based on this table: https://www.wikidata.org/wiki/Help:Languages
| Wikimedia language code | Language |
|---|---|
| zh-hans | Simplified Chinese (Q13414913) |
| zh-hant | Traditional Chinese (Q18130932) |
-- wikidata "locality" related items
-- count zh* labels
whosonfirst=#
WITH locality_wdlabels as
(
    SELECT
         wd.wd_id
        ,clean_wdlabel( wd.data->'labels'->'en'->>'value')           as name_en
        ,clean_wdlabel( wd.data->'labels'->'zh'->>'value')           as name_zh
        ,clean_wdlabel( wd.data->'labels'->'zh-classical'->>'value') as name_zh_classical
        ,clean_wdlabel( wd.data->'labels'->'zh-hans'->>'value')      as name_zh_hans
        ,clean_wdlabel( wd.data->'labels'->'zh-hant'->>'value')      as name_zh_hant
        ,clean_wdlabel( wd.data->'labels'->'zh-hk'->>'value')        as name_zh_hk
        ,clean_wdlabel( wd.data->'labels'->'zh-min-nan'->>'value')   as name_zh_min_nan
        ,clean_wdlabel( wd.data->'labels'->'zh-yue'->>'value')       as name_yue
    FROM wd.wd_ok as wd
    WHERE
        (a_wof_type @> ARRAY['locality' ,'hasP625'] ) and not iscebuano
)
select
     count(*)                                              AS _cnt_base_wikidata
    ,count(*) FILTER (WHERE name_en IS NOT NULL)           AS _cnt_name_en
    ,count(*) FILTER (WHERE name_zh IS NOT NULL)           AS _cnt_name_zh
    ,count(*) FILTER (WHERE name_zh_classical IS NOT NULL) AS _cnt_name_zh_classical
    ,count(*) FILTER (WHERE name_zh_hans IS NOT NULL)      AS _cnt_name_zh_hans
    ,count(*) FILTER (WHERE name_zh_hant IS NOT NULL)      AS _cnt_name_zh_hant
    ,count(*) FILTER (WHERE name_zh_hk IS NOT NULL)        AS _cnt_name_zh_hk
    ,count(*) FILTER (WHERE name_zh_min_nan IS NOT NULL)   AS _cnt_name_zh_min_nan
    ,count(*) FILTER (WHERE name_yue IS NOT NULL)          AS _cnt_name_yue
from locality_wdlabels
;
+-[ RECORD 1 ]-----------+--------+
| _cnt_base_wikidata | 987949 |
| _cnt_name_en | 788403 |
| _cnt_name_zh | 197732 |
| _cnt_name_zh_classical | 0 |
| _cnt_name_zh_hans | 91037 |
| _cnt_name_zh_hant | 68299 |
| _cnt_name_zh_hk | 51997 |
| _cnt_name_zh_min_nan | 0 |
| _cnt_name_yue | 0 |
+------------------------+--------+
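To put those counts in perspective, here is a small Python sketch that turns them into coverage shares relative to the base row. The numbers are copied from the psql output above; the `coverage` helper and the dictionary keys are made up for illustration only.

```python
# Counts copied from the psql output above (zero rows omitted).
counts = {
    "base": 987949,
    "en": 788403,
    "zh": 197732,
    "zh-hans": 91037,
    "zh-hant": 68299,
    "zh-hk": 51997,
}


def coverage(counts, base_key="base"):
    """Return {language code: share of base items that carry a label}."""
    base = counts[base_key]
    return {code: n / base for code, n in counts.items() if code != base_key}


shares = coverage(counts)
for code, share in shares.items():
    # Print each share as a percentage with one decimal place.
    print(f"{code:8s} {share:6.1%}")
```

Roughly a fifth of the locality items have a plain zh label, and the zh-hans and zh-hant labels each cover well under half of that, so some fallback to zh is probably still needed.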
Normally these might be name_zh-hans (Chinese simplified) and name_zh-hant but shapefile's DBF has a 10 character limit on the column names.
I don't have a good solution, just some brainstorming:
name_zhans; name_zhant
namezhhans; namezhhant
name_hans; name_hant
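A quick way to sanity-check the candidates against the 10 character DBF limit is a length check like the sketch below. The helper name `check_field_names` is hypothetical and not part of any script in this repo.

```python
DBF_MAX_FIELD_NAME = 10  # DBF column names hold at most 10 characters


def check_field_names(names, limit=DBF_MAX_FIELD_NAME):
    """Split candidate column names into those that fit and those that do not."""
    ok = [n for n in names if len(n) <= limit]
    too_long = [n for n in names if len(n) > limit]
    return ok, too_long


candidates = [
    "name_zh-hans", "name_zh-hant",  # the "natural" names
    "name_zhans", "name_zhant",
    "namezhhans", "namezhhant",
    "name_hans", "name_hant",
]
ok, too_long = check_field_names(candidates)
print("fits:", ok)
print("too long:", too_long)
```

Only the two "natural" names exceed the limit; all of the brainstormed alternatives fit.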
Nice, thanks for the stats and Wikidata tips @ImreSamu!
The related Tilezen PR is https://github.com/tilezen/vector-datasource/pull/1956, which has some logic for how to detect and backfill against the various options. I'll apply similar changes to the Python script in this repo.
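For readers who have not looked at that PR, the general idea of a backfill can be sketched in Python as below. This is not the Tilezen logic: the fallback order, the choice of regional codes, and the function names are all assumptions for illustration.

```python
# ASSUMPTION: a hypothetical fallback order, not copied from the Tilezen PR.
SIMPLIFIED_FALLBACKS = ("zh-hans", "zh-cn", "zh-sg", "zh")
TRADITIONAL_FALLBACKS = ("zh-hant", "zh-tw", "zh-hk", "zh")


def first_label(labels, codes):
    """Return the first non-empty label among codes, or None."""
    for code in codes:
        value = labels.get(code)
        if value:
            return value
    return None


def backfill_chinese(labels):
    """Pick one simplified and one traditional label from Wikidata labels."""
    return {
        "zh-hans": first_label(labels, SIMPLIFIED_FALLBACKS),
        "zh-hant": first_label(labels, TRADITIONAL_FALLBACKS),
    }


# Example: only a generic zh label and a zh-hant label are present.
example = {"zh": "旧金山", "zh-hant": "舊金山"}
result = backfill_chinese(example)
print(result)
```

Falling back to plain zh for both outputs is a judgment call, since a zh label on Wikidata may be written in either script.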
Hey @ImreSamu do you have any tips on how to extend the existing (well, branch) script to include the two new language variants you mentioned?
I tried the following, but it barfs in Python, and at https://query.wikidata.org/ it complains about the - in the new label variants name_zh-hans and name_zh-hant in the SELECT and OPTIONAL sections.
I also tried searching for https://www.wikidata.org/wiki/Q62, which is San Francisco, since I know it has all 3 variants, but no dice.
SELECT
?e ?i ?r ?population
?name_ar
?name_bn
?name_de
?name_el
?name_en
?name_es
?name_fa
?name_fr
?name_he
?name_hi
?name_hu
?name_id
?name_it
?name_ja
?name_ko
?name_nl
?name_pl
?name_pt
?name_ru
?name_sv
?name_tr
?name_uk
?name_ur
?name_vi
?name_zh
?name_zh-hans
?name_zh-hant
WHERE {
{
SELECT DISTINCT ?e ?i ?r
WHERE{
VALUES ?i { wd:Q2102493 wd:Q1781 }
OPTIONAL{ ?i owl:sameAs ?r. }
BIND(COALESCE(?r, ?i) AS ?e).
}
}
SERVICE wikibase:label {bd:serviceParam wikibase:language "en".}
OPTIONAL{?e wdt:P1082 ?population .}
OPTIONAL{?e rdfs:label ?name_ar FILTER((LANG(?name_ar))="ar").}
OPTIONAL{?e rdfs:label ?name_bn FILTER((LANG(?name_bn))="bn").}
OPTIONAL{?e rdfs:label ?name_de FILTER((LANG(?name_de))="de").}
OPTIONAL{?e rdfs:label ?name_el FILTER((LANG(?name_el))="el").}
OPTIONAL{?e rdfs:label ?name_en FILTER((LANG(?name_en))="en").}
OPTIONAL{?e rdfs:label ?name_es FILTER((LANG(?name_es))="es").}
OPTIONAL{?e rdfs:label ?name_fa FILTER((LANG(?name_fa))="fa").}
OPTIONAL{?e rdfs:label ?name_fr FILTER((LANG(?name_fr))="fr").}
OPTIONAL{?e rdfs:label ?name_he FILTER((LANG(?name_he))="he").}
OPTIONAL{?e rdfs:label ?name_hi FILTER((LANG(?name_hi))="hi").}
OPTIONAL{?e rdfs:label ?name_hu FILTER((LANG(?name_hu))="hu").}
OPTIONAL{?e rdfs:label ?name_id FILTER((LANG(?name_id))="id").}
OPTIONAL{?e rdfs:label ?name_it FILTER((LANG(?name_it))="it").}
OPTIONAL{?e rdfs:label ?name_ja FILTER((LANG(?name_ja))="ja").}
OPTIONAL{?e rdfs:label ?name_ko FILTER((LANG(?name_ko))="ko").}
OPTIONAL{?e rdfs:label ?name_nl FILTER((LANG(?name_nl))="nl").}
OPTIONAL{?e rdfs:label ?name_pl FILTER((LANG(?name_pl))="pl").}
OPTIONAL{?e rdfs:label ?name_pt FILTER((LANG(?name_pt))="pt").}
OPTIONAL{?e rdfs:label ?name_ru FILTER((LANG(?name_ru))="ru").}
OPTIONAL{?e rdfs:label ?name_sv FILTER((LANG(?name_sv))="sv").}
OPTIONAL{?e rdfs:label ?name_tr FILTER((LANG(?name_tr))="tr").}
OPTIONAL{?e rdfs:label ?name_uk FILTER((LANG(?name_uk))="uk").}
OPTIONAL{?e rdfs:label ?name_ur FILTER((LANG(?name_ur))="ur").}
OPTIONAL{?e rdfs:label ?name_vi FILTER((LANG(?name_vi))="vi").}
OPTIONAL{?e rdfs:label ?name_zh FILTER((LANG(?name_zh))="zh").}
OPTIONAL{?e rdfs:label ?name_zh-hans FILTER((LANG(?name_zh-hans))="zh-hans").}
OPTIONAL{?e rdfs:label ?name_zh-hant FILTER((LANG(?name_zh-hant))="zh-hant").}
}
it complains about the - in the new label variants name_zh-hans and name_zh-hant in the SELECT and OPTIONAL sections.
ouch ... https://stackoverflow.com/questions/11075261/special-characters-in-sparql-variables
IMHO: use name_zh_hans as the variable name in SPARQL (and then rename name_zh_hans -> name_zh-hs ?)
SELECT
?e ?i ?r ?population
?name_de
?name_en
?name_zh
?name_zh_hans
?name_zh_hant
WHERE {
{
SELECT DISTINCT ?e ?i ?r
WHERE{
VALUES ?i { wd:Q2102493 wd:Q1781 }
OPTIONAL{ ?i owl:sameAs ?r. }
BIND(COALESCE(?r, ?i) AS ?e).
}
}
SERVICE wikibase:label {bd:serviceParam wikibase:language "en".}
OPTIONAL{?e wdt:P1082 ?population .}
OPTIONAL{?e rdfs:label ?name_de FILTER((LANG(?name_de))="de").}
OPTIONAL{?e rdfs:label ?name_en FILTER((LANG(?name_en))="en").}
OPTIONAL{?e rdfs:label ?name_zh FILTER((LANG(?name_zh))="zh").}
OPTIONAL{?e rdfs:label ?name_zh_hans FILTER((LANG(?name_zh_hans))="zh-hans").}
OPTIONAL{?e rdfs:label ?name_zh_hant FILTER((LANG(?name_zh_hant))="zh-hant").}
}
EDIT:
VALUES ?i { wd:Q62 wd:Q2102493 wd:Q1781 }
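Since the SELECT and OPTIONAL lines follow one pattern per language, the query can also be generated from a list of language codes, swapping the hyphen for an underscore only in the variable name. The Python sketch below assumes that approach; `sparql_var` and `build_query` are hypothetical names, not functions from the existing script.

```python
def sparql_var(lang):
    """SPARQL variable names cannot contain '-', so use '_' in its place."""
    return "name_" + lang.replace("-", "_")


def build_query(qids, langs):
    """Build the label query for the given Wikidata ids and language codes."""
    select_lines = []
    optional_lines = []
    for lang in langs:
        var = sparql_var(lang)
        select_lines.append(f"  ?{var}")
        # The language tag keeps its hyphen; only the variable is renamed.
        optional_lines.append(
            f'  OPTIONAL{{?e rdfs:label ?{var} FILTER((LANG(?{var}))="{lang}").}}'
        )
    values = " ".join(f"wd:{q}" for q in qids)
    select_vars = "\n".join(select_lines)
    optionals = "\n".join(optional_lines)
    return f"""SELECT
  ?e ?i ?r ?population
{select_vars}
WHERE {{
  {{
    SELECT DISTINCT ?e ?i ?r
    WHERE{{
      VALUES ?i {{ {values} }}
      OPTIONAL{{ ?i owl:sameAs ?r. }}
      BIND(COALESCE(?r, ?i) AS ?e).
    }}
  }}
  SERVICE wikibase:label {{bd:serviceParam wikibase:language "en".}}
  OPTIONAL{{?e wdt:P1082 ?population .}}
{optionals}
}}"""


query = build_query(["Q62", "Q2102493", "Q1781"], ["de", "en", "zh", "zh-hans", "zh-hant"])
print(query)
```

Generating the lines also keeps the variable inside LANG() in step with the variable being bound, which is easy to get wrong when copying lines by hand.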
That works, thanks!
Now to determine when someone has put in English text into one of the Chinese values, oy vey.
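One rough way to flag those cases is to check whether a label contains any Han characters at all. The Python sketch below uses only the standard library; the function names are hypothetical, and the check cannot tell Simplified from Traditional (or Chinese from Japanese kanji), so it only catches labels with no Han characters.

```python
import unicodedata

# Unicode character names used for Han ideographs.
HAN_NAME_PREFIXES = ("CJK UNIFIED IDEOGRAPH", "CJK COMPATIBILITY IDEOGRAPH")


def is_han(ch):
    """True if the character is a CJK ideograph according to its Unicode name."""
    return unicodedata.name(ch, "").startswith(HAN_NAME_PREFIXES)


def looks_non_chinese(label):
    """True if the label has letters but none of them are Han characters."""
    letters = [ch for ch in label if ch.isalpha()]
    return bool(letters) and not any(is_han(ch) for ch in letters)


for label in ["San Francisco", "旧金山", "舊金山", ""]:
    print(repr(label), looks_non_chinese(label))
```

Labels flagged this way could be dropped before the value is written to one of the Chinese name columns.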
I got this working locally and will push a branch soon with support for Simplified and Traditional Chinese names, thanks @ImreSamu ! I also fixed Italian and unborked the new Farsi so 2x win.
This work is reflected in https://github.com/nvkelso/natural-earth-vector/pull/446 (which is too crazy big for a PR)
In support of https://github.com/nvkelso/natural-earth-vector/issues/302 and https://github.com/tilezen/vector-datasource/issues/1955, Natural Earth needs to include name localization for both traditional and simplified Chinese. Now we just have an ambiguous name_zh property.
In the case of Chinese (and some other languages), the "spoken" language has multiple "written" character sets (Traditional and Simplified) and is spoken and written in multiple countries using different configs (eg zh-CN implies zh-Hans).
When we harvest localized names from Wikidata we need to source Traditional Chinese separately from Simplified Chinese, and put them in two different properties like name_zh-hs or name_zhs (Chinese simplified irrespective of country) and name_zh-ht or name_zht (Chinese traditional irrespective of country). Normally these might be name_zh-hans (Chinese simplified) and name_zh-hant (Chinese traditional), but shapefile's DBF has a 10 character limit on the column names.
There should also be some consideration and compatibility with the point-of-view / worldview being introduced in v5.