osmus / tileservice

Central repo for the OSM US vector tile service
Creative Commons Zero v1.0 Universal

Add names in indigenous languages of the United States #14

Closed by 1ec5 3 months ago

1ec5 commented 3 months ago

We include a lot of languages in the vector tiles, but many indigenous languages of the United States are missing:

https://github.com/osmus/tileservice/blob/edcdedfe078b5e25560e863b96687e3b72a51bd1/renderer/render_once.sh#L51

Here’s a list I cobbled together of extant indigenous languages and dialects that are spoken in the U.S., had more than 500 speakers as of 2008, and have any coverage at all in OSM:

Language ISO 639
Yupik ypk
O’odham ood
Western Apache apw
Mescalero-Chiricahua apm
Jicarilla apj
Eastern Keres kee
Western Keres kjq
Zuni zun
Ottawa otw
Hopi hop
Iñupiaq ik
Northwest Alaska Iñupiatun esk
Tewa tew
Chickasaw cic
Muscogee mus
Alabama akz
Koasati cku
Crow cro
Shoshoni shh
Southern Tiwa tix
Taos twf
Jemez tow
Blackfoot bla
Umatilla uma
Yakima yak
Northern Paiute pao
Mono mnr
Colorado River Numic ute
Ahtna aht
Denaʼina tfn
Deg Xinag ing
Holikachuk hoi
Koyukon koy
Upper Kuskokwim kuu
Tanacross tcb
Upper Tanana tau
Gwichʼin gwi
Hän haa
Hupa hup
Seneca see
Kiowa kio
Aleut ale
Salishan sal
Lushootseed lut
North Straits Salish str
Cowlitz cow
Okanagan oka
Flathead fla
Fox and Sauk sac
Kickapoo kic
Arapaho arp
Tlingit tli
Central Siberian Yupik ess
Maliseet-Passamaquoddy pqm
Nez Perce nez
Hidatsa hid
Karuk kyh
Oneida one

quincylvania commented 3 months ago

@1ec5 Thanks! I'll make a little script to add these in so we don't have to do it by hand.

Do you happen to know how much performance cost there is for supporting a low-coverage language? I'm inclined to support as many as we can, no matter how few speakers, assuming that there's no significant penalty. People will be more likely to add language data if they can see it on a map.

quincylvania commented 3 months ago

Here's the script if anyone is interested.

let langs = "ab,ace,af,als,am,an,ar,arz,as,ast,az,az-Arab,az-cyr,azb,ba,bar,bat-smg,be,be-tarask,ber,bg,bm,bn,bo,bpy,br,bs,bxr,ca,cdo,ce,ceb,cho,chr,chy,ckb,co,cr,crh,crh-cyr,crk,cs,csb,cv,cy,da,dak,de,dsb,dv,dz,ee,egl,el,en,eo,es,et,eu,fa,fi,fil,fit,fo,fr,frr,full,fur,fy,ga,gag,gan,gcf,gd,gl,gn,gr,grc,gsw,gu,gv,ha,hak,hak-HJ,haw,he,hi,hif,hr,hsb,ht,hu,hur,hy,ia,id,ie,ilo,int,io,is,it,iu,ja,ja_kana,ja_rm,ja-Hira,ja-Latn,jv,ka,kab,kbd,ki,kk,kk-Arab,kl,km,kn,ko,ko-Hani,ko-Latn,krc,krl,ks,ku,kv,kw,ky,la,lb,left,lez,li,lij,lld,lmo,ln,lo,lrc,lt,lv,lzh,md,mdf,mez,mg,mhr,mi,mia,mk,ml,mn,mo,moh,mr,mrj,ms,ms-Arab,mt,mwl,my,myv,mzn,nah,nan,nan-HJ,nan-POJ,nan-TL,nds,ne,nl,nn,no,nov,nv,oc,oj,old,or,os,ota,pa,pam,pcd,pfl,pl,pms,pnb,pot,ps,pt,pt-BR,pt-PT,qu,right,rm,ro,ru,rue,rw,sah,sat,sc,scn,sco,sd,se,sh,si,sju,sk,sl,sma,smj,so,sq,sr,sr-Latn,su,sv,sw,syc,szl,ta,te,TEC,tg,th,th-Latn,ti,tk,tl,tr,tt,tt-lat,udm,ug,uk,ur,uz,uz-Arab,uz-cyr,uz-Cyrl,uz-Latn,vec,vi,vls,vo,wa,war,win,wiy,wo,wuu,xmf,yi,yo,yue,yue-Hant,yue-Latn,za,zgh,zh,zh_pinyin,zh_zhuyin,zh-Hans,zh-Hant,zh-Latn-pinyin,zu,zza";
// Indigenous-language codes proposed in this issue.
let langs2 = "ypk,ood,apw,apm,apj,kee,kjq,zun,otw,hop,ik,esk,tew,cic,mus,akz,cku,cro,shh,tix,twf,tow,bla,uma,yak,pao,mnr,ute,aht,tfn,ing,hoi,koy,kuu,tcb,tau,gwi,haa,hup,see,kio,ale,sal,lut,str,cow,oka,fla,sac,kic,arp,tli,ess,pqm,nez,hid,kyh,one";

// Merge both lists and sort case-insensitively, so mixed-case codes sort alongside lowercase ones.
console.log(langs.split(',').concat(langs2.split(',')).sort(function (a, b) {
    return a.toLowerCase().localeCompare(b.toLowerCase());
}).join(","));
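One thing the script doesn't guard against is a code that appears in both lists, since `concat` keeps duplicates. Here is a minimal sketch of a merge that also deduplicates, assuming the same comma-separated input format. The helper name `mergeLanguageLists` and the short sample lists are made up for illustration and are not part of the repo:

```javascript
// Hypothetical helper: merge comma-separated code lists, drop exact
// duplicates and empty entries, and sort case-insensitively.
function mergeLanguageLists(...lists) {
    const codes = lists
        .flatMap((list) => list.split(","))
        .map((code) => code.trim())
        .filter((code) => code.length > 0);
    // A Set keeps the first occurrence of each code and discards repeats.
    return [...new Set(codes)]
        .sort((a, b) => a.toLowerCase().localeCompare(b.toLowerCase()))
        .join(",");
}

// Sample input only, not the real lists.
const merged = mergeLanguageLists("en,zun,haw", "hop,zun,ik");
console.log(merged);
```

Deduplication here is exact-match only, so variants that differ in case or spelling (such as `uz-cyr` and `uz-Cyrl`) are treated as distinct codes.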

iandees commented 3 months ago

> Do you happen to know how much performance cost there is for supporting a low-coverage language?

If a tile has no features with a name in a particular language, then that language's attribute shouldn't get written to the tile at all, so it should be "free" to add as many languages as you want. As soon as one feature in a tile has data for one of these languages, the tile gains a few bytes for the attribute name and a few bytes for the value.

It might be a good idea to double-check that the code used to generate the tiles is actually skipping blanks/nulls.
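To make the "skipping blanks/nulls" behavior concrete, here is a rough sketch of the kind of check being described. It is not the tile generator's actual code. The function name `nameAttributes`, the `name:<lang>` attribute naming, and the sample tags are all assumptions for illustration:

```javascript
// Hypothetical sketch: build a tile feature's attributes from OSM tags,
// keeping only name:<lang> values that are actually set.
function nameAttributes(tags, languages) {
    const attrs = {};
    for (const lang of languages) {
        const value = tags[`name:${lang}`];
        // Skip missing, null, or blank values so they add nothing to the tile.
        if (typeof value === "string" && value.trim() !== "") {
            attrs[`name:${lang}`] = value;
        }
    }
    return attrs;
}

// Sample tags for illustration only.
const attrs = nameAttributes(
    { "name": "Example", "name:en": "Example", "name:hop": "", "name:zun": null },
    ["en", "hop", "zun", "ik"]
);
console.log(attrs);
```

With a check like this, a language that no feature in the tile is tagged with never produces an attribute, which is why a long language list costs nothing for most tiles.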

1ec5 commented 3 months ago

> It might be a good idea to double-check that the code used to generate the tiles is actually skipping blanks/nulls.

Yes, Planetiler skips unset name fields; this is what makes OSM Americana’s bilingual labeling work. So realistically, the impact on tile size should be negligible.

Interestingly, there doesn’t seem to be a very good correspondence between a language’s number of speakers and its number of tagged features for indigenous languages in general. I guess it would be easier to generate the language list automatically from whatever is currently tagged in OSM. But at the same time, invalid language codes should be filtered out, like TEC and left, which are currently in the list somehow.
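As a rough illustration of that filtering step, the sketch below applies a purely structural check to candidate codes. It is an assumption about what "valid" could mean, not a lookup against ISO 639 or the IANA language subtag registry. The names `LANGUAGE_CODE_SHAPE` and `looksLikeLanguageCode` are hypothetical:

```javascript
// Rough structural check, NOT a registry lookup: a lowercase 2-3 letter
// primary code, optionally followed by "-" or "_" separated alphanumeric
// subtags (e.g. "zh-Hans", "ja_kana").
const LANGUAGE_CODE_SHAPE = /^[a-z]{2,3}(?:[-_][A-Za-z0-9]+)*$/;

function looksLikeLanguageCode(code) {
    return LANGUAGE_CODE_SHAPE.test(code);
}

// A few entries taken from the existing list, for illustration.
const sample = ["en", "zun", "zh-Hans", "ja_kana", "TEC", "left", "right", "full"];
const kept = sample.filter(looksLikeLanguageCode);
const dropped = sample.filter((code) => !looksLikeLanguageCode(code));
console.log("kept:", kept.join(","));
console.log("dropped:", dropped.join(","));
```

This catches entries like `TEC` and `left`, but any lowercase two- or three-letter suffix still passes whether or not it is a registered language code, so a real filter would need to consult an actual code list.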