Add names in indigenous languages of the United States

1ec5 commented 3 months ago

We include a lot of languages in the vector tiles, but many indigenous languages of the United States are missing:

https://github.com/osmus/tileservice/blob/edcdedfe078b5e25560e863b96687e3b72a51bd1/renderer/render_once.sh#L51

Here’s a list I cobbled together of extant indigenous languages and dialects spoken in the U.S. that had more than 500 speakers as of 2008 and that have at least some coverage in OSM:

| Language | ISO 639 |
|---|---|
| Yupik | `ypk` |
| O’odham | `ood` |
| Western Apache | `apw` |
| Mescalero-Chiricahua | `apm` |
| Jicarilla | `apj` |
| Eastern Keres | `kee` |
| Western Keres | `kjq` |
| Zuni | `zun` |
| Ottawa | `otw` |
| Hopi | `hop` |
| Iñupiaq | `ik` |
| Northwest Alaska Iñupiatun | `esk` |
| Tewa | `tew` |
| Chickasaw | `cic` |
| Muscogee | `mus` |
| Alabama | `akz` |
| Koasati | `cku` |
| Crow | `cro` |
| Shoshoni | `shh` |
| Southern Tiwa | `tix` |
| Taos | `twf` |
| Jemez | `tow` |
| Blackfoot | `bla` |
| Umatilla | `uma` |
| Yakima | `yak` |
| Northern Paiute | `pao` |
| Mono | `mnr` |
| Colorado River Numic | `ute` |
| Ahtna | `aht` |
| Denaʼina | `tfn` |
| Deg Xinag | `ing` |
| Holikachuk | `hoi` |
| Koyukon | `koy` |
| Upper Kuskokwim | `kuu` |
| Tanacross | `tcb` |
| Upper Tanana | `tau` |
| Gwichʼin | `gwi` |
| Hän | `haa` |
| Hupa | `hup` |
| Seneca | `see` |
| Kiowa | `kio` |
| Aleut | `ale` |
| Salishan | `sal` |
| Lushootseed | `lut` |
| North Straits Salish | `str` |
| Cowlitz | `cow` |
| Okanagan | `oka` |
| Flathead | `fla` |
| Fox and Sauk | `sac` |
| Kickapoo | `kic` |
| Arapaho | `arp` |
| Tlingit | `tli` |
| Central Siberian Yupik | `ess` |
| Maliseet-Passamaquoddy | `pqm` |
| Nez Perce | `nez` |
| Hidatsa | `hid` |
| Karuk | `kyh` |
| Oneida | `one` |

quincylvania commented 3 months ago

@1ec5 Thanks! I'll make a little script to add these in so we don't have to do it by hand.

Do you happen to know how much performance cost there is for supporting a low-coverage language? I'm inclined to support as many as we can, no matter how few speakers, assuming that there's no significant penalty. People will be more likely to add language data if they can see it on a map.

quincylvania commented 3 months ago

Here's the script if anyone is interested.

let langs = "ab,ace,af,als,am,an,ar,arz,as,ast,az,az-Arab,az-cyr,azb,ba,bar,bat-smg,be,be-tarask,ber,bg,bm,bn,bo,bpy,br,bs,bxr,ca,cdo,ce,ceb,cho,chr,chy,ckb,co,cr,crh,crh-cyr,crk,cs,csb,cv,cy,da,dak,de,dsb,dv,dz,ee,egl,el,en,eo,es,et,eu,fa,fi,fil,fit,fo,fr,frr,full,fur,fy,ga,gag,gan,gcf,gd,gl,gn,gr,grc,gsw,gu,gv,ha,hak,hak-HJ,haw,he,hi,hif,hr,hsb,ht,hu,hur,hy,ia,id,ie,ilo,int,io,is,it,iu,ja,ja_kana,ja_rm,ja-Hira,ja-Latn,jv,ka,kab,kbd,ki,kk,kk-Arab,kl,km,kn,ko,ko-Hani,ko-Latn,krc,krl,ks,ku,kv,kw,ky,la,lb,left,lez,li,lij,lld,lmo,ln,lo,lrc,lt,lv,lzh,md,mdf,mez,mg,mhr,mi,mia,mk,ml,mn,mo,moh,mr,mrj,ms,ms-Arab,mt,mwl,my,myv,mzn,nah,nan,nan-HJ,nan-POJ,nan-TL,nds,ne,nl,nn,no,nov,nv,oc,oj,old,or,os,ota,pa,pam,pcd,pfl,pl,pms,pnb,pot,ps,pt,pt-BR,pt-PT,qu,right,rm,ro,ru,rue,rw,sah,sat,sc,scn,sco,sd,se,sh,si,sju,sk,sl,sma,smj,so,sq,sr,sr-Latn,su,sv,sw,syc,szl,ta,te,TEC,tg,th,th-Latn,ti,tk,tl,tr,tt,tt-lat,udm,ug,uk,ur,uz,uz-Arab,uz-cyr,uz-Cyrl,uz-Latn,vec,vi,vls,vo,wa,war,win,wiy,wo,wuu,xmf,yi,yo,yue,yue-Hant,yue-Latn,za,zgh,zh,zh_pinyin,zh_zhuyin,zh-Hans,zh-Hant,zh-Latn-pinyin,zu,zza";
let langs2 = "ypk,ood,apw,apm,apj,kee,kjq,zun,otw,hop,ik,esk,tew,cic,mus,akz,cku,cro,shh,tix,twf,tow,bla,uma,yak,pao,mnr,ute,aht,tfn,ing,hoi,koy,kuu,tcb,tau,gwi,haa,hup,see,kio,ale,sal,lut,str,cow,oka,fla,sac,kic,arp,tli,ess,pqm,nez,hid,kyh,one";

// Merge the existing and new codes, sort them case-insensitively,
// and print the result as a single comma-separated list.
console.log(langs.split(',').concat(langs2.split(',')).sort(function (a, b) {
    return a.toLowerCase().localeCompare(b.toLowerCase());
}).join(","));
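
If the two lists ever overlap, plain concatenation would emit a code twice. Here is a minimal sketch of a variant that also drops exact duplicates and empty entries. `mergeLanguageCodes` is a hypothetical helper name and the sample codes are illustrative; nothing here comes from the tileservice repo.

```javascript
// Hypothetical helper (not part of tileservice): merge two comma-separated
// code lists, drop empty entries and exact duplicates, then sort
// case-insensitively, as the script above does.
function mergeLanguageCodes(existing, additions) {
    const codes = existing.split(",").concat(additions.split(","))
        .map((code) => code.trim())
        .filter((code) => code.length > 0);
    return Array.from(new Set(codes))
        .sort((a, b) => a.toLowerCase().localeCompare(b.toLowerCase()))
        .join(",");
}

// "haw" appears in both inputs but only once in the output.
console.log(mergeLanguageCodes("en,haw,nv", "ypk,haw,ood"));
```

Deduplication here is case-sensitive, so entries that differ only in case would both survive and may need a manual look.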

iandees commented 3 months ago

> Do you happen to know how much performance cost there is for supporting a low-coverage language?

If a tile has no features with a name in a particular language, then that language's attribute shouldn't get written to the tile at all, so it should be "free" to add as many languages as you want. As soon as one feature in a tile has data for one of these languages, the tile gains a few bytes for the attribute name and a few bytes for the value.

It might be a good idea to double-check that the code used to generate the tiles is actually skipping blanks/nulls.
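
As a rough illustration of the behavior to check for, here is a minimal sketch of a tile writer that emits a name attribute only when the tag is set. `buildNameAttributes` and the plain `tags` object are hypothetical stand-ins, not the API of the actual tile generator.

```javascript
// Hypothetical sketch: build a feature's per-language name attributes,
// skipping any language whose tag is missing, null, or blank.
function buildNameAttributes(tags, languages) {
    const attributes = {};
    for (const lang of languages) {
        const key = `name:${lang}`;
        const value = tags[key];
        if (typeof value === "string" && value.trim() !== "") {
            attributes[key] = value;
        }
    }
    return attributes;
}

// Illustrative feature: only the Navajo name is set, so only it is emitted.
const sampleTags = { name: "Window Rock", "name:nv": "Tségháhoodzání", "name:hop": "" };
console.log(buildNameAttributes(sampleTags, ["nv", "hop", "ypk"]));
```

With a filter like this, a language with no tagged features in a tile adds nothing to that tile, however long the language list grows.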

1ec5 commented 3 months ago

> It might be a good idea to double-check that the code used to generate the tiles is actually skipping blanks/nulls.

Yes, Planetiler skips unset name fields; this is what makes OSM Americana’s bilingual labeling work. So realistically, the impact on tile size should be negligible.

Interestingly, there doesn’t seem to be a very good correspondence between the number of speakers of an indigenous language and the number of features tagged in it. I guess it would be easier to generate the language list automatically from whatever is currently tagged in OSM. But at the same time, invalid language codes should be filtered out, like `TEC` and `left`, which are currently in the list somehow.
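
One way to catch entries like `TEC` and `left` is a simple shape check on each code. The sketch below assumes a two- or three-letter lowercase primary subtag, optionally followed by hyphen-separated subtags. The regular expression and `partitionLanguageCodes` are hypothetical, and this is not a lookup against any registry.

```javascript
// Assumed shape: 2-3 lowercase letters, then optional hyphen-separated
// subtags of 2-8 alphanumerics. This checks form only, not registration.
const LANGUAGE_CODE_SHAPE = /^[a-z]{2,3}(-[A-Za-z0-9]{2,8})*$/;

// Hypothetical helper: split a list of codes into those that pass the
// shape check and those that need manual review.
function partitionLanguageCodes(codes) {
    const accepted = [];
    const rejected = [];
    for (const code of codes) {
        (LANGUAGE_CODE_SHAPE.test(code) ? accepted : rejected).push(code);
    }
    return { accepted, rejected };
}

const sample = partitionLanguageCodes(["en", "TEC", "left", "haw", "zh-Hant", "ja_kana"]);
console.log(sample.accepted); // [ 'en', 'haw', 'zh-Hant' ]
console.log(sample.rejected); // [ 'TEC', 'left', 'ja_kana' ]
```

A check like this also rejects underscore variants such as `ja_kana`, which are in the current list, and it cannot tell whether a well-formed code is actually assigned to a language. Rejected codes are probably better treated as candidates for review than dropped automatically.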

Add names in indigenous languages of the United States #14