lexibank / lexibank-analysed

Study on lexibank data (presenting the lexibank dataset).
Creative Commons Attribution 4.0 International

Wrong Language Families or Glottocodes in CLDF Datasets #47

Closed (LinguList closed this issue 8 months ago)

LinguList commented 1 year ago

Comparing the family we provide with the one Glottolog gives reveals some interesting errors:

| ID | Glottocode | Family (Glottolog) | Family (Lexibank) |
| --- | --- | --- | --- |
| castrozhuang-XinchengBeigeng | east2326 | Uralic | Tai-Kadai |
| castrozhuang-XinchengGuosui | east2326 | Uralic | Tai-Kadai |
| chindialectsurvey-RawngtuWeilong-A | wela1234 | Bookkeeping | Sino-Tibetan |
| chindialectsurvey-RawngtuRamtim-A | wela1234 | Bookkeeping | Sino-Tibetan |
| constenlachibchan-Arhuaco | arhu1242 | Chibchan | Chibcha |
| constenlachibchan-Bari | bari1297 | Chibchan | Chibcha |
| constenlachibchan-Boruca | boru1252 | Chibchan | Chibcha |
| constenlachibchan-Bribri | brib1243 | Chibchan | Chibcha |
| constenlachibchan-Buglere | bugl1243 | Chibchan | Chibcha |
| constenlachibchan-Cabecar | cabe1245 | Chibchan | Chibcha |
| constenlachibchan-CentralTunebo | cent2150 | Chibchan | Chibcha |
| constenlachibchan-Chibcha | chib1270 | Chibchan | Chibcha |
| constenlachibchan-Chimila | chim1309 | Chibchan | Chibcha |
| constenlachibchan-Cogui | cogu1240 | Chibchan | Chibcha |
| constenlachibchan-Malayo | mala1522 | Chibchan | Chibcha |
| constenlachibchan-MalekuJaika | male1297 | Chibchan | Chibcha |
| constenlachibchan-Ngabere | ngab1239 | Chibchan | Chibcha |
| constenlachibchan-Pech | pech1241 | Chibchan | Chibcha |
| constenlachibchan-Rama | rama1270 | Chibchan | Chibcha |
| constenlachibchan-SanBlasKuna | sanb1242 | Chibchan | Chibcha |
| constenlachibchan-Teribe | teri1250 | Chibchan | Chibcha |
| constenlachibchan-Cacaopera | caca1247 | Misumalpan | Misumalpam |
| constenlachibchan-Mayangna | maya1285 | Misumalpan | Misumalpam |
| constenlachibchan-Ulwa | ulwa1239 | Misumalpan | Misumalpam |
| constenlachibchan-Miskito | misk1235 | Misumalpan | Misumalpam |
| hantganbangime-Fulfulde | maas1239 | Atlantic-Congo | Atlantic |
| hubercolumbian-Jupda | hupd1244 | Naduhup | Nadahup |
| hubercolumbian-Saliba | sali1298 | Saliban | Jodi-Saliban |
| ivanisuansu-Suansu | suan1234 | Sino-Tibetan | Isolate |
| johanssonsoundsymbolic-Aguaruna | agua1253 | Chicham | Jivaroan |
| johanssonsoundsymbolic-Ahtena | ahte1237 | Athabaskan-Eyak-Tlingit | Athapaskan-Eyak-Tlingit |
| johanssonsoundsymbolic-Ainu | ainu1240 | Ainu | Ainu (Japan) |
| johanssonsoundsymbolic-Aymara | cent2142 | Aymaran | Aymara |
| johanssonsoundsymbolic-Bambassi | bamb1262 | Blue Nile Mao | Mao |
| johanssonsoundsymbolic-Cavinena | cavi1250 | Pano-Tacanan | Tacanan |
| johanssonsoundsymbolic-Guahibo | guah1255 | Guahiboan | Guahibo |
| johanssonsoundsymbolic-Hupde | hupd1244 | Naduhup | Nadahup |
| johanssonsoundsymbolic-Kamula | kamu1260 | Kamula-Elevala | Kamula |
| johanssonsoundsymbolic-Kunimaipa | kuni1267 | Kunimaipan | Goilalan |
| johanssonsoundsymbolic-Lencasalvador | lenc1244 | Bookkeeping | Lencan |
| johanssonsoundsymbolic-Limilngan | nucl1327 | Limilngan-Wulna | Limilngan |
| johanssonsoundsymbolic-Mairasi | nucl1594 | Mairasic | Mairasi |
| johanssonsoundsymbolic-Mongolian | halh1238 | Mongolic-Khitan | Mongolic |
| johanssonsoundsymbolic-Moro | moro1285 | Heibanic | Heiban |
| johanssonsoundsymbolic-Nimboran | nucl1633 | Nimboranic | Nimboran |
| johanssonsoundsymbolic-Ninam | nina1238 | Yanomamic | Yanomam |
| johanssonsoundsymbolic-PanoanKatukina | pano1254 | Pano-Tacanan | Panoan |
| johanssonsoundsymbolic-Sanapanaangaite | sana1281 | Bookkeeping | Lengua-Mascoy |
| johanssonsoundsymbolic-Sentani | nucl1632 | Sentanic | Sentani |
| johanssonsoundsymbolic-Shatt | shat1244 | Dajuic | Daju |
| johanssonsoundsymbolic-Toaripi | toar1246 | Eleman | Nuclear Eleman |
| johanssonsoundsymbolic-Warembori | ware1253 | Austronesian | Austronesian (Malayo-Polynesian: Central-Eastern Malayo-Polynesian: Eastern Malayo-Polynesian: South Halmahera-West New Guinea) |
| johanssonsoundsymbolic-Yawa | nucl1454 | Yawa-Saweru | Yawa |
| joophonosemantic-Kabardian | kaba1278 | Abkhaz-Adyge | N Caucasian |
| joophonosemantic-Wayuu | wayu1243 | Arawakan | Maipurean |
| joophonosemantic-ShipiboConibo | ship1254 | Pano-Tacanan | Panoan |
| northeuralex-khk | halh1238 | Mongolic-Khitan | Mongolic |
| northeuralex-bua | buri1258 | Mongolic-Khitan | Mongolic |
| northeuralex-xal | kalm1243 | Mongolic-Khitan | Mongolic |

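
The comparison behind this list could be sketched roughly as follows. This is only an illustration, not the script that produced the table: the function name `family_mismatches`, the column names `ID`, `Glottocode` and `Family`, and the hand-written `GLOTTOLOG_FAMILY` lookup are all assumptions. In practice the lookup would have to be built from a Glottolog release.

```python
import csv
import io


def family_mismatches(languages, glottolog_family):
    """Return (ID, Glottocode, Glottolog family, dataset family) for each mismatch.

    languages: iterable of dicts with the (assumed) keys "ID", "Glottocode", "Family".
    glottolog_family: dict mapping a Glottocode to Glottolog's family name.
    """
    mismatches = []
    for row in languages:
        expected = glottolog_family.get(row["Glottocode"])
        # Glottocodes missing from the lookup are skipped rather than guessed.
        if expected is not None and expected != row["Family"]:
            mismatches.append((row["ID"], row["Glottocode"], expected, row["Family"]))
    return mismatches


# The first two rows come from the list in this issue; the third is a
# hypothetical consistent entry, added only to show that it is not reported.
SAMPLE = """ID,Glottocode,Family
constenlachibchan-Bari,bari1297,Chibcha
northeuralex-khk,halh1238,Mongolic
example-Consistent,bari1297,Chibchan
"""

# Hand-written stand-in for a lookup derived from Glottolog (assumption).
GLOTTOLOG_FAMILY = {"bari1297": "Chibchan", "halh1238": "Mongolic-Khitan"}

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for mismatch in family_mismatches(rows, GLOTTOLOG_FAMILY):
    print("\t".join(mismatch))
```

An exact string comparison also flags pure naming differences (such as "Chibchan" versus "Chibcha") alongside genuinely wrong Glottocodes, so each hit still needs a manual look.
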
LinguList commented 1 year ago

This list should be handled by us.

FredericBlum commented 9 months ago

Does this issue supersede #46? I'd assign this to @MuffinLinwist for early March, once some other tasks are done.

FredericBlum commented 9 months ago

@MuffinLinwist We can now start working on this issue.

  1. Please create a fresh virtual environment with a clean install of the most recent cldfbench version
  2. Go through all the datasets mentioned in this issue and fix the wrong Family names and glottocodes
  3. Create a PR that fixes the glottocodes/families in etc/languages.csv and re-runs cldfbench
  4. Tag me on the PR so I can review and merge
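
For step 3, the family corrections could be applied to a languages file with a small helper along these lines. This is a sketch under assumptions: `apply_family_fixes` is a hypothetical helper, and the file is assumed to have `ID` and `Family` columns with dataset-local IDs. It only rewrites the CSV text; re-running cldfbench remains a separate step.

```python
import csv
import io


def apply_family_fixes(csv_text, fixes):
    """Return csv_text with the Family column corrected for the IDs in `fixes`.

    csv_text: contents of a languages.csv-style file ("ID" and "Family" columns assumed).
    fixes: dict mapping a language ID to its corrected family name.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Only rows listed in `fixes` are changed; all others pass through untouched.
        if row["ID"] in fixes:
            row["Family"] = fixes[row["ID"]]
        writer.writerow(row)
    return out.getvalue()


# Illustrative excerpt; the real etc/languages.csv may have more columns.
LANGUAGES_CSV = """ID,Glottocode,Family
Bari,bari1297,Chibcha
Miskito,misk1235,Misumalpam
"""

# Corrections taken from the Glottolog column of the list in this issue.
FIXES = {"Bari": "Chibchan", "Miskito": "Misumalpan"}

print(apply_family_fixes(LANGUAGES_CSV, FIXES))
```

Keeping the corrections in one explicit mapping makes the resulting PR easy to review against the list in this issue.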

Some of those cases might already be solved, but most will not be. Please first re-check the Glottocode cases described in #35, and reply in the respective issue once you have finished all the cases.

MuffinLinwist commented 9 months ago

@LinguList and @FredericBlum, I have addressed all the errors in the datasets listed in this issue. @chrzyki is in the process of reviewing and merging the final PRs. If everything looks fine, @chrzyki, we can consider this issue fixed.

chrzyki commented 9 months ago

Everything looks good and is merged. Thank you very much for taking the time to fix these issues!

MuffinLinwist commented 8 months ago

@LinguList, @chrzyki, and @FredericBlum, just a kind reminder that this issue is fixed and can be closed.

LinguList commented 8 months ago

Cool.