lexibank / robbeetstriangulation

CLDF dataset derived from Robbeets et al.'s "Triangulation of the Transeurasian Languages" from 2021
Creative Commons Attribution 4.0 International

First coverage statistics with Python example #2

Closed LinguList closed 1 week ago

LinguList commented 3 years ago

| Name | Words | Proportion |
|:---|---:|---:|
| AmamiAsama | 252 | 0.992126 |
| AmamiYamatohama | 252 | 0.992126 |
| AmamiYoron | 250 | 0.984252 |
| Azeri | 252 | 0.992126 |
| Baoan | 250 | 0.984252 |
| BarabaTatar | 210 | 0.826772 |
| Bashkir | 254 | 1 |
| Buriat | 254 | 1 |
| Chuvash | 248 | 0.976378 |
| CodexCumanicus | 214 | 0.84252 |
| CrimeanTatar | 248 | 0.976378 |
| Dagur | 251 | 0.988189 |
| Dolgan | 235 | 0.925197 |
| Dongxian | 246 | 0.968504 |
| EasternEvenki | 100 | 0.393701 |
| Even | 253 | 0.996063 |
| EvenkiKamnigan | 209 | 0.822835 |
| Fukuoka | 230 | 0.905512 |
| Gagauz | 250 | 0.984252 |
| Gangwon | 253 | 0.996063 |
| Gyeonggi | 254 | 1 |
| Hachijo | 230 | 0.905512 |
| Hezhe | 244 | 0.96063 |
| Huzhu | 250 | 0.984252 |
| Hwanghae | 254 | 1 |
| Japanese | 254 | 1 |
| Jeju | 252 | 0.992126 |
| Jurchen | 180 | 0.708661 |
| Kagoshima | 238 | 0.937008 |
| Kalmyck | 253 | 0.996063 |
| Kamnigan | 249 | 0.980315 |
| Kangjia | 239 | 0.940945 |
| KarachayBalkar | 250 | 0.984252 |
| Karaim | 252 | 0.992126 |
| KaraKalpak | 250 | 0.984252 |
| Kazakh | 250 | 0.984252 |
| KazanTatar | 252 | 0.992126 |
| Khakas | 252 | 0.992126 |
| Khalaj | 228 | 0.897638 |
| Khalkha | 253 | 0.996063 |
| Kirghiz | 252 | 0.992126 |
| Korean | 245 | 0.964567 |
| Koshikiislands | 236 | 0.929134 |
| Kumamoto | 227 | 0.893701 |
| Kumyk | 251 | 0.988189 |
| KurUrmi | 251 | 0.988189 |
| LateMiddleKorean | 251 | 0.988189 |
| Manchu | 251 | 0.988189 |
| MiddleChulym | 237 | 0.933071 |
| MiddleMongolianMuqaddimataladab | 216 | 0.850394 |
| MiddleMongolianSecretHistory | 196 | 0.771654 |
| Minhe | 241 | 0.948819 |
| MiyakoIrabu | 249 | 0.980315 |
| Moghol | 152 | 0.598425 |
| NanaiBikin | 254 | 1 |
| NanaiMiddleAmur | 253 | 0.996063 |
| Negidal | 250 | 0.984252 |
| Nogai | 252 | 0.992126 |
| NorthAltai | 242 | 0.952756 |
| NorthernChungcheong | 253 | 0.996063 |
| NorthernEvenkiTura | 247 | 0.972441 |
| NorthernEvenkiTutonchany | 217 | 0.854331 |
| NorthernGyeongsang | 253 | 0.996063 |
| NorthernHamgyong | 254 | 1 |
| NorthernJeolla | 253 | 0.996063 |
| NorthernPyongan | 254 | 1 |
| Oirat | 248 | 0.976378 |
| OkinawaShuri | 250 | 0.984252 |
| OkinawaYonamine | 248 | 0.976378 |
| OldJapanese | 250 | 0.984252 |
| OldTurkic | 232 | 0.913386 |
| Oroch | 253 | 0.996063 |
| Orok | 254 | 1 |
| Oroqen | 249 | 0.980315 |
| Salar | 226 | 0.889764 |
| ShiraYughur | 252 | 0.992126 |
| Shor | 249 | 0.980315 |
| Solon | 252 | 0.992126 |
| SouthAltai | 249 | 0.980315 |
| SouthernChungcheong | 253 | 0.996063 |
| SouthernEvenkiChiringda | 204 | 0.80315 |
| SouthernEvenkiVershinaTuturyBaikal | 54 | 0.212598 |
| SouthernGyeongsang | 253 | 0.996063 |
| SouthernHamgyong | 254 | 1 |
| SouthernJeolla | 253 | 0.996063 |
| SouthernPyongan | 254 | 1 |
| StonyEvenkiPTPodkamennayaTunguska | 252 | 0.992126 |
| Tofa | 246 | 0.968504 |
| Turkish | 253 | 0.996063 |
| Turkmen | 251 | 0.988189 |
| Tuvan | 247 | 0.972441 |
| Udihe | 253 | 0.996063 |
| Ulcha | 254 | 1 |
| Uyghur | 251 | 0.988189 |
| Uzbek | 248 | 0.976378 |
| WestYugur | 238 | 0.937008 |
| Xibe | 250 | 0.984252 |
| YaeyamaHatoma | 251 | 0.988189 |
| YaeyamaIshigaki | 250 | 0.984252 |
| Yakut | 250 | 0.984252 |
| Yonaguni | 251 | 0.988189 |

LinguList commented 3 years ago

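The language coverage looks good, as far as I can tell. But there are outliers: SouthernEvenkiVershinaTuturyBaikal (54 words), EasternEvenki (100 words) and Moghol (152 words) fall well below the rest.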
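
One way to pick out such outliers is to flag every language whose proportion falls below a cut-off. The sketch below is only an illustration: the word counts are a subset copied from the table, the total of 254 concepts is inferred from the languages with proportion 1, and both the 0.8 threshold and the `low_coverage` helper are hypothetical choices, not part of the original analysis.

```python
# Word counts copied from the coverage table (a subset, for illustration).
counts = {
    "Bashkir": 254,
    "BarabaTatar": 210,
    "Jurchen": 180,
    "Moghol": 152,
    "EasternEvenki": 100,
    "SouthernEvenkiVershinaTuturyBaikal": 54,
}

TOTAL_CONCEPTS = 254  # assumption: inferred from languages with proportion 1
THRESHOLD = 0.8       # hypothetical cut-off, not taken from the thread


def low_coverage(counts, total, threshold):
    """Return (language, proportion) pairs below the threshold, lowest first."""
    rows = [
        (language, words / total)
        for language, words in counts.items()
        if words / total < threshold
    ]
    return sorted(rows, key=lambda row: row[1])


for language, proportion in low_coverage(counts, TOTAL_CONCEPTS, THRESHOLD):
    print(f"{language}\t{proportion:.2f}")
```

With a stricter or looser threshold the list changes accordingly, so the cut-off should be chosen with the intended analysis in mind.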

LinguList commented 3 years ago

@tpellard, the code I used for this with lingpy and tabulate (both on pip):

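from lingpy import Wordlist
from tabulate import tabulate

# Load the CLDF dataset, keeping only the columns needed here.
wl = Wordlist.from_cldf(
    'cldf/cldf-metadata.json',
    columns=["language_id", "parameter_name", "value", "form", "segments", "cognacy"])

# One row per language: name, number of concepts covered, share of all concepts.
table = []
for language, forms in wl.coverage().items():
    table.append([language, forms, forms / wl.height])

# Print as a pipe-delimited (Markdown) table.
print(tabulate(table, headers=["Name", "Words", "Proportion"], tablefmt="pipe"))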
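
For readers without lingpy at hand, the Words and Proportion columns can be understood as the number of distinct concepts attested per language, divided by the total number of concepts in the dataset. The following is a minimal sketch of that idea in plain Python, not lingpy's implementation: the `coverage_table` function and the toy rows are hypothetical, and it assumes the input is a sequence of (language, concept) pairs.

```python
from collections import defaultdict


def coverage_table(rows):
    """Build [language, words, proportion] rows from (language, concept) pairs."""
    concepts_by_language = defaultdict(set)
    all_concepts = set()
    for language, concept in rows:
        # Sets ensure that several forms for one concept are counted once.
        concepts_by_language[language].add(concept)
        all_concepts.add(concept)
    total = len(all_concepts)
    return [
        [language, len(concepts), len(concepts) / total]
        for language, concepts in sorted(concepts_by_language.items())
    ]


# Hypothetical toy data: two concepts, one language missing "water".
toy_rows = [
    ("Turkish", "hand"),
    ("Turkish", "water"),
    ("Turkish", "water"),  # a second form for the same concept
    ("Moghol", "hand"),
]

for row in coverage_table(toy_rows):
    print(row)
```

Counting concepts rather than forms means synonyms do not inflate a language's coverage.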