Open LinguList opened 2 years ago
@SimonGreenhill, @maryewal @antipodite, what is revealing about this analysis is that with 170 concepts, which amounts to a mutual coverage of 80% in the subsets, we cover only 154 out of > 400 languages in the data. This means that the mutual coverage is much lower for the whole sample.
If you want to know the mutual coverage for all languages in the sample, this can be computed as follows:
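from lingpy.compare.sanity import mutual_coverage_check

# wl is the wordlist under discussion; find the highest threshold that every language pair passes
for i in range(180, 1, -1):
    if mutual_coverage_check(wl, i):
        print("Mutual coverage of data is {0}".format(i))
        break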
Result is:
Mutual coverage of data is 40
This means that the minimal number of concepts with translations shared between any two varieties in the sample is 40.
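To make the notion concrete, here is a minimal sketch in plain Python of what such a check amounts to, as I understand it: take every pair of languages, count the concepts for which both have a translation, and report the smallest count. The toy data and the helper names `minimal_mutual_coverage` and `passes_check` are made up for illustration and are not part of LingPy.

```python
from itertools import combinations

# Hypothetical toy data: language -> set of concepts that have a translation.
toy_coverage = {
    "A": {"hand", "eye", "water", "fire"},
    "B": {"hand", "eye", "water"},
    "C": {"hand", "fire"},
}


def minimal_mutual_coverage(coverage):
    """Return the smallest number of concepts shared by any language pair."""
    return min(
        len(coverage[a] & coverage[b])
        for a, b in combinations(sorted(coverage), 2)
    )


def passes_check(coverage, threshold):
    """True if every pair of languages shares at least `threshold` concepts."""
    return minimal_mutual_coverage(coverage) >= threshold


# A/B share 3 concepts, A/C share 2, B/C share only "hand".
print(minimal_mutual_coverage(toy_coverage))  # prints 1
```

A single sparse pair is enough to pull the value down, which is why one or two short lists can dominate the figure for the whole sample.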
However, it may be possible to delete some outliers here and to arrive at some decent enough coverage, since
from lingpy.compare.sanity import average_coverage
average_coverage(wl)
yields
0.723091307174041
So on average, languages share roughly 0.72 * 210 ≈ 151 concepts.
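For intuition, the same kind of figure can be sketched on toy data: average the number of shared concepts over all language pairs and divide by the total number of concepts. This is an assumption about what the average expresses, not a copy of the LingPy implementation, and `average_mutual_coverage` is a hypothetical helper name.

```python
from itertools import combinations

# Hypothetical toy data: language -> set of concepts that have a translation.
toy_coverage = {
    "A": {"hand", "eye", "water", "fire"},
    "B": {"hand", "eye", "water"},
    "C": {"hand", "fire"},
}


def average_mutual_coverage(coverage):
    """Mean number of concepts shared per language pair, as a share of all concepts."""
    all_concepts = set().union(*coverage.values())
    pairs = list(combinations(sorted(coverage), 2))
    mean_shared = sum(len(coverage[a] & coverage[b]) for a, b in pairs) / len(pairs)
    return mean_shared / len(all_concepts)


ratio = average_mutual_coverage(toy_coverage)
# Pairs share 3, 2 and 1 concepts, so the mean is 2 out of 4 concepts.
print(ratio)  # prints 0.5
```

Multiplying the ratio by the number of concepts gives back the average number of shared concepts, which is the step taken above with 210 concepts.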
So maybe it makes sense to just ignore those languages which have fewer than 150 concepts to start with. But the question is whether any of those languages are crucial for the analyses:
from tabulate import tabulate

# collect all languages with fewer than 150 concepts
table = []
for language, concepts in wl.coverage().items():
    if concepts < 150:
        table.append([language, concepts])
print(tabulate(sorted(table, key=lambda x: x[1])))
This would be 62 candidates:
------------------- ---
alasu 81
Nimoa 83
Lup 91
Niuafoou 94
Andra 96
Jiriw 96
Sori 96
Bujan 97
Pak 97
Ponam 97
Bipi 98
Tulu 98
Baluan 99
Mokaren 99
Tasmate 100
Riwo 107
Amblong 108
Nokuku 108
Polonombauk 108
ButmasTurButmas 110
Piamatsina 110
Vunapu 110
Marino 111
Wailapa 111
Pingilapese 113
Kis 114
Mokilese 115
Satawalese 124
Ghayavi 126
Ulithian 126
Kapone 127
Moenebe 134
camuki 135
Nmi 135
Ara 136
Avek 136
Moavek 136
Pinje 136
Poai 136
Poapoa 136
Sirehe 136
Aek 137
Aragur 137
Aro 137
Boewe 137
Ciri 137
Moaek 137
Neku 137
Poamei 137
ua 137
Wamoa 137
Bilibil 138
Doura 138
HaliaSelau 138
UveaWest 139
BaliVitu 142
Tolomako 143
Roro 145
Tokelau 146
NyelayuBelepDialect 147
PuloAnnan 147
IfiraMeleMeleFila 149
------------------- ---
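To see how removing such candidates would affect the minimal mutual coverage, one could filter the sparse lists and re-run the check. The following is only a sketch on invented toy data in plain Python; `drop_sparse` and `minimal_mutual_coverage` are hypothetical helpers, and the threshold of 3 stands in for the 150 used above.

```python
from itertools import combinations

# Hypothetical toy data: language -> set of concepts that have a translation.
toy_coverage = {
    "A": {"hand", "eye", "water", "fire"},
    "B": {"hand", "eye", "water"},
    "C": {"fire"},
}


def drop_sparse(coverage, min_concepts):
    """Keep only languages with at least `min_concepts` attested concepts."""
    return {lang: c for lang, c in coverage.items() if len(c) >= min_concepts}


def minimal_mutual_coverage(coverage):
    """Return the smallest number of concepts shared by any language pair."""
    return min(
        len(coverage[a] & coverage[b])
        for a, b in combinations(sorted(coverage), 2)
    )


before = minimal_mutual_coverage(toy_coverage)  # B and C share nothing
kept = drop_sparse(toy_coverage, 3)             # removes the sparse list C
after = minimal_mutual_coverage(kept)           # only the pair A/B remains
print(before, after)  # prints 0 3
```

Whether the gain is worth it depends on which languages fall out, which is the question the table above is meant to answer.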
Looks like a lot of New Caledonia and several Micronesian languages, which may be important to include... Some may end up having "duplicate" lists, so it wouldn't be a problem to exclude them in favor of another, more comprehensive list for the same glottocode (e.g. Ifira-Mele). @SimonGreenhill, what do you think?
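Picking the more comprehensive list per glottocode could be sketched as follows. The records, list names, codes and counts are invented placeholders rather than values from the data, and `best_per_glottocode` is a hypothetical helper.

```python
# Hypothetical records: (list name, glottocode, number of concepts covered).
# All names, codes and counts below are invented placeholders.
records = [
    ("ListA1", "code0001", 149),
    ("ListA2", "code0001", 190),
    ("ListB", "code0002", 96),
]


def best_per_glottocode(records):
    """For each glottocode, keep the list that covers the most concepts."""
    best = {}
    for name, glottocode, n_concepts in records:
        if glottocode not in best or n_concepts > best[glottocode][1]:
            best[glottocode] = (name, n_concepts)
    return best


selected = best_per_glottocode(records)
for glottocode, (name, n_concepts) in sorted(selected.items()):
    print(glottocode, name, n_concepts)
```

A glottocode with only one list keeps that list even if it is short, so no language drops out of the sample through this step alone.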
I'm not too worried about low coverage here as we are going for breadth in the language sample, and unfortunately many of those low-count lists are languages we definitely need (e.g. lots of the low ones are from the Admiralties, which is pretty vital but under-described). We should try to replace low-coverage languages with better versions if we have them.
The result is: