Checking Mutual Coverage

LinguList commented 2 years ago

from lingpy.compare.sanity import mutual_coverage_subset, mutual_coverage_check
from lingpy import Wordlist
from tabulate import tabulate

wl = Wordlist.from_cldf("cldf/cldf-metadata.json")

# For each threshold, count the languages in the largest subset whose members
# all share at least that many concepts with one another.
table = []
for i in range(210, 170, -1):
    # the first element of the return value is the number of languages
    table.append([i, mutual_coverage_subset(wl, i)[0]])
print(tabulate(table, headers=["Concepts", "Number of Languages"]))

The result is:

  Concepts    Number of Languages
----------  ---------------------
       210                     12
       209                     14
       208                     23
       207                     26
       206                     29
       205                     35
       204                     38
       203                     42
       202                     45
       201                     50
       200                     54
       199                     57
       198                     60
       197                     63
       196                     65
       195                     68
       194                     74
       193                     78
       192                     83
       191                     85
       190                     89
       189                     91
       188                     94
       187                     98
       186                    104
       185                    107
       184                    111
       183                    114
       182                    119
       181                    123
       180                    129
       179                    133
       178                    134
       177                    141
       176                    144
       175                    144
       174                    148
       173                    151
       172                    152
       171                    154

LinguList commented 2 years ago

@SimonGreenhill, @maryewal, @antipodite, what this analysis reveals is that even at 171 concepts, which amounts to a mutual coverage of roughly 80% of the 210 concepts, the largest subset covers only 154 of the more than 400 languages in the data. This means that the mutual coverage is much lower for the whole sample.
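To make the notion of a mutually covered subset concrete, here is a minimal brute-force sketch on toy data. The data, the language labels, and the function name `largest_mutual_subset` are hypothetical, and this is not how lingpy computes the result; it only illustrates the definition (every pair in the subset shares at least `threshold` concepts) and is only feasible for a handful of languages.

```python
from itertools import combinations

# Hypothetical toy data: language -> concepts that have at least one form.
toy = {
    "A": {"hand", "eye", "water", "fire", "stone"},
    "B": {"hand", "eye", "water", "fire"},
    "C": {"hand", "eye", "water", "dog"},
    "D": {"hand", "dog"},
}


def largest_mutual_subset(coverage, threshold):
    """Return the largest group of languages in which every pair shares
    at least `threshold` concepts (brute force, toy-sized input only)."""
    languages = sorted(coverage)
    # Try the biggest candidate groups first and stop at the first hit.
    for size in range(len(languages), 1, -1):
        for subset in combinations(languages, size):
            if all(
                len(coverage[a] & coverage[b]) >= threshold
                for a, b in combinations(subset, 2)
            ):
                return list(subset)
    return []


for threshold in (4, 3, 1):
    subset = largest_mutual_subset(toy, threshold)
    print(threshold, len(subset), subset)
```

Raising the threshold shrinks the subset, which is the same trade-off the table above shows for the real data.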

LinguList commented 2 years ago

If you want to know the mutual coverage for all languages in the sample, this can be computed as follows:

# Walk down from 180 and stop at the first threshold the whole sample satisfies.
for i in range(180, 1, -1):
    if mutual_coverage_check(wl, i):
        print("Mutual coverage of data is {0}".format(i))
        break

The result is:

Mutual coverage of data is 40

LinguList commented 2 years ago

This means that every pair of varieties in the sample shares translations for at least 40 concepts; 40 is the minimum across all pairs.
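The same minimum can be read off directly from pairwise overlaps. The sketch below uses hypothetical toy data and a hypothetical helper `pairwise_overlaps`; it does not call lingpy, and only shows which pairs determine the value.

```python
from itertools import combinations

# Hypothetical toy data: variety -> concepts that have a translation.
coverage = {
    "A": {"hand", "eye", "water", "fire", "stone"},
    "B": {"hand", "eye", "water", "fire"},
    "C": {"hand", "eye", "water", "dog"},
    "D": {"hand", "dog"},
}


def pairwise_overlaps(coverage):
    """Map each pair of varieties to the number of concepts both attest."""
    return {
        (a, b): len(coverage[a] & coverage[b])
        for a, b in combinations(sorted(coverage), 2)
    }


overlaps = pairwise_overlaps(coverage)
minimum = min(overlaps.values())
# The pairs that attain the minimum are the ones limiting the whole sample.
worst = [pair for pair, shared in overlaps.items() if shared == minimum]
print(minimum, worst)
```

In this toy case a single sparse variety pulls the minimum down for the whole sample, which is why a few poorly covered lists can dominate the figure.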

However, it may be possible to remove some outliers and arrive at decent enough coverage, since

from lingpy.compare.sanity import average_coverage
average_coverage(wl)

yields

0.723091307174041

So on average, languages share about 0.72 * 210 ≈ 151 concepts.
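As a rough illustration of how such a proportion can arise, the sketch below computes the mean pairwise overlap on toy data and divides it by the total number of concepts. This definition is an assumption made for illustration and may differ from what `average_coverage` does internally; the data and the function name `average_pairwise_coverage` are hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: variety -> concepts that have a translation.
coverage = {
    "A": {"hand", "eye", "water", "fire", "stone"},
    "B": {"hand", "eye", "water", "fire"},
    "C": {"hand", "eye", "water", "dog"},
    "D": {"hand", "dog"},
}


def average_pairwise_coverage(coverage):
    """Mean number of concepts shared per pair, as a proportion of all
    concepts in the data (assumed definition, for illustration only)."""
    all_concepts = set().union(*coverage.values())
    overlaps = [
        len(coverage[a] & coverage[b])
        for a, b in combinations(sorted(coverage), 2)
    ]
    return sum(overlaps) / len(overlaps) / len(all_concepts)


score = average_pairwise_coverage(coverage)
# Multiplying the proportion by the number of concepts gives the mean overlap.
print(round(score, 3), round(score * 6, 2))
```

Multiplying the proportion back by the concept count, as in the 0.72 * 210 step above, recovers the average number of shared concepts per pair.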

LinguList commented 2 years ago

So maybe it makes sense to just ignore those languages which have fewer than 150 concepts to start with. But the question is whether any of those languages are crucial for the analyses:

table = []

# collect the languages that have fewer than 150 concepts
for language, concepts in wl.coverage().items():
    if concepts < 150:
        table.append([language, concepts])

print(tabulate(sorted(table, key=lambda x: x[1])))

This would be 62 candidates:

-------------------  ---
alasu                 81
Nimoa                 83
Lup                   91
Niuafoou              94
Andra                 96
Jiriw                 96
Sori                  96
Bujan                 97
Pak                   97
Ponam                 97
Bipi                  98
Tulu                  98
Baluan                99
Mokaren               99
Tasmate              100
Riwo                 107
Amblong              108
Nokuku               108
Polonombauk          108
ButmasTurButmas      110
Piamatsina           110
Vunapu               110
Marino               111
Wailapa              111
Pingilapese          113
Kis                  114
Mokilese             115
Satawalese           124
Ghayavi              126
Ulithian             126
Kapone               127
Moenebe              134
camuki               135
Nmi                  135
Ara                  136
Avek                 136
Moavek               136
Pinje                136
Poai                 136
Poapoa               136
Sirehe               136
Aek                  137
Aragur               137
Aro                  137
Boewe                137
Ciri                 137
Moaek                137
Neku                 137
Poamei               137
ua                   137
Wamoa                137
Bilibil              138
Doura                138
HaliaSelau           138
UveaWest             139
BaliVitu             142
Tolomako             143
Roro                 145
Tokelau              146
NyelayuBelepDialect  147
PuloAnnan            147
IfiraMeleMeleFila    149
-------------------  ---

maryewal commented 2 years ago

Looks like a lot of New Caledonia and several Micronesian languages, which may be important to include... Some may end up having "duplicate" lists, so it wouldn't be a problem to exclude them in favor of another, more comprehensive list for the same glottocode (e.g. Ifira-Mele). @SimonGreenhill, what do you think?

SimonGreenhill commented 2 years ago

I'm not too worried about low coverage here as we are going for breadth in the language sample, and unfortunately many of those low count things are languages we definitely need (e.g. lots of the low things are from the Admiralties, which is pretty vital but under-described). We should try to replace low coverage languages with better versions if we have them.
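One way to pick the better version when several lists share a glottocode is to keep the list with the most concepts. The sketch below assumes each list is described by a name, a glottocode, and a concept count; all names, glottocodes, and counts in it are hypothetical placeholders, and `best_list_per_glottocode` is not an existing lingpy function.

```python
# Hypothetical records: (list name, glottocode, number of concepts covered).
records = [
    ("VarietyA_1", "glot0001", 149),
    ("VarietyA_2", "glot0001", 198),
    ("VarietyB", "glot0002", 97),
]


def best_list_per_glottocode(records):
    """Keep, for each glottocode, the list that covers the most concepts."""
    best = {}
    for name, glottocode, concepts in records:
        # Replace the stored list only if this one has strictly more concepts.
        if glottocode not in best or concepts > best[glottocode][1]:
            best[glottocode] = (name, concepts)
    return best


selected = best_list_per_glottocode(records)
for glottocode, (name, concepts) in sorted(selected.items()):
    print(glottocode, name, concepts)
```

Languages with only one list, such as the third placeholder record, are kept regardless of their coverage, which matches the goal of breadth in the sample.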

lexibank / abvdoceanic

Checking Mutual Coverage #28