lexibank / abvdoceanic

Creative Commons Attribution 4.0 International

Systematic Search for Extreme Cases of Synonymy in the Data #26

Open LinguList opened 2 years ago

LinguList commented 2 years ago

We have discussed several possibilities here. Searching for extreme values will help us to assess where potential errors lie, or where languages may show interesting developments.

  1. synonymy per language, since the overall synonymy is rather high at 1.16
  2. languages with large numbers of consonants and vowels
  3. a type-token analysis to uncover languages with rare sounds
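Point 3 could be sketched along the following lines. This is a minimal illustration over toy data; in the real dataset, the segmented forms would come from the Segments column of the CLDF forms table, and `type_token` is a hypothetical helper, not an existing library function.

```python
from collections import Counter

# Toy segmented forms per language (hypothetical data shape).
forms = {
    "Marquesan": [["p", "a", "k", "i"], ["t", "u", "k", "i"], ["t", "a"]],
    "Chuukese": [["m", "w", "a"], ["s", "a"]],
}

def type_token(forms_by_language):
    """Return (types, tokens) per language: the number of distinct sounds
    and the total number of sound occurrences."""
    result = {}
    for language, segmented in forms_by_language.items():
        counts = Counter(s for form in segmented for s in form)
        result[language] = (len(counts), sum(counts.values()))
    return result

print(type_token(forms))
```

Languages whose type count is high relative to their token count would then be candidates for harboring rare (possibly erroneous) sounds.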
LinguList commented 2 years ago

Synonymy can be conveniently checked with LingPy.

from lingpy import Wordlist
from lingpy.compare.sanity import synonymy
from tabulate import tabulate

# load the wordlist from the CLDF dataset
wl = Wordlist.from_cldf('cldf/cldf-metadata.json')
# synonymy() yields the number of forms per (language, concept) pair
synonyms = synonymy(wl)
counts = {i: {"concepts": [], "languages": []} for i in range(1, 15)}

for (language, concept), freq in synonyms.items():
    counts[freq]["concepts"] += [concept]
    counts[freq]["languages"] += [language]

table = []
for i in range(1, 11):
    table += [[i, len(counts[i]["concepts"]), len(set(counts[i]["concepts"])), len(set(counts[i]["languages"]))]]

print(tabulate(table, headers=["Synonyms", "Occurrences", "Concepts", "Languages"]))
LinguList commented 2 years ago

Results for this analysis are revealing (!):

  Synonyms    Occurrences    Concepts    Languages
----------  -------------  ----------  -----------
         1          63624         210          417
         2           7696         210          380
         3           1140         204          204
         4            461         142          203
         5            124          70           48
         6             62          27           49
         7             26          19           19
         8             12          10           10
         9              0           0            0
        10              1           1            1
LinguList commented 2 years ago

If we tolerate concepts with two forms, which I think is fine, there are still more than 1500 cases of concepts with more than two forms, and we even have one extreme case of 10 (!) synonyms! The extreme cases with more than 6 forms per concept are:

table = []
for i in [7, 8, 10]:  # no cases with 9 synonyms, so skip it
    for concept, language in zip(counts[i]["concepts"], counts[i]["languages"]):
        table += [[i, concept, language]]
print(tabulate(table, headers=["Synonyms", "Concept", "Language"]))

Result is:

  Synonyms  Concept           Language
----------  ----------------  --------------------
         7  big               Rarotongan
         7  narrow            Rarotongan
         7  hand              LamogaiMulakaino
         7  no, not           Pukapuka
         7  to breathe        Rennellese
         7  to cut, hack      FutunaEast
         7  we                Toambaita
         7  to hit            Emae
         7  to hit            IfiraMeleMeleFila
         7  dry               RapanuiEasterIsland
         7  black             RapanuiEasterIsland
         7  to think          Puluwatese
         7  to count          Puluwatese
         7  small             Chuukese
         7  Six               Chuukese
         7  to cut, hack      TungagTungakLavongai
         7  where?            TungagTungakLavongai
         7  we                Haku
         7  we                Naman
         7  to open, uncover  Aiwoo
         7  to cut, hack      Ulithian
         7  flower            Ulithian
         7  where?            Ulithian
         7  to cook           Bola
         7  rope              Nukeria
         7  no, not           NyelayuBelepDialect
         8  to fall           Marquesan
         8  ashes             Marquesan
         8  to split          Waropen
         8  we                FutunaAniwa
         8  to suck           FutunaEast
         8  to cut, hack      IfiraMeleMeleFila
         8  to cut, hack      RapanuiEasterIsland
         8  thou              Chuukese
         8  we                Lengo
         8  in, inside        Ulithian
         8  to come           NyelayuBelepDialect
         8  if                NyelayuBelepDialect
        10  to hit            Marquesan

So there are 10 ways to hit a person in Marquesan and 8 ways to express the first-person pronoun in Lengo.

LinguList commented 2 years ago

@maryewal and @antipodite and @SimonGreenhill, I am not sure how best to proceed, but if we say we tolerate up to 3 synonyms (which is still a lot), it would probably be important to check the remaining >600 cases of excess synonymy point by point, to make sure the phylogenetic analysis is not influenced too much by it.
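Extracting those cases for manual review could look like this. This is a sketch assuming the `(language, concept) -> count` mapping returned by `synonymy()` in the snippet above, shown here with a toy stand-in; `flag_excess` is a hypothetical helper.

```python
import csv
import io

# Toy stand-in for the (language, concept) -> count mapping from synonymy().
synonyms = {
    ("Marquesan", "to hit"): 10,
    ("Lengo", "we"): 8,
    ("Bola", "to cook"): 2,
}

def flag_excess(synonyms, threshold=3):
    """Return (language, concept, count) rows exceeding the threshold,
    worst cases first."""
    rows = [(lang, concept, freq)
            for (lang, concept), freq in synonyms.items() if freq > threshold]
    return sorted(rows, key=lambda r: -r[2])

# Export the flagged cases as TSV for point-by-point checking.
out = io.StringIO()
writer = csv.writer(out, delimiter="\t")
writer.writerow(["Language", "Concept", "Synonyms"])
writer.writerows(flag_excess(synonyms))
print(out.getvalue())
```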

LinguList commented 2 years ago

I'll do the type-token analysis, etc., later, as I have to attend to other things now.

maryewal commented 2 years ago

Re: the multiple synonyms (e.g. 10 'to hit' in MQS), are these concepts actually all coded for cognacy or just entered as forms?

LinguList commented 2 years ago

The code doesn't check for this, but we can adjust it later on.
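An adjustment along these lines would count distinct cognate sets instead of raw forms. This is a hedged sketch over toy rows; in the real data the input would come from cldf/cognates.csv, and `cognate_synonymy` is a hypothetical helper, not part of LingPy.

```python
from collections import defaultdict

# Toy rows: (language, concept, cognate set ID); None = not cognate-coded.
rows = [
    ("Marquesan", "to hit", "tohit-6"),
    ("Marquesan", "to hit", "tohit-6"),
    ("Marquesan", "to hit", "tohit-7"),
    ("Marquesan", "to hit", "tohit-8"),
    ("Marquesan", "to hit", None),
]

def cognate_synonymy(rows):
    """Count distinct cognate sets per (language, concept) pair; forms
    without cognate coding each count as a singleton set."""
    coded = defaultdict(set)
    uncoded = defaultdict(int)
    for language, concept, cogid in rows:
        if cogid is None:
            uncoded[language, concept] += 1
        else:
            coded[language, concept].add(cogid)
    pairs = set(coded) | set(uncoded)
    return {p: len(coded[p]) + uncoded[p] for p in pairs}

print(cognate_synonymy(rows))
```

With this counting, forms that are already merged into one cognate set no longer inflate the synonymy figures.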

LinguList commented 2 years ago

But here's what we find in the cldf file (cldf/cognates.csv):

Marquesan-72_tohit-1-1,Marquesan-72_tohit-1,paki,tohit-6,false,expert,Greenhilletal2008,,,
Marquesan-72_tohit-3-1,Marquesan-72_tohit-3,paì,tohit-6,false,expert,Greenhilletal2008,,,
Marquesan-72_tohit-5-1,Marquesan-72_tohit-5,patu,tohit-7,false,expert,Greenhilletal2008,,,
Marquesan-72_tohit-6-1,Marquesan-72_tohit-6,tuki,tohit-8,false,expert,Greenhilletal2008,,,
Marquesan-72_tohit-7-1,Marquesan-72_tohit-7,tuì,tohit-8,false,expert,Greenhilletal2008,,,
Marquesan-72_tohit-10-1,Marquesan-72_tohit-10,ta,tohit-10,false,expert,Greenhilletal2008,,,
LinguList commented 2 years ago

And here's the original CLDF forms.csv file:

Marquesan-72_tohit-1,91669,Marquesan,72_tohit,paki/paki,paki,p a k i,Donner des petits coups avec la main (Dln),,6,false,^p a k i$,Marquesan
Marquesan-72_tohit-2,91669,Marquesan,72_tohit,paki/paki,paki,p a k i,Donner des petits coups avec la main (Dln),,6,false,^p a k i$,Marquesan
Marquesan-72_tohit-3,91670,Marquesan,72_tohit,paì/paì,paì,p a ì,Donner des petits coups avec la main (Dln),,6,false,^p a ì$,Marquesan
Marquesan-72_tohit-4,91670,Marquesan,72_tohit,paì/paì,paì,p a ì,Donner des petits coups avec la main (Dln),,6,false,^p a ì$,Marquesan
Marquesan-72_tohit-5,91671,Marquesan,72_tohit,patu,patu,p a t u,"Strike (in flaying skin or bark), strike, nudge with elbow (I)",,7,false,^p a t u$,Marquesan
Marquesan-72_tohit-6,91672,Marquesan,72_tohit,tuki,tuki,t u k i,"Battre, ecrasser, piler (Dln)",,8,false,^t u k i$,Marquesan
Marquesan-72_tohit-7,91673,Marquesan,72_tohit,tuì,tuì,t u ì,"Battre, ecrasser, piler (Dln)",,8,false,^t u ì$,Marquesan
Marquesan-72_tohit-8,135713,Marquesan,72_tohit,kere,kere,k e r e,to hit (punch with the fist),,,false,^k e r e$,Marquesan
Marquesan-72_tohit-9,135714,Marquesan,72_tohit,pehi,pehi,p e h i,hit generally,,,false,^p e h i$,Marquesan
Marquesan-72_tohit-10,135748,Marquesan,72_tohit,ta,ta,t a,"""to strike"" (perhaps taa)",,10,false,^t a$,Marquesan
LinguList commented 2 years ago

This reveals a problem in the CLDF conversion which is not in the original data: forms like paki/paki are interpreted as two forms, although they are one. We have to modify the CLDF conversion accordingly to take only the first form (this can easily be accounted for).
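The fix could be as simple as a small normalization step in the conversion; `normalize_value` here is a hypothetical helper for illustration, not the actual lexibank hook.

```python
def normalize_value(value):
    """Keep only the first variant of a slash-separated form,
    so that e.g. 'paki/paki' yields a single form 'paki'."""
    return value.split("/")[0].strip()

print(normalize_value("paki/paki"))  # -> paki
print(normalize_value("patu"))       # -> patu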

LinguList commented 2 years ago

This would yield:

Marquesan-72_tohit-1,91669,Marquesan,72_tohit,paki/paki,paki,p a k i,Donner des petits coups avec la main (Dln),,6,false,^p a k i$,Marquesan
Marquesan-72_tohit-3,91670,Marquesan,72_tohit,paì/paì,paì,p a ì,Donner des petits coups avec la main (Dln),,6,false,^p a ì$,Marquesan
Marquesan-72_tohit-5,91671,Marquesan,72_tohit,patu,patu,p a t u,"Strike (in flaying skin or bark), strike, nudge with elbow (I)",,7,false,^p a t u$,Marquesan
Marquesan-72_tohit-6,91672,Marquesan,72_tohit,tuki,tuki,t u k i,"Battre, ecrasser, piler (Dln)",,8,false,^t u k i$,Marquesan
Marquesan-72_tohit-7,91673,Marquesan,72_tohit,tuì,tuì,t u ì,"Battre, ecrasser, piler (Dln)",,8,false,^t u ì$,Marquesan
Marquesan-72_tohit-8,135713,Marquesan,72_tohit,kere,kere,k e r e,to hit (punch with the fist),,,false,^k e r e$,Marquesan
Marquesan-72_tohit-9,135714,Marquesan,72_tohit,pehi,pehi,p e h i,hit generally,,,false,^p e h i$,Marquesan
Marquesan-72_tohit-10,135748,Marquesan,72_tohit,ta,ta,t a,"""to strike"" (perhaps taa)",,10,false,^t a$,Marquesan

The cognate set IDs are the numbers in the fourth column from the right (counting by commas). We have 6, 7, 8, and 10, plus two forms (kere, pehi) without a cognate set.
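Counting columns by commas is error-prone here, since the quoted gloss fields contain commas themselves; a quick check with Python's csv module over the rows above extracts the cognate set IDs (the fourth field from the right, with uncoded forms yielding an empty string):

```python
import csv
import io

# A few of the deduplicated forms.csv rows from above.
rows = '''Marquesan-72_tohit-1,91669,Marquesan,72_tohit,paki/paki,paki,p a k i,Donner des petits coups avec la main (Dln),,6,false,^p a k i$,Marquesan
Marquesan-72_tohit-5,91671,Marquesan,72_tohit,patu,patu,p a t u,"Strike (in flaying skin or bark), strike, nudge with elbow (I)",,7,false,^p a t u$,Marquesan
Marquesan-72_tohit-6,91672,Marquesan,72_tohit,tuki,tuki,t u k i,"Battre, ecrasser, piler (Dln)",,8,false,^t u k i$,Marquesan
Marquesan-72_tohit-8,135713,Marquesan,72_tohit,kere,kere,k e r e,to hit (punch with the fist),,,false,^k e r e$,Marquesan
Marquesan-72_tohit-10,135748,Marquesan,72_tohit,ta,ta,t a,"""to strike"" (perhaps taa)",,10,false,^t a$,Marquesan'''

# csv.reader handles the quoted glosses correctly, unlike a plain split(',').
cognate_ids = [fields[-4] for fields in csv.reader(io.StringIO(rows))]
print(cognate_ids)
```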

LinguList commented 2 years ago

Still a lot of variation, right?

maryewal commented 2 years ago

Thanks, Mattis. Yes, still significant variation; looks like 6 sets.

SimonGreenhill commented 2 years ago

yeah, a lot of the synonyms are culled through being coded as cognate set x.

Note that this dataset here is out of date. It will need to be updated when we've revised the cognate coding.

SimonGreenhill commented 2 years ago

Note that the Python library NexusMaker will handle all the ABVD-specific quirks appropriately, so we do not want to make a NEXUS file from LingPy or EDICTOR.

LinguList commented 2 years ago

Sure, I was not thinking that concretely anyway: I was more interested in the potential impact this has on the sound inventories, which was the starting point of today's meeting, where some problems occurred that brought us in the end to questions of synonymy and mutual coverage (both also important for sound inventory counts, I think).

LinguList commented 2 years ago

Btw, if the dataset is going to change soon, how useful is it to work on detailed orthography profiles now? Should that work also be put on hold?

SimonGreenhill commented 2 years ago

that's probably ok; the focus has been on the cognate sets rather than the lexemes.