lexibank / robbeetstriangulation

CLDF dataset derived from Robbeets et al.'s "Triangulation of the Transeurasian Languages" from 2021
Creative Commons Attribution 4.0 International

Delta Scores #12

Closed LinguList closed 1 week ago

LinguList commented 2 years ago

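Following up on the detection by @SimonGreenhill with delta scores, we can easily test this without referring to any other dataset by computing delta scores for the full dataset and for each subgroup separately, and comparing the two.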

If we find a huge discrepancy, this is a hint (in my opinion) that the data was collected in a bottom-up fashion by considering only proto-forms across subgroups, instead of searching all against all for cognate residues.
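For reference, a minimal sketch of how per-taxon delta scores can be computed from a pairwise distance matrix, following the usual quartet-based definition. The function name `delta_scores` and the list-of-lists input format are assumptions made for illustration; this is not necessarily how scripts/deep.py does it.

```python
from itertools import combinations


def delta_scores(dist):
    """Return one delta score per taxon for a symmetric distance matrix.

    `dist` is a square list of lists (hypothetical input format).
    """
    n = len(dist)
    if n < 4:
        # Groups with fewer than four languages contain no quartets.
        raise ValueError("need at least four taxa to form a quartet")
    totals = [0.0] * n
    counts = [0] * n
    for x, y, u, v in combinations(range(n), 4):
        # The three ways to split a quartet into two pairs.
        sums = sorted(
            (
                dist[x][y] + dist[u][v],
                dist[x][u] + dist[y][v],
                dist[x][v] + dist[y][u],
            ),
            reverse=True,
        )
        denom = sums[0] - sums[2]
        # A perfectly tree-like quartet has sums[0] == sums[1], so delta is 0.
        delta = (sums[0] - sums[1]) / denom if denom > 0 else 0.0
        for taxon in (x, y, u, v):
            totals[taxon] += delta
            counts[taxon] += 1
    # Each taxon's score is the mean delta over all quartets containing it.
    return [total / count for total, count in zip(totals, counts)]
```

A group-level score can then be taken as the mean over taxa. A score near 0 means the distances are close to tree-like, and higher values indicate more conflicting signal.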

LinguList commented 2 years ago

Indo-European

| Subgroup | Delta | STD |
|---|---|---|
| all | 0.23 | 0.03 |
| Balto-Slavic | 0.35 | 0.04 |
| Celtic | 0.20 | 0.07 |
| Germanic | 0.26 | 0.04 |
| Indo-Aryan | 0.27 | 0.03 |
| Romance | 0.35 | 0.03 |

Sino-Tibetan

| Subgroup | Delta | STD |
|---|---|---|
| all | 0.26 | 0.04 |
| Kiranti | 0.38 | 0.05 |
| Kuki-Chin | 0.10 | 0.00 |
| Sinitic | 0.32 | 0.07 |
| Tani-Yidu | 0.02 | 0.00 |
| Tibeto-Dulong | 0.16 | 0.04 |

Dravidian

| Subgroup | Delta | STD |
|---|---|---|
| all | 0.27 | 0.04 |
| South Dravidian | 0.36 | 0.05 |

Altaic

| Subgroup | Delta | STD |
|---|---|---|
| all | 0.14 | 0.02 |
| Japonic | 0.24 | 0.04 |
| Koreanic | 0.24 | 0.05 |
| Mongolic | 0.32 | 0.04 |
| Tungusic | 0.27 | 0.03 |
| Turkic | 0.34 | 0.03 |

LinguList commented 2 years ago

My suspicion could be confirmed:

  1. all datasets show slightly lower scores for ALL languages as compared to the subgroups
  2. Altaic is still exceptional, but differences are not AS strong as they seem (or are they?)
  3. the low delta scores for the family level can be explained by the fact that for longer distances there is less reticulation, and quartets are more likely to be in agreement
  4. however, the rather sharp difference in Altaic is due to the fact that cognates are only annotated for the root level, so only the best cognates across subgroups were selected, and cases surviving in single dialects and the like were not recorded
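To put rough numbers on points 1 and 2, here is a small sketch that compares each family-level score with the unweighted mean of its subgroup scores, using the delta values posted earlier in this thread. Taking an unweighted mean is an assumption chosen for simplicity (it ignores subgroup size), and this is plain arithmetic, not a statistical test.

```python
from statistics import mean

# Delta scores copied from the tables earlier in this thread.
scores = {
    "Indo-European": {"all": 0.23, "subgroups": [0.35, 0.20, 0.26, 0.27, 0.35]},
    "Sino-Tibetan": {"all": 0.26, "subgroups": [0.38, 0.10, 0.32, 0.02, 0.16]},
    "Dravidian": {"all": 0.27, "subgroups": [0.36]},
    "Altaic": {"all": 0.14, "subgroups": [0.24, 0.24, 0.32, 0.27, 0.34]},
}

# Gap = mean subgroup delta minus family-level delta (positive means
# the family as a whole looks more tree-like than its subgroups).
gaps = {
    family: round(mean(entry["subgroups"]) - entry["all"], 3)
    for family, entry in scores.items()
}

for family, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{family}: {gap:+.3f}")
```

On these numbers the gap is largest for Altaic (0.142), followed by Dravidian (0.090) and Indo-European (0.056). For Sino-Tibetan the unweighted subgroup mean falls below the family-level score.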

@SimonGreenhill, do you think this explanation makes sense? Can we test this in any way?

LinguList commented 2 years ago

Code is in scripts/deep.py!

SimonGreenhill commented 2 years ago

Looks good. Can you print out N too (i.e. how many languages are in each group)?

Also, can you compare github.com/lexibank/oskolskayatungusic/ to Altaic:Tungusic, and github.com/lexibank/savelyevturkic/ to Altaic:Turkic? This is a direct test of point 4, as Osk. and Sav. are the datasets used in the Altaic dataset. If you wanted an exact test, you could prune Altaic:Tungusic vs. Osk. and Altaic:Turkic vs. Sav. so that each pair has exactly the same languages.
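The pruning step could look roughly like the sketch below. It assumes each dataset has already been reduced to a list of language identifiers (for example Glottocodes) plus a distance matrix in the same order; the function name `prune_to_shared` and this input format are hypothetical.

```python
def prune_to_shared(taxa_a, dist_a, taxa_b, dist_b):
    """Restrict two distance matrices to the taxa they have in common.

    `taxa_*` are lists of language identifiers ordered like the rows of
    the matching `dist_*` matrix (hypothetical input format).
    """
    shared = sorted(set(taxa_a) & set(taxa_b))

    def restrict(taxa, dist):
        # Row/column positions of the shared taxa, in the shared order.
        idx = [taxa.index(taxon) for taxon in shared]
        return [[dist[i][j] for j in idx] for i in idx]

    return shared, restrict(taxa_a, dist_a), restrict(taxa_b, dist_b)
```

Both pruned matrices come back in the same taxon order, so delta scores computed on them refer to the same set of languages.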

(You could also do github.com/lexibank/leejaponic vs. Japonic and github.com/lexibank/leekorean vs. Koreanic, but these are from different authors, so not as telling.)

Nexus files will be in phlorest if you'd prefer them.

SimonGreenhill commented 2 years ago

Hmm. If you really wanted to dig into this, you could compare counts of the patterns, e.g. which patterns from Oskolskaya:Tungusic were removed in Altaic:Tungusic. But this might be a lot of work for little gain.
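One way to sketch that comparison: treat a "pattern" as the set of languages that attest a cognate set, count how often each pattern occurs in the source dataset and in the derived one, and subtract. The input format (a dict from cognate set ID to language identifiers) and the function names are assumptions for illustration only.

```python
from collections import Counter


def pattern_counts(cognate_sets, languages):
    """Count presence patterns over `languages`.

    `cognate_sets` maps a cognate set ID to the languages attesting it
    (hypothetical input format).
    """
    keep = set(languages)
    counts = Counter()
    for members in cognate_sets.values():
        pattern = frozenset(members) & keep
        if pattern:  # skip sets attested only outside `languages`
            counts[pattern] += 1
    return counts


def removed_patterns(source, derived, languages):
    """Patterns that are more frequent in `source` than in `derived`."""
    # Counter subtraction keeps only patterns with a positive remainder.
    return pattern_counts(source, languages) - pattern_counts(derived, languages)
```

If singleton patterns (a cognate set attested in one language only) dominate the result, that would be in line with point 4 above.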

LinguList commented 2 years ago

Yep, I'd do the checking of subgroups in a different thread, but adding subgroup sizes (some were excluded as they had no quartets) is also important.

LinguList commented 2 years ago

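Indo-European

| Subgroup | Delta | STD | Size |
|---|---|---|---|
| all | 0.23 | 0.03 | 94 |
| Balto-Slavic | 0.35 | 0.04 | 16 |
| Celtic | 0.20 | 0.07 | 6 |
| Germanic | 0.26 | 0.04 | 17 |
| Indo-Aryan | 0.27 | 0.03 | 30 |
| Romance | 0.35 | 0.03 | 15 |

Sino-Tibetan

| Subgroup | Delta | STD | Size |
|---|---|---|---|
| all | 0.26 | 0.04 | 50 |
| Kiranti | 0.38 | 0.05 | 7 |
| Kuki-Chin | 0.10 | 0.00 | 4 |
| Sinitic | 0.32 | 0.07 | 7 |
| Tani-Yidu | 0.02 | 0.00 | 4 |
| Tibeto-Dulong | 0.16 | 0.04 | 21 |

Dravidian

| Subgroup | Delta | STD | Size |
|---|---|---|---|
| all | 0.27 | 0.04 | 20 |
| South Dravidian | 0.36 | 0.05 | 11 |

Altaic

| Subgroup | Delta | STD | Size |
|---|---|---|---|
| all | 0.14 | 0.02 | 101 |
| Japonic | 0.24 | 0.04 | 16 |
| Koreanic | 0.24 | 0.05 | 16 |
| Mongolic | 0.32 | 0.04 | 15 |
| Tungusic | 0.27 | 0.03 | 22 |
| Turkic | 0.34 | 0.03 | 32 |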