lexibank / robbeetstriangulation

CLDF dataset derived from Robbeets et al.'s "Triangulation of the Transeurasian Languages" from 2021
Creative Commons Attribution 4.0 International

How to measure accuracy of automated cognate detection and compute automatic cognates #5

Closed LinguList closed 1 week ago

LinguList commented 3 years ago
In [1]: from lingpy import *

In [2]: wl = Wordlist.from_cldf('cldf/cldf-metadata.json', columns=["language_id", "concept_name", "value", "form", "segments", "cognacy"])

In [3]: lex = LexStat(wl)

In [4]: lex.cluster(method="sca", threshold=0.45, ref="scaid", cluster_method="infomap")

Analysis still running on my computer, will share results once I have them.

LinguList commented 3 years ago

Surprise (continuing the code):

In [5]: from lingpy.evaluate.acd import bcubes

In [6]: bcubes(lex, "cognacy", "scaid")
*************************
* B-Cubed-Scores        *
* --------------------- *
* Precision:     0.6904 *
* Recall:        0.8213 *
* F-Scores:      0.7502 *
*************************
Out[6]: (0.6904483990997843, 0.8212886270328971, 0.750206428672806)
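For readers following along, the B-cubed scores above can be reproduced conceptually with a minimal pure-Python sketch (this is not lingpy's implementation; the toy `gold`/`pred` dicts are invented for illustration):

```python
from collections import defaultdict

def bcubed_scores(gold, pred):
    """B-cubed precision, recall, and F-score for two clusterings,
    each given as an {item: cluster_label} dict over the same items."""
    gold_clusters, pred_clusters = defaultdict(set), defaultdict(set)
    for item, label in gold.items():
        gold_clusters[label].add(item)
    for item, label in pred.items():
        pred_clusters[label].add(item)

    precision = recall = 0.0
    for item in gold:
        g = gold_clusters[gold[item]]
        p = pred_clusters[pred[item]]
        overlap = len(g & p)
        precision += overlap / len(p)  # purity of the predicted cluster
        recall += overlap / len(g)     # completeness w.r.t. the gold cluster
    n = len(gold)
    precision, recall = precision / n, recall / n
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore

# toy example: the predicted clustering over-merges "c" into cluster 1,
# which lowers precision while recall stays perfect
gold = {"a": 1, "b": 1, "c": 2, "d": 3}
pred = {"a": 1, "b": 1, "c": 1, "d": 2}
print(bcubed_scores(gold, pred))  # precision ≈ 0.667, recall = 1.0, F ≈ 0.8
```

Low precision with high recall, as in the SCA run above, means the detected cognate sets lump together items that the gold annotation keeps apart.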
LinguList commented 3 years ago

This shows that the algorithm finds A LOT of false positives (if the cognates in the data are correct). Usually, the algorithm is rather conservative, so this surprises me a lot. The high recall of 0.82 is also very surprising.

LinguList commented 3 years ago

Maybe an error in data conversion.

SimonGreenhill commented 3 years ago

Hmm, Mattis, how much of this is due to cognates within families rather than deep TE cognates (e.g. Koreanic:Koreanic, Tungusic:Tungusic, etc.)? I suspect the cognates are pretty good within some of these families, but the between-family cognates are more problematic. One thing that might be interesting (a lot of work?) would be to subset the data into families, calculate the B-cubed scores within families, and then compare them to the overall scores.

It might even be fun to compare pairwise (Koreanic vs. Japonic is probably higher than Koreanic vs. Mongolic).
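A first step toward this subsetting idea can be sketched in plain Python, assuming a hypothetical language-to-family mapping and toy cognate-set labels (in the real analysis these would come from the wordlist's language and cognacy columns): count how many gold cognate sets span more than one family, since only those reflect deep TE cognacy.

```python
from collections import defaultdict

# hypothetical rows: (language, family, gold cognate set id); in the real
# dataset these would be read from the wordlist's columns
rows = [
    ("Korean", "Koreanic", "c1"),
    ("Middle Korean", "Koreanic", "c1"),
    ("Japanese", "Japonic", "c2"),
    ("Okinawan", "Japonic", "c2"),
    ("Japanese", "Japonic", "c3"),
    ("Korean", "Koreanic", "c3"),
]

# collect the set of families represented in each cognate set
families_per_set = defaultdict(set)
for language, family, cogid in rows:
    families_per_set[cogid].add(family)

# cognate sets confined to one family vs. sets spanning several
cross = sum(1 for fams in families_per_set.values() if len(fams) > 1)
within = len(families_per_set) - cross
print(cross, within)  # only c3 spans two families → 1 2
```

If most sets turn out to be family-internal, the overall B-cubed scores mostly measure within-family cognate detection.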

LinguList commented 3 years ago
lex.get_scorer(runs=10000)
lex.cluster(method="lexstat", ref="lexstatid", threshold=0.55, cluster_method="infomap")
bcubes(lex, "cognacy", "lexstatid")

This yields (surprise!):

*************************
* B-Cubed-Scores        *
* --------------------- *
* Precision:     0.9188 *
* Recall:        0.8087 *
* F-Scores:      0.8602 *
*************************
LinguList commented 3 years ago

Normally, I am happy if the method reaches 0.8 on these large datasets.

And yes, @SimonGreenhill, we need to see how many cognates there are across languages.

LinguList commented 3 years ago

And we rarely see differences of ~0.10 between SCA and LexStat on a dataset (I have never encountered this before). We need to check for cross-language-family cognates, yes.

LinguList commented 3 years ago

SCA reaches high precision only with thresholds around 0.25, which means sequences are almost identical. What I conclude from this is: we capture language-family-internal cognates here; cross-family cognates are rare.

RustyGray commented 3 years ago

Exactly what I would expect: cross-family “cognates” are rare and patchily distributed. r.


tpellard commented 3 years ago

Could somebody give me a reference that I can read to make sense of those figures and follow the discussion?

LinguList commented 3 years ago

I am now running the analysis for Sino-Tibetan to have some comparison here.

LinguList commented 3 years ago

Yes, we have a paper that explains the methods: https://doi.org/10.1371/journal.pone.0170046

This paper also discusses the evaluation metrics used here.

And here's a tutorial that goes a bit deeper into the algorithms: https://doi.org/10.1093/jole/lzy006