Surprise (continuing the code):
In [5]: from lingpy.evaluate.acd import bcubes
In [6]: bcubes(lex, "cognacy", "scaid")
*************************
* B-Cubed-Scores        *
* --------------------- *
* Precision:     0.6904 *
* Recall:        0.8213 *
* F-Scores:      0.7502 *
*************************
Out[6]: (0.6904483990997843, 0.8212886270328971, 0.750206428672806)
This shows that the algorithm finds A LOT of false positives (assuming the cognates in the data are correct). Usually, the algorithm is rather conservative, so this surprises me a lot. The high recall of 0.82 is also very surprising.
Maybe there is an error in the data conversion.
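For anyone trying to reproduce this, a minimal sketch of how `lex` and the `scaid` column could have been set up (the file name and the SCA threshold of 0.45 are assumptions, not necessarily the values used above):

```python
from lingpy import LexStat
from lingpy.evaluate.acd import bcubes

# hypothetical input file: any LingPy-style wordlist TSV with a "cognacy"
# column holding the gold cognate judgments would work
lex = LexStat("robbeetsaltaic.tsv")

# cluster with the SCA method and store the partition in "scaid";
# the threshold of 0.45 is an assumption, not a value confirmed above
lex.cluster(method="sca", threshold=0.45, ref="scaid")

# compare the inferred partition against the gold "cognacy" column
bcubes(lex, "cognacy", "scaid")
```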
Hmm, Mattis, how much of this is due to cognates within families rather than deep TE (Transeurasian) cognates (e.g. Koreanic:Koreanic, Tungusic:Tungusic, etc.)? I suspect the cognates are pretty good within some of these families, but the between-family cognates are more problematic. One thing that might be interesting (a lot of work?) would be to subset the data into families, calculate the B-cubed scores within families, and then compare those to the overall scores.
It might even be fun to compare pairwise (Koreanic vs. Japonic is probably higher than Koreanic vs. Mongolic).
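A rough sketch of how that per-family comparison could be done with LingPy, reusing the `lex` object from above (the `family` column name is an assumption; the dataset may store this information under a different field):

```python
from lingpy import Wordlist
from lingpy.evaluate.acd import bcubes

# header columns in their original order
header = sorted(lex.header, key=lambda c: lex.header[c])

# assumed: every row carries its language family in a "family" column
families = {lex[idx, "family"] for idx in lex}

for fam in sorted(families):
    # build a sub-wordlist containing only the rows of this family
    rows = {0: header}
    for idx in lex:
        if lex[idx, "family"] == fam:
            rows[idx] = [lex[idx, c] for c in header]
    sub = Wordlist(rows)
    # score the global SCA partition against gold cognates, per family
    p, r, f = bcubes(sub, "cognacy", "scaid", pprint=False)
    print(f"{fam}\tprecision={p:.4f}\trecall={r:.4f}\tf-score={f:.4f}")
```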
lex.get_scorer(runs=10000)
lex.cluster(method="lexstat", ref="lexstatid", threshold=0.55, cluster_method="infomap")
bcubes(lex, "cognacy", "lexstatid")
This yields (surprise!):
*************************
* B-Cubed-Scores        *
* --------------------- *
* Precision:     0.9188 *
* Recall:        0.8087 *
* F-Scores:      0.8602 *
*************************
Normally, I am happy if the method reaches 0.8 on these large datasets.
And yes, @SimonGreenhill, we need to see how many cognates are shared across language families.
And we rarely see differences of ~0.10 between SCA and LexStat on a dataset (I have never encountered this before). We need to check for cross-language-family cognates, yes.
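One quick check along these lines: count how many gold cognate sets span more than one family (again assuming a `family` column and one gold cognate ID per row):

```python
from collections import defaultdict

# map each gold cognate set to the set of families it occurs in
families_per_set = defaultdict(set)
for idx in lex:
    families_per_set[lex[idx, "cognacy"]].add(lex[idx, "family"])

cross_family = sum(1 for fams in families_per_set.values() if len(fams) > 1)
print(f"{cross_family} of {len(families_per_set)} gold cognate sets span more than one family")
```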
With SCA, high precision is only reached at a threshold of 0.25, which means the clustered sequences are almost identical. What I conclude from this is: we capture language-family-internal cognates here; cross-family cognates are rare.
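To see where SCA precision picks up, one could sweep the threshold and recompute the scores; a sketch (the threshold grid is arbitrary, and each run writes its partition to a separate column):

```python
# sweep SCA thresholds and report B-cubed scores for each partition
for t in [0.25, 0.35, 0.45, 0.55]:
    ref = f"sca_{int(t * 100)}"  # e.g. "sca_25"
    lex.cluster(method="sca", threshold=t, ref=ref)
    p, r, f = bcubes(lex, "cognacy", ref, pprint=False)
    print(f"threshold={t:.2f}: precision={p:.4f} recall={r:.4f} f-score={f:.4f}")
```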
Exactly what I would expect: cross-family “cognates” are rare and patchily distributed.
Could somebody give me a reference that I can read to make sense of those figures and follow the discussion?
I am now running the analysis for Sino-Tibetan to have some comparison here.
Yes, we have a paper that explains the methods: https://doi.org/10.1371/journal.pone.0170046
This paper also discusses the evaluation metrics used here.
And here is a tutorial that goes a bit deeper into the algorithms: https://doi.org/10.1093/jole/lzy006
Analysis still running on my computer, will share results once I have them.