make seabor analysis - GitHub Issues

LinguList commented 3 years ago

Basically, this requires running the code on lingrex. Not very complicated. I'll provide an example later.

LinguList commented 3 years ago

Essentially, you can see the first simple method here:

https://github.com/lexibank/seabor/blob/f6e16cfda3393970a50aed38cac1e9d48ef810d4/seaborcommands/fullcomparison.py#L22-L53

This method searches for cognates and then separates those which occur in different language families.

To do this, you need a wordlist. You can load the wordlist into a lexstat object easily:

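from lingpy import LexStat  # explicit import instead of the wildcard
# Load the CLDF dataset into a LexStat object, reading only the listed columns.
lex = LexStat.from_cldf(
    "cldf/cldf-metadata.json",
    columns=["language_name", "concept_name", "value", "form", "segments",
             "language_subgroup", "language_family"],
)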

If this does not work, please check the "columns" and the "namespace" parameters of the from_cldf command in the documentation, as this clarifies the namespaces, which I'd have to look up as well.
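
For illustration, a minimal sketch of how the two parameters fit together: `columns` selects which CLDF columns are read, and `namespace` renames them to the names LingPy works with (`doculect`, `concept`, `tokens`). The column selection and the last two rename pairs are assumptions, not checked against the actual CLDF files, and `load_lexstat` is a hypothetical helper.

```python
# Columns to read from the CLDF dataset (an assumption; adjust to your data).
COLUMNS = (
    "language_id", "concept_name", "value", "form", "segments",
    "language_subgroup", "language_family",
)

# Pairs of (CLDF column name, name inside LingPy).
# The first three mirror LingPy's usual defaults; the last two are hypothetical.
NAMESPACE = (
    ("language_id", "doculect"),
    ("concept_name", "concept"),
    ("segments", "tokens"),
    ("language_subgroup", "subgroup"),
    ("language_family", "family"),
)


def load_lexstat(path="cldf/cldf-metadata.json"):
    """Load a CLDF dataset into a LexStat object (needs lingpy installed)."""
    # Imported here so the constants above can be inspected without lingpy.
    from lingpy import LexStat

    return LexStat.from_cldf(path, columns=COLUMNS, namespace=NAMESPACE)
```

Calling `load_lexstat()` from the dataset's root directory should then give a LexStat object whose family information sits in a `family` column, provided the assumed column names match the data.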

From there, you can use the code I linked above, @fractaldragonflies.

LinguList commented 3 years ago

The result (please adjust the for-loop in my example) will yield two cognate identifiers, one within language families and one across them, with all cognate sets that do not cross language families set to zero. So you can conveniently inspect the data in EDICTOR and already search for interesting borrowings.
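
The idea of the cross-family identifier can be sketched in plain Python, independent of the linked code. This is only an illustration under assumptions: the toy rows are invented, and the field names `cogid` and `family` as well as the helper `cross_family_ids` are hypothetical.

```python
from collections import defaultdict


def cross_family_ids(rows, cogid="cogid", family="family"):
    """Map each row index to its cognate ID, or to 0 if the set stays in one family."""
    # Collect the language families that each cognate set covers.
    families = defaultdict(set)
    for row in rows.values():
        families[row[cogid]].add(row[family])
    # Keep the ID only for cognate sets that span more than one family.
    return {
        idx: row[cogid] if len(families[row[cogid]]) > 1 else 0
        for idx, row in rows.items()
    }


# Invented toy data: row index -> entry with a language family and a cognate ID.
rows = {
    1: {"doculect": "A", "family": "Fam1", "cogid": 1},
    2: {"doculect": "B", "family": "Fam1", "cogid": 1},
    3: {"doculect": "C", "family": "Fam2", "cogid": 2},
    4: {"doculect": "D", "family": "Fam1", "cogid": 2},
    5: {"doculect": "E", "family": "Fam2", "cogid": 3},
}

print(cross_family_ids(rows))  # {1: 0, 2: 0, 3: 2, 4: 2, 5: 0}
```

Only cognate set 2 spans two families, so it is the only one that keeps a non-zero identifier; filtering on non-zero values then gives the borrowing candidates.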

fractaldragonflies commented 3 years ago

Thanks Mattis

Created the lex style wordlist without problem after changing language_name => language_id.

Lots of activity in the analysis. Changed language_family for family.

But not sure what to do about KeyError: 'ucogid' when processing bcubes.

Not sure about fixing up the Table command either, i.e. I haven't put in 'args' yet.

Hasta mañana!

John Miller

LinguList commented 3 years ago

The B-cubed scores don't work now, since you don't have a gold standard here, right? So just ignore this part.
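
To show why a gold standard is needed, here is a minimal sketch of B-cubed scores in plain Python. It is not LingPy's implementation; the function `bcubed` and the toy cluster assignments are invented for illustration. Every item is scored by comparing its automatically inferred cluster with its gold-standard cluster, so without gold cognate IDs there is nothing to compare against.

```python
def bcubed(gold, test):
    """Return B-cubed (precision, recall, F-score) for two {item: cluster ID} mappings."""
    items = list(gold)
    precision = recall = 0.0
    for i in items:
        # Items that share a cluster with i in the gold and in the test partition.
        same_gold = {j for j in items if gold[j] == gold[i]}
        same_test = {j for j in items if test[j] == test[i]}
        overlap = len(same_gold & same_test)
        precision += overlap / len(same_test)
        recall += overlap / len(same_gold)
    precision /= len(items)
    recall /= len(items)
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented toy data: expert cognate IDs versus automatically inferred ones.
gold = {"w1": 1, "w2": 1, "w3": 2, "w4": 2}
test = {"w1": 1, "w2": 1, "w3": 1, "w4": 2}

print(bcubed(gold, test))
```

In this toy case the inferred clusters lump `w3` together with `w1` and `w2`, which lowers precision to 2/3, while recall is 3/4.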

Can you submit your code to a folder "scripts" in this repository, so we can check on this?

fractaldragonflies commented 3 years ago

Done!

Created branch 'analyze'. It has a folder scripts with the analyze.py script.
The initial wordlist (analyze.pano.tsv) and the resulting wordlist (analyzer.pano.result.tsv) from analyze are at the top level of the keypano directory.
Commented out the bcube stuff and anything after it that depends on it.
Produced the analysis. Used thresholds of 0.5 and 0.7.
Not sure how the two thresholds are treated in the output.
I see several SCA ID variables, though nothing with a Lex prefix, so I suppose these are the groupings.
Also not sure which are family-internal versus cross-family comparisons.

fractaldragonflies commented 3 years ago

Reviewed the analysis script and output some more.

Added comments to analyze.py to better understand what's happening.
Understand now that the multiple columns are for the different thresholds.
- scallid{i} column is for the cognate IDs that extend over multiple language families.
- sca{i} column is for cognate IDs with their language family name, whether or not they extend over families.
- sca{i}id seems to be internal bookkeeping and assigns an ID whether or not there are shared cognates.
Seems we are using cognate matching with the lexstat method.
- Not using the Partial class, nor the infomap methods.
- I don't know these and so can't really appreciate the differences or the reasons for choosing between methods.
The 0.6 and 0.7 thresholds allow for more matches than the 0.5. Looking at the days of the week:
- It even partially matches 'sista' with the Portuguese 'sɨʃtɐfɐiɾɐ'. Cool.
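
Why a higher threshold yields more matches can be seen in a small sketch. This uses simple single-linkage grouping as a stand-in and is not necessarily the clustering that LingPy applies; the distances are invented toy values and `flat_clusters` is a hypothetical helper.

```python
def flat_clusters(items, dist, threshold):
    """Group items, linking any pair whose distance is below the threshold."""
    parent = {i: i for i in items}

    def find(i):
        # Follow parent links up to the representative of i's group.
        while parent[i] != i:
            i = parent[i]
        return i

    for (a, b), d in dist.items():
        if d < threshold:
            parent[find(a)] = find(b)

    groups = {}
    for i in items:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Invented toy distances between four word forms (0 = identical, 1 = unrelated).
words = ["w1", "w2", "w3", "w4"]
dist = {
    ("w1", "w2"): 0.30, ("w1", "w3"): 0.55, ("w2", "w3"): 0.65,
    ("w1", "w4"): 0.90, ("w2", "w4"): 0.95, ("w3", "w4"): 0.80,
}

for t in (0.5, 0.6, 0.7):
    print(t, flat_clusters(words, dist, t))
# 0.5 [['w1', 'w2'], ['w3'], ['w4']]
# 0.6 [['w1', 'w2', 'w3'], ['w4']]
# 0.7 [['w1', 'w2', 'w3'], ['w4']]
```

Raising the threshold from 0.5 to 0.6 pulls `w3` into the first cognate set, which is the same effect as the looser matches seen at the higher thresholds.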

Would like to discuss - online is fine.

Need to think about annotating borrowing as well for some development and test subsets in order to refine and test!

intercontinental-dictionary-series / keypano

make seabor analysis #10