Borrowing statistics should be averaged

LinguList commented 4 years ago

Current pairwise borrowing statistics give the total counts. This may be seen as okay if we look at language-internal proportions, but the raw counts are of course not very helpful when comparing inside a language, etc. So our statistics need to be adjusted, ideally simply by taking the average of shared borrowings: e.g., for Bangime vs. all Dogon, count all individually shared matches and then divide by the number of individual comparisons. The same for Dogon vs. Mande: count each and every language, etc. This will drastically lower the number of cases, but give us a clearer picture. We can still treat the number of shared items as the layer, and we can even rank each concept.
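A minimal sketch of the averaging idea described above, not the project's actual script. The concept sets and the language names `LangA`, `LangB`, `LangC` are hypothetical toy data, and the helper names `shared_counts` and `average_shared` are made up for illustration.

```python
from statistics import mean

# Hypothetical toy data: concepts flagged as borrowing candidates.
bangime = {"water", "salt", "horse", "market"}
families = {
    "Dogon": {
        "LangA": {"water", "salt", "horse"},
        "LangB": {"water", "market"},
    },
    "Mande": {
        "LangC": {"salt"},
    },
}


def shared_counts(target, languages):
    """Number of concepts shared with the target, one count per language."""
    return [len(target & concepts) for concepts in languages.values()]


def average_shared(target, languages):
    """Total shared matches divided by the number of individual comparisons."""
    return mean(shared_counts(target, languages))


for family, languages in families.items():
    counts = shared_counts(bangime, languages)
    print(family, "total:", sum(counts), "average:", average_shared(bangime, languages))
```

With raw totals, a family with many languages in the sample looks more connected simply because it contributes more comparisons; dividing by the number of comparisons removes that effect.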

IndianaTones commented 4 years ago

Ok, I think I understand. Is this the latest version of the statistics spreadsheet?

LinguList commented 4 years ago

Yes, but this was merely to illustrate major connections first, etc. If we only look at Bangime, it is in fact not that important to average things, but I think it would be better to have some idea here, also with respect to the counts.

I'll try to solve this and come up with some proposal later next week.

IndianaTones commented 4 years ago

I also just realized two somewhat minor but important points:

1) some concepts have more than one entry per language (dialectal variation or different forms completely)
2) some languages have lower coverage across concepts and thus are less represented overall

Will these issues influence the averages?
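One possible way to handle both points, sketched under assumptions: count concepts rather than forms (so a second entry for the same concept is not counted twice), and divide by the number of concepts attested for each language (so low coverage does not depress the figure). The rows, language names, and form labels below are placeholders, not project data, and this is not necessarily what the project script does.

```python
from collections import defaultdict

# Hypothetical rows: (language, concept, form, matches_bangime)
rows = [
    ("LangA", "water", "form1", True),
    ("LangA", "water", "form2", True),   # second entry for the same concept
    ("LangA", "salt", "form3", False),
    ("LangB", "water", "form4", True),   # LangB only covers one concept
]

attested = defaultdict(set)  # concepts with at least one entry per language
shared = defaultdict(set)    # concepts with at least one matching entry

for language, concept, _form, is_match in rows:
    attested[language].add(concept)
    if is_match:
        shared[language].add(concept)

# Proportion of attested concepts that have a match, per language.
proportions = {
    language: len(shared[language]) / len(attested[language])
    for language in attested
}
print(proportions)
```

Whether a proportion or a raw count is the better basis for the average is a design decision for the analysis; the sketch only shows that both issues can be made explicit.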

IndianaTones commented 4 years ago

So, I have to admit finally, I am struggling with this so when you have time, let's discuss somehow...

LinguList commented 4 years ago

I plan on looking into this today.

LinguList commented 4 years ago

Family	Average number of words
Dogon	44.68
Mande	25.58
Atlantic	30.00
Songhai	17.00

LinguList commented 4 years ago

Here are the numbers in Python:

{'Dogon': [18, 54, 51, 55, 53, 39, 26, 26, 51, 47, 52, 56, 60, 51, 52, 49, 51, 19, 30, 41, 49, 53],
 'Mande': [48, 23, 18, 23, 22, 22, 20, 21, 21, 17, 44, 28],
 'Atlantic': [30],
 'Songhai': [9, 25]}

So the lists contain the number of elements shared for each language in the sample.
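As a quick check, the per-family averages can be recomputed from these lists. This is a small sketch using only the numbers posted above; the variable names are illustrative and not taken from the project script.

```python
from statistics import mean

# Number of elements shared with Bangime, one value per language in the sample.
shared_words = {
    "Dogon": [18, 54, 51, 55, 53, 39, 26, 26, 51, 47, 52,
              56, 60, 51, 52, 49, 51, 19, 30, 41, 49, 53],
    "Mande": [48, 23, 18, 23, 22, 22, 20, 21, 21, 17, 44, 28],
    "Atlantic": [30],
    "Songhai": [9, 25],
}

averages = {family: mean(counts) for family, counts in shared_words.items()}

print("Family\tAverage number of words")
for family, average in averages.items():
    print(f"{family}\t{average:.2f}")
```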

I did not check uniquely shared so far.

LinguList commented 4 years ago

But the numbers may be misleading. Here are the averages for pairs only:

Family	Average
Mande	4.83
Songhai	1.50
Dogon	17.50
Atlantic	4.00

LinguList commented 4 years ago

And the number of shared items:

{'Mande': [15, 13, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2],
 'Songhai': [1, 2],
 'Dogon': [8, 21, 18, 20, 26, 24, 18, 19, 12, 28, 23, 19, 15, 9, 17, 18, 20, 18, 13, 19, 12, 8],
 'Atlantic': [4]}

LinguList commented 4 years ago

script is in folder "scripts/average.py"

LinguList commented 4 years ago

message is: With Dogon, there are between 10 and 20 words consistently shared uniquely as Bangime-Dogon. With Songhai, none uniquely. With Atlantic 4, but only one language. And Mande has a huge difference between two languages and the rest.
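To see the spread behind the averages, one can look at the minimum, maximum, and median per family. The sketch below uses the shared-item lists posted in this thread; the outlier rule (more than twice the family median) is an arbitrary assumption for illustration, not something the project script is known to use.

```python
from statistics import mean, median

# Shared items per language, as posted above.
shared_items = {
    "Mande": [15, 13, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2],
    "Songhai": [1, 2],
    "Dogon": [8, 21, 18, 20, 26, 24, 18, 19, 12, 28, 23,
              19, 15, 9, 17, 18, 20, 18, 13, 19, 12, 8],
    "Atlantic": [4],
}


def outliers(counts, factor=2):
    """Counts above `factor` times the median (assumed, arbitrary threshold)."""
    threshold = factor * median(counts)
    return [count for count in counts if count > threshold]


for family, counts in shared_items.items():
    print(
        f"{family}\tmean={mean(counts):.2f}\tmin={min(counts)}"
        f"\tmax={max(counts)}\toutliers={outliers(counts)}"
    )
```

Under this rule only Mande has values that stand apart from the rest of the family, which is the contrast between two languages and the remainder noted above.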

IndianaTones commented 4 years ago

Family	Average number of words
Dogon	44.68
Mande	25.58
Atlantic	30.00
Songhai	17.00

Ok, so this is average number of words in the sample? (just so I understand)

IndianaTones commented 4 years ago

And the number of shared items:

{'Mande': [15, 13, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2],
 'Songhai': [1, 2],
 'Dogon': [8, 21, 18, 20, 26, 24, 18, 19, 12, 28, 23, 19, 15, 9, 17, 18, 20, 18, 13, 19, 12, 8],
 'Atlantic': [4]}

ah this is great! when I had started doing it by hand, I got the same results, but couldn't finish because it was taking me so much time. I will look at the script.

IndianaTones commented 4 years ago

message is: With Dogon, there are between 10 and 20 words consistently shared uniquely as Bangime-Dogon. With Songhai, none uniquely. With Atlantic 4, but only one language. And Mande has a huge difference between two languages and the rest.

ok this is also perfect. just what I wanted to know.

LinguList commented 4 years ago

I am now changing the script to also run the same analysis for Dogon and to include concrete language names, so it is easier to understand what's going on (see also the email I sent to you and Hiba).

For Dogon, we get the following picture:

Family	Average
Mande	5.19
Isolate	17.50
Songhai	4.52
Atlantic	8.05

IndianaTones commented 4 years ago

script is in folder "scripts/average.py"

for some reason, it is not appearing in the scripts folder?

LinguList commented 4 years ago

Short comment: Songhai is spurious, below 10 in almost all Dogon varieties. Bangime is consistent (of course).

LinguList commented 4 years ago

I'll update in a sec.

LinguList commented 4 years ago

Mande is also not very convincing with respect to exclusively shared candidates (but maybe there are other candidates). But again: I would need to look at the detailed Mande languages there, and I am still not sure how to do that in the statistics. And then Atlantic, where you have most candidates in Perge Tergu (14), Bondu So (11), Bankan Tey (11).

IndianaTones commented 4 years ago

ok so this is different from what I was seeing by going by hand, but I think exclusivity versus general patterns is making the distinctions (?)

LinguList commented 4 years ago

We stick to exclusive patterns now. We only look at, say, words in Dogon and Atlantic, not in Bangime and Mande, etc. Please check the new file relations.md which I just uploaded (average.py is also there now).
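A small sketch of what "exclusive" means here, assuming each concept is mapped to the set of groups in which a borrowing candidate was found. The concepts, the mapping, and the function name `exclusively_shared` are hypothetical and not taken from average.py.

```python
# Hypothetical mapping: concept -> groups in which a candidate match was found.
matches = {
    "salt": {"Dogon", "Atlantic"},
    "horse": {"Dogon", "Atlantic", "Mande"},
    "market": {"Dogon", "Bangime"},
    "rice": {"Dogon", "Atlantic"},
}


def exclusively_shared(matches, groups):
    """Concepts whose matches occur in exactly the given groups and nowhere else."""
    wanted = set(groups)
    return sorted(concept for concept, found in matches.items() if found == wanted)


# Counted for Dogon-Atlantic only if no other group has a match.
print(exclusively_shared(matches, ["Dogon", "Atlantic"]))
```

Here "horse" is dropped from the Dogon-Atlantic count because Mande also has a match, which is why exclusive counts come out lower than general ones.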

IndianaTones commented 4 years ago

great - looking now. when discussing the results, does it make sense to organize the semantic patterns according to those in WOLD?

LinguList commented 4 years ago

I think that is up to you. Please check specifically THIS part, and please reload (I had a bug in the script).

LinguList commented 4 years ago

If we only concentrate on Bangime (it's easier), we find a very striking pattern here: Jenaama and Bambara are very unique among Mande, in terms of similar vocabulary, right? And Dogon languages have a constant amount of material shared.

LinguList commented 4 years ago

I am also writing this up now in the text.

IndianaTones commented 4 years ago

this is so huge!!!!! we'll make an Africanist out of you yet! ;)

lexibank / baf2

Borrowing statistics should be averaged #2