LinguList closed this issue 2 years ago
Ok, I think I understand. Is this the latest version of the statistics spreadsheet?
Yes, but this was merely to illustrate major connections first, etc. If we only look at Bangime, it is in fact not that important to average things, but I think it would be better to have some idea here, also with respect to the counts.
I'll try to solve this and come up with some proposal later next week.
I also just realized two somewhat minor but important points:
1) some concepts have more than one entry per language (dialectal variation or completely different forms)
2) some languages have lower coverage across concepts and thus are less represented overall
Will these issues influence the averages?
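Both points can plausibly shift the numbers: several entries for one concept can inflate a raw entry count, and a language with low coverage has fewer chances to show a match at all. A minimal sketch of one way to guard against this, assuming each entry is a `(concept, is_shared)` pair for a single language. The function name, the data layout, and the example entries are hypothetical and are not taken from the project's script.

```python
def shared_proportion(entries):
    """Count shared *concepts* (not entries) and relate them to coverage.

    entries: iterable of (concept, is_shared) tuples for one language.
    Returns (shared_concepts, covered_concepts, proportion).
    """
    entries = list(entries)
    # A concept counts once, no matter how many forms are listed for it.
    covered = {concept for concept, _ in entries}
    shared = {concept for concept, is_shared in entries if is_shared}
    if not covered:
        raise ValueError("language has no entries")
    return len(shared), len(covered), len(shared) / len(covered)


# Hypothetical data: "water" has two forms, both flagged as shared.
language_a = [("water", True), ("water", True), ("fire", False), ("stone", True)]
# Hypothetical low-coverage language with a single entry.
language_b = [("water", True)]

print(shared_proportion(language_a))
print(shared_proportion(language_b))
```

Counting concepts instead of entries removes the double count for "water", and the proportion makes a low-coverage language comparable to a well-covered one, although a proportion based on very few concepts is of course unstable.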
So, I have to admit finally, I am struggling with this so when you have time, let's discuss somehow...
I plan on looking into this today.
Family | Average number of words |
---|---|
Dogon | 44.68 |
Mande | 25.58 |
Atlantic | 30.00 |
Songhai | 17.00 |
Here are the numbers in Python:
{'Dogon': [18, 54, 51, 55, 53, 39, 26, 26, 51, 47, 52, 56, 60, 51, 52, 49, 51, 19, 30, 41, 49, 53],
'Mande': [48, 23, 18, 23, 22, 22, 20, 21, 21, 17, 44, 28],
'Atlantic': [30],
'Songhai': [9, 25]}
So the lists contain the number of elements shared for each language in the sample.
I did not check uniquely shared so far.
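The averages in the table can be reproduced from these lists in a few lines. This is only a minimal sketch and not the contents of `scripts/average.py`; the names `shared` and `family_averages` are made up for illustration.

```python
# Shared items per language, copied from the lists above.
shared = {
    "Dogon": [18, 54, 51, 55, 53, 39, 26, 26, 51, 47, 52,
              56, 60, 51, 52, 49, 51, 19, 30, 41, 49, 53],
    "Mande": [48, 23, 18, 23, 22, 22, 20, 21, 21, 17, 44, 28],
    "Atlantic": [30],
    "Songhai": [9, 25],
}


def family_averages(counts):
    """Return the mean number of shared items per family, rounded to 2 places."""
    return {
        family: round(sum(values) / len(values), 2)
        for family, values in counts.items()
    }


averages = family_averages(shared)
for family, average in averages.items():
    # Print in the same "Family | Average |" layout as the table.
    print(f"{family} | {average:.2f} |")
```

Note that Atlantic rests on a single language and Songhai on two, so those averages carry much less weight than the Dogon and Mande ones.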
But the numbers may be misleading. Here are the averages for words shared by the pair only (Bangime plus one family):
Family | Average |
---|---|
Mande | 4.83 |
Songhai | 1.50 |
Dogon | 17.50 |
Atlantic | 4.00 |
And the number of shared items:
{'Mande': [15, 13, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2],
'Songhai': [1, 2],
'Dogon': [8, 21, 18, 20, 26, 24, 18, 19, 12, 28, 23, 19, 15, 9, 17, 18, 20, 18, 13, 19, 12, 8],
'Atlantic': [4]}
script is in folder "scripts/average.py"
message is: With Dogon, there are between 10 and 20 words consistently shared uniquely as Bangime-Dogon. With Songhai, none uniquely. With Atlantic 4, but only one language. And Mande has a huge difference between two languages and the rest.
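A plain average hides the kind of skew described for Mande, so it can help to print the spread next to it. A small sketch using the pair counts listed above; the names `pairs` and `spread` are illustrative only and not part of the project's script.

```python
from statistics import median

# Items shared by the pair only, copied from the lists above.
pairs = {
    "Mande": [15, 13, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2],
    "Songhai": [1, 2],
    "Dogon": [8, 21, 18, 20, 26, 24, 18, 19, 12, 28, 23,
              19, 15, 9, 17, 18, 20, 18, 13, 19, 12, 8],
    "Atlantic": [4],
}


def spread(values):
    """Summarise one family: sample size, extremes, median and mean."""
    return {
        "n": len(values),
        "min": min(values),
        "median": median(values),
        "max": max(values),
        "mean": round(sum(values) / len(values), 2),
    }


for family, values in pairs.items():
    print(family, spread(values))
```

For Mande the median sits well below the mean, which reflects the two high values (15 and 13) against a rest between 2 and 4, whereas for Dogon median and mean are close together.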
> Family | Average number of words |
> ---|---|
> Dogon | 44.68 |
> Mande | 25.58 |
> Atlantic | 30.00 |
> Songhai | 17.00 |
Ok, so this is average number of words in the sample? (just so I understand)
> And the number of shared items:
> {'Mande': [15, 13, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2], 'Songhai': [1, 2], 'Dogon': [8, 21, 18, 20, 26, 24, 18, 19, 12, 28, 23, 19, 15, 9, 17, 18, 20, 18, 13, 19, 12, 8], 'Atlantic': [4]}
ah this is great! when I had started doing it by hand, I got the same results, but couldn't finish because it was taking me so much time. I will look at the script.
> message is: With Dogon, there are between 10 and 20 words consistently shared uniquely as Bangime-Dogon. With Songhai, none uniquely. With Atlantic 4, but only one language. And Mande has a huge difference between two languages and the rest.
ok this is also perfect. just what I wanted to know.
I am now changing the script to run the same analysis for Dogon as well and to include concrete language names, so that it is easier to understand what's going on (see also the email I sent to you and Hiba).
For Dogon, we get the following picture:
Family | Average |
---|---|
Mande | 5.19 |
Isolate | 17.50 |
Songhai | 4.52 |
Atlantic | 8.05 |
> script is in folder "scripts/average.py"
for some reason, it is not appearing in the scripts folder?
Short comment: Songhai is spurious, below 10 in almost all Dogon varieties. Bangime is consistent (of course).
I'll update in a sec.
Mande is also not very convincing with respect to exclusively shared candidates (but maybe there are other candidates). But again: I would need to look at the detailed Mande languages there, and I am still not sure how to do that in the statistics. And then Atlantic, where you have most candidates in Perge Tergu (14), Bondu So (11), Bankan Tey (11).
ok so this is different from what I was seeing by going by hand, but I think exclusivity versus general patterns are making the distinctions (?)
We stick to exclusive patterns now. We only look at, say, words in Dogon and Atlantic, not in Bangime and Mande, etc. Please check the new file relations.md which I just uploaded (average.py is also there now).
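To make "exclusive" concrete: a set of similar words counts for a family pair only if exactly those two families occur in it. The following is a rough sketch under that assumption and not the logic of average.py; the cluster ids, the cluster contents, the placeholder language `Atlantic_1`, and the function names are all invented for illustration.

```python
# Hypothetical language-to-family mapping (illustration only).
FAMILY = {
    "Bangime": "Isolate",
    "Perge Tergu": "Dogon",
    "Bankan Tey": "Dogon",
    "Bambara": "Mande",
    "Jenaama": "Mande",
    "Atlantic_1": "Atlantic",  # placeholder name, not a language from the data
}

# Hypothetical sets of similar words, each given as the languages attesting it.
clusters = {
    "c1": {"Bangime", "Perge Tergu"},                    # Isolate + Dogon
    "c2": {"Bangime", "Bankan Tey", "Bambara"},          # three families
    "c3": {"Perge Tergu", "Atlantic_1"},                 # Dogon + Atlantic
    "c4": {"Perge Tergu", "Bankan Tey"},                 # Dogon only
    "c5": {"Bankan Tey", "Atlantic_1", "Perge Tergu"},   # Dogon + Atlantic
}


def families_in(languages):
    """Return the set of families represented by the given languages."""
    return {FAMILY[language] for language in languages}


def exclusive(all_clusters, family_pair):
    """Return ids of clusters attested in exactly the two given families."""
    wanted = set(family_pair)
    return [
        cluster_id
        for cluster_id, languages in all_clusters.items()
        if families_in(languages) == wanted
    ]


print(exclusive(clusters, ("Dogon", "Atlantic")))
print(exclusive(clusters, ("Isolate", "Dogon")))
```

Cluster c2 is dropped for every pair because it spans three families, which is exactly what lowers the counts compared with the general (non-exclusive) comparison.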
great - looking now. when discussing the results, does it make sense to organize the semantic patterns according to those in WOLD?
I think that is up to you. Please check specifically THIS part, and please reload (I had a bug in the script).
If we only concentrate on Bangime (it's easier), we find a very striking pattern here: Jenaama and Bambara are very unique among Mande, in terms of similar vocabulary, right? And Dogon languages have a constant amount of material shared.
I am also writing this up now in the text.
this is so huge!!!!! we'll make an Africanist out of you yet! ;)
Current pairwise borrowing statistics are giving the total counts. This may be seen as okay, if we look at language-internal proportions, but the raw counts of course are not very helpful when comparing inside a language, etc. So our statistics need to be adjusted, ideally simply by counting the average of shared borrowings, e.g., Bangime vs. all Dogon: all individually shared matches and then divide by the number of individual comparisons. The same for Dogon vs. Mande: count each and every language etc. This will drastically lower the number of cases, but give us a clearer picture, and we can still treat the number of shared items as the layer, we can even rank each concept.
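The proposed adjustment (sum all individually shared matches, then divide by the number of individual comparisons) could look roughly like this. It is a sketch under assumptions: borrowing candidates are modelled as sets of language names, and the language labels `D1`, `D2`, `M1`, `M2`, the example data, and the function name `average_shared` are hypothetical.

```python
from itertools import product


def average_shared(clusters, languages_a, languages_b):
    """Average number of shared items over all individual language comparisons.

    clusters: iterable of sets, each holding the languages that share one item.
    languages_a, languages_b: the two groups to compare, e.g. [Bangime] vs. Dogon.
    """
    clusters = list(clusters)
    comparisons = list(product(languages_a, languages_b))
    if not comparisons:
        raise ValueError("need at least one language on each side")
    # For every language pair, count the items attested in both languages.
    total = sum(
        sum(1 for members in clusters if a in members and b in members)
        for a, b in comparisons
    )
    return total / len(comparisons)


# Hypothetical data: D1/D2 stand for Dogon, M1/M2 for Mande languages.
clusters = [
    {"Bangime", "D1", "D2"},
    {"Bangime", "D1"},
    {"D1", "M1"},
    {"D2", "M1", "M2"},
    {"Bangime", "M1"},
]

print(average_shared(clusters, ["Bangime"], ["D1", "D2"]))
print(average_shared(clusters, ["D1", "D2"], ["M1", "M2"]))
```

Bangime vs. Dogon gives two comparisons, Dogon vs. Mande gives four (each language against each language), so families with many languages no longer dominate simply through their size.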