Closed: LinguList closed this issue 4 years ago.
I don't understand what you mean exactly by "manually assign" them and "catch them in the code".
Check lexibank_pharaocoracholaztecan.py. There I wrote code that essentially parses the word document, catches newlines inside the table (I converted the table to plain text, but had to deal with multiple newlines inside the same table), and also identifies concepts, etc.
This code allows us to check certain things explicitly (which I call "manually"). This has the advantage of allowing us to do things without touching the original data, as it has been published as is, and it makes more sense to not touch it anymore (only if you write a new paper and do more codings).
@maunus, I have now checked the cognates again. There are some cases not clear to me (I refer to cognates.tsv extracted from your Excel sheet).
What is the difference between `a` and `A`, as I find in row 3? What is the difference between `?` and `-`, both occurring in row 10, for example? You mostly have an `A/B` structure, but in one case you have `a/(B)` (50), in another case you have `A/(B)` (54), and in one case you have `C D` (is the latter `C/D`?). And what is `ab` in line 68?

I have a concrete proposal for how to cope with this.
If you check the following examples, there are not many ambiguous cases:
{
"A(B)": ["A"],
"A/(B)": ["A"],
"A/B": ["A", "B"],
"A/B/C": ["A", "B", "C"],
"A/B/D": ["A", "B", "D"],
"A/B?": ["A"],
"A/C": ["A", "C"],
"B/(A)": ["A"],
"B/(a)": ["B"],
"B/C": ["B", "C"],
"C D": ["C", "D"],
"C/(B)": ["C"],
"C/B": ["C", "B"],
"C/E": ["C", "E"],
"D/B": ["D", "B"],
"a/(B)": ["a"],
"a/A": ["a", "A"],
"a/B": ["a", "B"],
"ab": ["ab"],
}
The data is provided as a Python dictionary (or JSON data structure) here. You can see how I map the source to the target: `C/B` yields two elements, but `a/(B)` yields only one element, assuming you also could not count that in your nexus file.
If you have two cognates, we provide the word form twice. This is not best practice, but we tolerate it for now, as this is also an example dataset to show you how to do cognate annotation in a more consistent and transparent way with additional tools and long table formats.
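As a sketch of how that duplication plays out in the long table format, a row whose raw cognate cell maps to two sets simply appears twice; the column layout and the `expand` helper here are illustrative, not the repo's actual code:

```python
# Excerpt of the mapping table above (illustrative subset)
MAPPING = {"A/B": ["A", "B"], "a/(B)": ["a"]}

def expand(language, concept, form, raw):
    """Emit one long-table row per cognate set the form belongs to."""
    return [(language, concept, form, cog) for cog in MAPPING[raw]]

# A form coded A/B yields two rows, one per cognate set
rows = expand("Cora", "hand", "form1", "A/B")
```

So the word form is repeated, once with cognate set A and once with B, which is what the long table ends up containing.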
If you want to modify parts of the decisions I made here, just point me to them here, or change them directly in the code.
Most of those differences are really just information to myself about the more detailed structure of the cognates: A and a are different versions of the same cognate root, whereas B and b would be two versions of another root. I haven't actually used this in the analysis but just treated a/A as the same. I would like to, though, since it would give a more fine-grained structure of shared roots and innovations (but it adds information about phonological changes, grammatical innovations, etc., so I don't know if it really belongs).
Okay. If we now have as an initial goal just to make it possible to derive the nexus or the distances file as it was underlying your paper, we'd then say: lower case and upper case are the same, right? For all purposes going beyond this, for additional analyses, I recommend starting from the wordlist file submitted in examples and loading it into edictor. It has several advantages: first, it shows the long table format we use, which allows annotating cognates and words in the same table; second, you just git-clone this repository and then open the file in edictor. You can annotate cognates, etc., and use this for future studies (and I can always help if there are problems).
I think in the first rows I was trying to keep apart ? and - as two different kinds of missing data, one being when there is no data in the sources, and the other being when the extant sources do not allow us to reconstruct a form for PCN. But it seems that in the lower rows I abandoned this distinction (as I probably realized it makes no difference to the analysis). I think we should probably just have "?" for "unknown" across the board.
In 50, Cora and Huichol have a compound root combining A+B; Nahuatl has root A, but also root B, though in another meaning, so it shouldn't figure under "navel". Other UA has only root A. So the meaning of the parentheses is that the root is there, but that it shouldn't count in the analysis (basically extra information for us, but irrelevant to the computation). C D is supposed to be C/D.
In line 68 it seems they all ought to be capitals, AB and A and B, since there isn't any distinction between a and A, or b and B.
Yes, lower case and upper case should be treated the same in the nexus file, and anything in () should be ignored.
And yes, I want to start learning edictor once we are done with this part. I want to use it for my Nahuatl dialect database.
I think the change I would make to your proposal is: `"ab": ["A", "B"]`
Should I be cleaning the cognates.tsv file now? Or will that screw up the stuff you have already been extracting from it?
If we just delete the stuff in () and change all the lower case into capitals we could dispense with the extra code. The information they represent really is only useful for qualitative purposes.
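Put together, those rules (ignore anything in parentheses, drop tokens marked uncertain with `?`, treat lower case as upper case, and read `ab` as the two sets A and B) can be sketched as a small helper. This is an illustration of the agreed conventions, not the code in the repo:

```python
import re

def parse_cognates(cell):
    """Split a raw cognate cell like 'a/(B)' into upper-cased cognate IDs."""
    cell = re.sub(r"\([^)]*\)", "", cell)    # ignore parenthesized parts
    tokens = [t for t in re.split(r"[/ ]+", cell) if t]
    out = []
    for tok in tokens:
        if tok.endswith("?"):                # uncertain cognate: skip it
            continue
        if len(tok) > 1:                     # e.g. 'ab' -> ['A', 'B']
            out.extend(ch.upper() for ch in tok)
        else:
            out.append(tok.upper())
    return out

parse_cognates("a/(B)")   # ['A']
parse_cognates("C D")     # ['C', 'D']
```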
Rather not clean it; we better handle it in the code. Since this is "officially published", we post-edit it rather than the original source.
All done already. There is no extra code but a mapping, so it is better to leave it as is and keep the original data intact.
Ok, we keep it as is then. Though, I feel the version here is in a way a more "official publication" than the pdf at my website, and I would like it to be better.
Ok, in the distances.dst file there are more decimals than I operated with - where do they come from?
It is hard to compare with the languages in a different order.
I didn't include the proto-languages in my distance matrix, and for the distance number I simply counted the number of cognates out of 100, so I got 0.65 for Cora/Huichol.
Here is the matrix I used:
And here is the one at distances.dst compared with the one I used in Splitstree
I can't really figure out how to compare the two tables. The numbers are inverted, right? So Cora/Huichol gives a distance of 0.3579, but 65/100 shared forms. In the distance matrix I put into Splitstree I used 0.35 there (just taking 1 minus 65/100).
Cognate counting is a tricky business.
There are several ways to count, and often, it is not clear which version one uses.
E.g., you have missing data: how do you count?
how do you count shared cognates?
Our standard calculation in lingpy only compares items existing in both languages. Furthermore, in case of multiple matches, it averages: if you have A/B, it'll give 0.5 to shared A and 0.5 to shared B, etc.
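One plausible reading of that procedure, as a sketch over wordlists given as `{concept: set of cognate IDs}` dicts (an illustration, not lingpy's actual implementation):

```python
def cognate_distance(wl1, wl2):
    """Distance between two wordlists given as {concept: set of cognate IDs}.

    Concepts missing from either list are skipped (missing data is not
    compared), and multiple matches are averaged: A/B vs. A scores 0.5,
    A/B vs. A/B scores 1.0.
    """
    shared_concepts = [c for c in wl1 if c in wl2]
    if not shared_concepts:
        return 1.0  # no mutual coverage at all: maximal distance
    total = sum(
        len(wl1[c] & wl2[c]) / max(len(wl1[c]), len(wl2[c]))
        for c in shared_concepts
    )
    return 1.0 - total / len(shared_concepts)

# A/B vs. A counts as half a match:
cognate_distance({"hand": {"A"}, "navel": {"A", "B"}},
                 {"hand": {"A"}, "navel": {"A"}})   # 0.25
```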
Excluding languages is trivial; we just have to adjust the script.
Ok, so that does change the outcome a bit, and accounts for the decimal differences. Now I want to see what the network looks like with those figures.
Here's the count of shared cognates (ignoring meanings):
Language 1 | Language 2 | Count |
---|---|---|
Cahita | Cora | 33 |
Cahita | Huichol | 36 |
Cahita | Tarahumaran | 66 |
Cahita | Tepiman | 57 |
Cora | Huichol | 63 |
Cora | Tarahumaran | 26 |
Cora | Tepiman | 34 |
Huichol | Tarahumaran | 32 |
Huichol | Tepiman | 35 |
Tarahumaran | Tepiman | 51 |
So there are differences, but it's hard to tell why.
Oh, I didn't exclude proto-Nahua, by the way. That is important.
2 cognates lower for Cora/Huichol
Some of the differences are really large.
Wait, I found the bug. We forgot to account for upper-casing the "a" etc.
LA | LB | COUNT |
---|---|---|
Cahita | Cora | 45 |
Cahita | Huichol | 51 |
Cahita | Tarahumaran | 68 |
Cahita | Tepiman | 60 |
Cora | Huichol | 67 |
Cora | Tarahumaran | 40 |
Cora | Tepiman | 43 |
Huichol | Tarahumaran | 46 |
Huichol | Tepiman | 44 |
Tarahumaran | Tepiman | 55 |
Excellent. Can you include proto-Nahuan in the list of shared cognates?
LA | LB | COUNT |
---|---|---|
Cahita | Cora | 45 |
Cahita | Huichol | 51 |
Cahita | Tarahumaran | 68 |
Cahita | Tepiman | 60 |
Cahita | ProtoNahua | 54 |
Cora | Huichol | 67 |
Cora | Tarahumaran | 40 |
Cora | Tepiman | 43 |
Cora | ProtoNahua | 58 |
Huichol | Tarahumaran | 46 |
Huichol | Tepiman | 44 |
Huichol | ProtoNahua | 57 |
Tarahumaran | Tepiman | 55 |
Tarahumaran | ProtoNahua | 44 |
Tepiman | ProtoNahua | 49 |
BTW: the numbers still differ, since you counted shared cognates only once per cognate set, so AB in one language and AB in another would count only one time. This is a bit inconsistent, since you also counted AB vs. A as one match, so the count here (which is also easier to code on the fly) just counts all shared cognate sets. I checked with Cora vs. Huichol, where you find two ABs, which makes up for 65 + 2 = 67.
We have a NOTE.md file on GitHub. There, one can add custom comments. So you could do so and explain a bit more, if you want. E.g., your matrix would be useful there. And we can also add your nexus file here directly.
Great, thanks! There are some odd shifts for example now Cora/Nahuatl has 58 where I originally counted 53, and Nahuatl/Huichol has 57 where I counted 56.
It seems most numbers are higher. Does it count A/B vs. A/B as a single match or as a double match?
I found the suggestion in this article to make a lot of sense: it suggests counting percentages of shared vocabulary not out of 100 but only out of the potential cognates. So when there is missing data the number of potential cognates falls, and when there are double cognates it rises above 100. Is this something we could/should do?
Haugen, Jason D., Michael Everdell, and Benjamin A. Kuperman. "Uto-Aztecan Lexicostatistics 2.0." International Journal of American Linguistics 86, no. 1 (2020): 1-30.
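That suggestion amounts to changing the denominator. A sketch under the same `{concept: set of cognate IDs}` representation (my reading of the proposal, not the article's exact procedure):

```python
def shared_percentage(wl1, wl2):
    """Percent shared vocabulary out of potential (mutually attested) cognates.

    Missing data shrinks the denominator; double cognates (A/B matching
    A/B) can push the count past 100.
    """
    potential = [c for c in wl1 if c in wl2]   # concepts attested in both
    shared = sum(len(wl1[c] & wl2[c]) for c in potential)
    return 100.0 * shared / len(potential)

# Two concepts attested in both lists, two shared sets on 'x', none on 'y':
shared_percentage({"x": {"A", "B"}, "y": {"C"}},
                  {"x": {"A", "B"}, "y": {"D"}})   # 100.0
```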
See my note above on AB counting.
Yes, I read it after typing.
Well, you know, with cognate counting, I would say: there are so many ways, it won't make much difference. The most important thing is: make it standardized, make it transparent how you count, or use code that always does the same.
I think the point in that article is that since some of the UA languages have very little documentation, the missing data can skew the numbers quite a bit.
The debate is very long, more advanced is the technique by Starostin (whom not many read), and they have a standardized procedure.
He's the first who also said that borrowings should count as missing data.
Ah, that is interesting. This hasn't come up in this word list, but that is how I would do it if I identify a borrowing from Nahuatl into Cahitan, for example.
And one should try to avoid missing data. In this case, it is better to not use a language, if one has low mutual coverage. We discussed this in our Sino-Tibetan study.
Reference here. There's a PDF online (easy to find, otherwise send an email and I share it).
But sometimes they are the languages one is interested in... But I did exclude Tubar and Opata from this list for that reason.
So how we count in lingpy is as I described above: we compare only items attested in both languages and average over multiple cognate matches.
But I think one can prove that it doesn't make that big of a difference.
It is more important to keep one's data in such a clean state that one doesn't need to do lexicostatistics with UPGMA, but can do more complex phylogenetic studies. NeighborNets are nice for comparison, but even here the preferred way is to go for a binarized representation of presence or absence of cognate sets.
That makes a lot of sense.
Depending on how many cases there are, it may even be possible to manually assign them. But in principle, this dataset has partial cognates, as indicated by `A/B` in the cognates.tsv file, while corresponding cognates in the words themselves are not marked. If there are just a few cases, one could catch them in the code.