lexibank / pharaocoracholaztecan

CLDF dataset derived from Pharao Hansen's "Investigation of the Relation between Proto-Náhuatl and Proto-Corachol" from 2020
Creative Commons Attribution 4.0 International

Check ambiguous cognate sets #6

Closed: LinguList closed this issue 4 years ago

LinguList commented 4 years ago

Depending on how many cases there are, it may even be possible to assign them manually. But in principle, this dataset has partial cognates, as indicated by A/B in the cognates.tsv file, while the corresponding cognates are not marked in the words themselves. If there are just a few cases, one could catch them in the code.

Maunus commented 4 years ago

I don't understand what you mean exactly by "manually assign" them and "catch them in the code".

LinguList commented 4 years ago

Check lexibank_pharaocoracholaztecan.py. There I wrote code that essentially parses the word document, catches newlines inside the table (I converted the table to plain text, but had to deal with multiple newlines inside the same table), and also identifies concepts, etc.

This code allows us to check certain things explicitly (which is what I call "manually"). It has the advantage that we do not have to touch the original data: it has been published as is, and it makes more sense not to touch it anymore (only if you write a new paper and do more codings).
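For illustration only, here is a purely hypothetical sketch of the kind of clean-up described above; it is not the actual code in lexibank_pharaocoracholaztecan.py, and it assumes that every real table row starts with a concept number like "12.":

```python
import re

def merge_wrapped_rows(lines):
    """Glue continuation lines (caused by newlines inside table cells in the
    plain-text export) back onto the row they belong to. Assumes every real
    row starts with a concept number such as "12." at the line start."""
    rows = []
    for line in lines:
        stripped = line.strip()
        if re.match(r"^\d+\.", stripped):
            rows.append(stripped)          # a new table row begins here
        elif rows and stripped:
            rows[-1] += " " + stripped     # continuation of the previous cell
    return rows
```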

LinguList commented 4 years ago

@maunus, I have now checked the cognates again. There are some cases that are not clear to me (I refer to cognates.tsv extracted from your Excel sheet).

  1. Do you make a distinction between a and A, as I find in row 3?
  2. What is the difference between ? and -, both occurring in row 10, for example?
  3. I understand the A/B structure, but in one case you have a/(B) (row 50), in another case you have A/(B) (row 54), and in one case you have C D (is the latter C/D?).
  4. What about the cases of ab in line 68?

LinguList commented 4 years ago

I have a concrete proposal for how to cope with this.

If you check the following examples, there are not many ambiguous cases:

{
  "A(B)": ["A"],
  "A/(B)": ["A"],
  "A/B": ["A", "B"],
  "A/B/C": ["A", "B", "C"],
  "A/B/D": ["A", "B", "D"],
  "A/B?": ["A"],
  "A/C": ["A", "C"],
  "B/(A)": ["A"],
  "B/(a)": ["B"],
  "B/C": ["B", "C"],                    
  "C D": ["C", "D"],
  "C/(B)": ["C"],
  "C/B": ["C", "B"],
  "C/E": ["C", "E"],
  "D/B": ["D", "B"],
  "a/(B)": ["a"],
  "a/A": ["a", "A"],
  "a/B": ["a", "B"],
  "ab": ["ab"],
}

The data is provided as a Python dictionary (or JSON data structure) here. You can see how I distinguish the cases: C/B counts as two elements, but a/(B) counts as one element only, assuming that you also did not count the parenthesized element in your nexus file.

If a form has two cognates, we provide the word form twice. This is not best practice, but we tolerate it for now, as this is also an example dataset to show you how to do cognate annotation in a more consistent and transparent way with additional tools and long table formats.
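As a minimal sketch (not the actual lexibank code), this is how such a mapping could be applied when building the long table: each entry is expanded to one row per cognate set, so a form with two cognates appears twice. COGNATE_MAP and expand are hypothetical names used only for illustration.

```python
# Hypothetical sketch, not the code in lexibank_pharaocoracholaztecan.py.
COGNATE_MAP = {
    "A/B": ["A", "B"],
    "a/(B)": ["a"],
    "C D": ["C", "D"],
    # ... the remaining entries from the dictionary above
}

def expand(form, code):
    """Return one (form, cognate set) row per cognate set the code resolves to."""
    return [(form, cogid) for cogid in COGNATE_MAP.get(code, [code])]

# expand("someform", "A/B") -> [("someform", "A"), ("someform", "B")]
```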

If you want to modify parts of the decisions I made here, just point me to them here, or change them directly in the code.

Maunus commented 4 years ago

Most of those differences are really just information for myself about the more detailed structure of the cognates: A and a are different versions of the same cognate root, whereas B and b would be two versions of another root. I haven't actually used this in the analysis but just treated a/A as the same. I would like to use it, though, since it would give a more fine-grained structure of shared roots and innovations (but it adds information about phonological changes, grammatical innovations, etc., so I don't know if it really belongs).

LinguList commented 4 years ago

Okay. If our initial goal is just to make it possible to derive the nexus or the distances file as it underlies your paper, we'd then say: lower case and upper case are the same, right? For all purposes going beyond this, for additional analyses, I recommend starting from the wordlist file that is submitted in examples, and loading it into edictor. It has several advantages: first, it shows the long table format we use, which allows annotating cognates and words in the same table; second, you can just git-clone this repository and then open the file in edictor. You can annotate cognates, etc., and use this for future studies (and I can always help if there are problems).

Maunus commented 4 years ago

  1. The distinction between a and A is that they are form variants of the same cognate root, so the varieties that have a show a shared innovation to the root.

I think in the first rows I was trying to keep ? and - apart as two different kinds of missing data: one when there is no data in the sources, and the other when the extant sources do not allow us to reconstruct a form for PCN. But it seems that in the lower rows I abandoned this distinction (as I probably realized it makes no difference to the analysis). I think we should probably just have "?" for "unknown" across the board.

In row 50, Cora and Huichol have a compound root combining A+B; Nahuatl has A, but also root B, though in another meaning, so it shouldn't figure under "navel". Other UA has only root A. So the meaning of the parentheses is that the root is there, but it shouldn't count in the analysis (basically extra information for us, but irrelevant to the computation). C D is supposed to be C/D.

In line 68 it seems they all ought to be capitals, AB and A and B, since there isn't any distinction between a and A, or b and B.

Maunus commented 4 years ago

Yes, lower case and upper case should be treated the same in the nexus file, and anything in () should be ignored.
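For illustration, a small sketch of that rule (a hypothetical helper, not code from the repository); note that "ab" still needs the explicit mapping to ["A", "B"] agreed on above, which this simple rule would not catch:

```python
import re

def normalize(code):
    """Upper-case everything, ignore anything in parentheses, split on "/"
    (or a space, for "C D"), and drop uncertain ("?") or missing ("-") parts."""
    code = re.sub(r"\([^)]*\)", "", code)          # anything in () is ignored
    parts = re.split(r"[/\s]+", code)
    return [p.upper() for p in parts
            if p and not p.endswith("?") and p not in {"?", "-"}]

# normalize("a/(B)") -> ["A"]; normalize("C D") -> ["C", "D"]; normalize("A/B?") -> ["A"]
```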

Maunus commented 4 years ago

And yes, I want to start learning edictor once we are done with this part. I want to use it for my Nahuatl dialect database.

Maunus commented 4 years ago

I think the one change I would make to your proposal is: "ab": ["A", "B"].

Maunus commented 4 years ago

Should I be cleaning the cognates.tsv file now? Or will that screw up the stuff you have already been extracting from it?

Maunus commented 4 years ago

If we just delete the stuff in () and change all the lower case into capitals, we could dispense with the extra code. The information it represents is really only useful for qualitative purposes.

LinguList commented 4 years ago

Rather not clean it; we better cover it from the code. Since this is "officially published", we post-edit it rather than touching the original source.

LinguList commented 4 years ago

All done already. There is no extra code, just a mapping, so it is better to leave it like this and keep the original data intact.

Maunus commented 4 years ago

Ok, we keep it as is then. Though I feel the version here is in a way a more "official publication" than the PDF on my website, and I would like it to be better.

Maunus commented 4 years ago

Ok, in the distances.dst file there are more decimal places than I operated with. Where do they come from?

It is hard to compare with the languages in a different order.

I didn't include the proto-languages in my distance matrix, and for the distance number I simply counted the number of cognates out of 100, so I got 0.65 for Cora/Huichol.

Here is the matrix I used: lexical distance matrix

Maunus commented 4 years ago

And here is the one at distances.dst compared with the one I used in Splitstree

I can't really figure out how to compare the two tables. The numbers are inverted, right? So Cora/Huichol gives a distance of 0.3579, but 65/100 shared forms. In the distance matrix I used, when I put it into Splitstree, I put 0.35 there (just taking 1 minus 65/100).
nexus distances

github matrix

LinguList commented 4 years ago

Cognate counting is a tricky business.

There are several ways to count, and often, it is not clear which version one uses.

E.g., you have missing data: how do you count?

How do you count shared cognates?

Our standard calculation in lingpy only compares items that exist in both languages. Furthermore, in case of multiple matches it averages: if you have A/B, it will give 0.5 to shared A and 0.5 to shared B, etc.

LinguList commented 4 years ago

Excluding languages is trivial; I just have to adjust the script.

Maunus commented 4 years ago

Ok, so that does change the outcome a bit, and accounts for the decimal differences. Now I want to see what the network looks like with those figures.

LinguList commented 4 years ago

Here's the count of shared cognates (ignoring meanings):

| Language 1 | Language 2 | Count |
|---|---|---|
| Cahita | Cora | 33 |
| Cahita | Huichol | 36 |
| Cahita | Tarahumaran | 66 |
| Cahita | Tepiman | 57 |
| Cora | Huichol | 63 |
| Cora | Tarahumaran | 26 |
| Cora | Tepiman | 34 |
| Huichol | Tarahumaran | 32 |
| Huichol | Tepiman | 35 |
| Tarahumaran | Tepiman | 51 |

LinguList commented 4 years ago

So there are differences, but it is hard to tell why.

Maunus commented 4 years ago

Oh, I didn't exclude proto-Nahua, by the way. That is important.

It is 2 cognates lower for Cora/Huichol.

Maunus commented 4 years ago

Some of the differences are really large.

LinguList commented 4 years ago

Wait, I found the bug. We forgot to account for upper-casing the "a" etc.

LinguList commented 4 years ago

| Language A | Language B | Count |
|---|---|---|
| Cahita | Cora | 45 |
| Cahita | Huichol | 51 |
| Cahita | Tarahumaran | 68 |
| Cahita | Tepiman | 60 |
| Cora | Huichol | 67 |
| Cora | Tarahumaran | 40 |
| Cora | Tepiman | 43 |
| Huichol | Tarahumaran | 46 |
| Huichol | Tepiman | 44 |
| Tarahumaran | Tepiman | 55 |

Maunus commented 4 years ago

Excellent. Can you include proto-Nahuan in the list of shared cognates?

LinguList commented 4 years ago

| Language 1 | Language 2 | Count |
|---|---|---|
| Cahita | Cora | 45 |
| Cahita | Huichol | 51 |
| Cahita | Tarahumaran | 68 |
| Cahita | Tepiman | 60 |
| Cahita | ProtoNahua | 54 |
| Cora | Huichol | 67 |
| Cora | Tarahumaran | 40 |
| Cora | Tepiman | 43 |
| Cora | ProtoNahua | 58 |
| Huichol | Tarahumaran | 46 |
| Huichol | Tepiman | 44 |
| Huichol | ProtoNahua | 57 |
| Tarahumaran | Tepiman | 55 |
| Tarahumaran | ProtoNahua | 44 |
| Tepiman | ProtoNahua | 49 |

LinguList commented 4 years ago

BTW: the numbers still differ, since you counted shared cognates only once PER slot, so AB in one language and AB in another would only count one time. This is a bit inconsistent, since you also counted AB vs. A as one match. The count here (also easier to code on the fly) just counts all shared cognate sets, and I checked with Cora vs. Huichol, where you find two ABs, so this accounts for 65 + 2 = 67.
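To make the difference between the two counting schemes concrete, here is a tiny illustration with made-up mini data (with the real data, Cora/Huichol has 65 matched slots vs. 67 shared cognate sets):

```python
# Each list holds, per concept, the set of cognate sets a language shows.
lang1 = [{"A", "B"}, {"A"}, {"C"}]
lang2 = [{"A", "B"}, {"B"}, {"C"}]

slots_with_a_match = sum(1 for a, b in zip(lang1, lang2) if a & b)       # 2
shared_cognate_sets = sum(len(a & b) for a, b in zip(lang1, lang2))      # 3
```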

LinguList commented 4 years ago

We have a NOTE.md file on GitHub where one can add custom comments. So you could do that and explain a bit more, if you want; e.g., your matrix would be useful there. And we can also add your nexus file here directly.

Maunus commented 4 years ago

Great, thanks! There are some odd shifts: for example, Cora/Nahuatl now has 58 where I originally counted 53, and Nahuatl/Huichol has 57 where I counted 56.

It seems most numbers are higher. Does it count A/B vs. A/B as a single match or as a double match?

Maunus commented 4 years ago

I found the suggestion in this article to make a lot of sense: it suggests counting percentages of shared vocabulary not out of the 100 but only out of the potential cognates. So when there is missing data the number of potential cognates falls, and when there are double cognates it rises above 100. Is this something we could/should do?

Haugen, Jason D., Michael Everdell, and Benjamin A. Kuperman. "Uto-Aztecan Lexicostatistics 2.0." International Journal of American Linguistics 86, no. 1 (2020): 1-30.
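One possible reading of that suggestion, as a rough sketch (not the procedure from the paper): relate shared cognate sets only to the potential cognates, i.e. slots where both languages have data, with doublets enlarging that denominator.

```python
def shared_percentage(cogs_a, cogs_b):
    """cogs_a, cogs_b: lists of sets of cognate set IDs, one per concept
    (an empty set marks missing data)."""
    potential = shared = 0
    for a, b in zip(cogs_a, cogs_b):
        if not a or not b:                  # missing data: no potential cognate
            continue
        potential += max(len(a), len(b))    # doublets push this above the concept count
        shared += len(a & b)
    return 100 * shared / potential if potential else 0.0
```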

LinguList commented 4 years ago

See my note above on AB counting.

Maunus commented 4 years ago

Yes, I read it after typing.

LinguList commented 4 years ago

Well, you know, with cognate counting I would say: there are so many ways that it won't make much difference. The most important thing is to make it standardized, make it transparent how you count, or use code that always does the same.

Maunus commented 4 years ago

I think the point in that article is that since some of the UA languages have very little documentation, the missing data can skew the numbers quite a bit.

LinguList commented 4 years ago

The debate is very long; more advanced is the technique by Starostin (whom not many read), and they have a standardized procedure.

LinguList commented 4 years ago

He's the first who also said that borrowings should count as missing data.

Maunus commented 4 years ago

Ah, that is interesting. This hasn't come up in this word list, but that is how I would do it if I identified a borrowing from Nahuatl into Cahitan, for example.

LinguList commented 4 years ago

And one should try to avoid missing data. In this case, it is better not to use a language if it has low mutual coverage. We discussed this in our Sino-Tibetan study.

LinguList commented 4 years ago

Reference here. There's a PDF online (easy to find; otherwise send an email and I'll share it).

Maunus commented 4 years ago

But sometimes they are the languages one is interested in... But I did exclude Tubar and Opata from this list for that reason.

LinguList commented 4 years ago

So how we count in lingpy is (a sketch follows after the list):

  1. determine slots where both have a word
  2. take this sublist as 100%
  3. count how many cognate sets are shared, if you have synonyms, take proportions (!)
  4. divide this number by the length of the sublist
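Here is a minimal sketch of these four steps, assuming each language is represented as a dict from concepts to the set of cognate sets attested for that concept; it is not the actual lingpy implementation, and the proportional credit for synonyms in step 3 is one plausible way to "take proportions".

```python
def shared_score(cogs_a, cogs_b):
    # 1. determine the slots where both languages have a word
    slots = [c for c in cogs_a if c in cogs_b and cogs_a[c] and cogs_b[c]]
    if not slots:
        return 0.0
    hits = 0.0
    for c in slots:
        a, b = cogs_a[c], cogs_b[c]
        # 3. shared cognate sets, with proportional credit for synonyms
        hits += len(a & b) / max(len(a), len(b))
    # 2. + 4. the sublist counts as 100%: divide by its length
    return hits / len(slots)

# a lexicostatistical distance would then be 1 - shared_score(...)
```
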
LinguList commented 4 years ago

But I think one can prove that it doesn't make that big of a difference.

LinguList commented 4 years ago

It is more important to keep one's data in such a clean state that one doesn't need to do lexicostatistics with UPGMA, but can do more complex phylogenetic studies. Neighbornets are nice for comparison, but even here, the preferred way is to go for a binarized representation of the presence or absence of cognate sets.
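For illustration, a small sketch of such a binarized representation (a hypothetical helper, not repository code): every cognate set becomes one presence/absence character per language, which can then be exported to a binary nexus file.

```python
def binarize(cognates):
    """cognates: dict mapping language -> set of cognate set IDs (e.g. "navel-A").
    Returns the sorted list of characters and a 0/1 vector per language."""
    characters = sorted(set().union(*cognates.values()))
    matrix = {lang: [1 if c in sets else 0 for c in characters]
              for lang, sets in cognates.items()}
    return characters, matrix
```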

Maunus commented 4 years ago

That makes a lot of sense.