LinguList opened this issue 4 years ago.
Important: @Juunlee, could you please add a folder "scripts" to this repository and include the file with the code? The output should be placed in a folder "output".
Please let me know by replying to this issue how well this works and what you think about the detected cognates (they are partial; Deepadung's cognates are full cognates, so we don't have the same clustering).
In concept #2 (abdomen), PangKham /waʔ/ is very likely cognate with the other languages, but it is given its own CogID.
In concept #9 (breathe), NanSang /tohpʰəːm/ fails to be segmented into two CogIDs, while the same sounds in other languages are.
In concept #12 (cloud), even though the IPA /ŋʔuʔ/ is the same for e.g. BanPaw and ChaYeQing, it is split into two CogIDs in one language and only one in the other.
In concept #13 (cold), ChaYeQing /Kok/ is very likely cognate with the rest, but it's given its own CogID.
Concept #14 (cut) is given four CogIDs, whereas one should definitely suffice.
In concept #15 (die), NamHsan /jam/ is given its own CogID, even though it is identical to other languages.
Cognate_detection.py did a pretty good job of detecting cognates. It could use some fine-tuning to resolve problems such as the ones that we found.
You need to distinguish the cases where the algorithm simply judges differently from those where the original data are mistaken.
E.g., /tohpʰəːm/: if this word is not segmented, how should the algorithm assign it two CogIDs? The algorithm needs proper morpheme segmentation beforehand; there is no magic to split words.
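To illustrate (this is just an assumed tokenization, not the actual entry): the TOKENS/SEGMENTS field would need to contain an explicit morpheme boundary, e.g.

t o h + pʰ əː m

with "+" as the boundary marker (the same marker discussed at the end of this thread), before the algorithm can assign the two parts separate CogIDs.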
The other cases I need to check against the data. Can you paste them here? You can upload wordlists as zip folders to GitHub.
Yesterday Junsung pushed https://github.com/lexibank/deepadungpalaung/blob/master/output/deepadung-wordlist.tsv and we have been looking at it in EDICTOR.
We would be happy to make a subset of it showcasing the issues, but this would probably end up consisting of about 50% of the concepts, and in each case I think we would do well to include all the languages.
E.g., /tohpʰəːm/: if this word is not segmented, how should the algorithm assign it two CogIDs?
Junsung and I can go through the entries and add a space in the middle of words wherever it would be helpful. But should we do that in 100 item phylo sheet 1 in raw? And would it be in the spirit of automated phylogeny?
part.get_partial_scorer(runs=1000, threshold=0.85)  # make tests with 100 and 1000 when debugging
part.partial_cluster(method='lexstat', threshold=0.85, ref='cogids', cluster_method='infomap')
I found that increasing the threshold value in these two functions made the program more likely to consider words to be cognate with each other. This addresses the problems we had in concepts #2, #13, #14, and #15.
The first threshold, in get_partial_scorer, is different from the second one.
Raising the second threshold to 0.85 is not good, as we have had good experience with 0.55 across many different datasets. I suggest setting the first threshold (which selects the pairs that are good for sound correspondences) to 1.0 and keeping the other at 0.55.
0.85 is difficult to defend. Please also keep in mind that we only have 100 words; usually we'd need 300. So do not expect perfection from an algorithm!
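For concreteness, a minimal sketch of these settings (the wordlist path is a placeholder; the calls mirror the snippet above):

from lingpy.compare.partial import Partial

part = Partial('deepadung-wordlist.tsv')  # placeholder path to the segmented wordlist
part.get_partial_scorer(runs=1000, threshold=1.0)  # pairs used to infer sound correspondences
part.partial_cluster(method='lexstat', threshold=0.55,
                     ref='cogids', cluster_method='infomap')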
Alternatively, we can compare the partial cognates against the full gold-standard cognates as annotated by the author. If you want to do so, let me know, and I'll pass you some code.
Met with Junsung online tonight and we ran the code with the thresholds you recommended.
Unfortunately, it went back to not recognizing many cognates, whereas with .85/.85, it did very well.
Also interested in how to use the author's original cognate judgments, as you mentioned. Junsung and I looked through the original paper in Mon-Khmer Studies Volume 41 the other day, and I'm guessing that the Cognate Score numbers 1, 2, ... in each row indicate the cognate set for the given concept.
Trust me: if you need a threshold of 0.85 to get a result, the result is deceptive and something is wrong. Of course, with T=1.0 all words are cognate, etc. So let's do a test then, that is a good idea, and I'll provide some more information later (write me an email to remind me if you don't hear from me soon).
Given the code
part.get_partial_scorer(runs=100, threshold=i)
part.partial_cluster(method='lexstat', threshold=j, ref='cogids', cluster_method='infomap')
Junsung and I found that the following is the number of CogIDs assigned to "cut", concept #74 (#14 alphabetically), for choices of i by row from 0.5 to 1.0 and j by column from 0.5 to 0.85 inclusive, in steps of 0.05 (a sketch of such a grid search is given below the matrix):
[[5, 4, 4, 4, 4, 2, 1, 1],
[6, 5, 4, 3, 4, 4, 1, 1],
[5, 4, 5, 4, 2, 3, 1, 1],
[6, 3, 4, 3, 3, 3, 1, 1],
[3, 4, 4, 4, 2, 2, 2, 1],
[3, 4, 2, 3, 3, 1, 2, 2],
[4, 4, 3, 3, 4, 3, 1, 1],
[5, 4, 3, 3, 2, 2, 1, 2],
[4, 3, 3, 3, 3, 3, 1, 1],
[4, 5, 4, 3, 3, 3, 1, 1],
[5, 3, 3, 4, 2, 2, 1, 1]]
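The sketch below shows one way such a grid search could be written (untested; the wordlist path is a placeholder, and the concept label 'cut' may differ in the actual data):

from lingpy.compare.partial import Partial

results = []
for k in range(11):                                  # i = 0.50, 0.55, ..., 1.00 (rows)
    i = 0.5 + 0.05 * k
    part = Partial('deepadung-wordlist.tsv')         # placeholder path; reload for a fresh scorer
    part.get_partial_scorer(runs=100, threshold=i)
    row = []
    for m in range(8):                               # j = 0.50, 0.55, ..., 0.85 (columns)
        j = 0.5 + 0.05 * m
        ref = 'cogids_{0}_{1}'.format(k, m)
        part.partial_cluster(method='lexstat', threshold=j,
                             ref=ref, cluster_method='infomap')
        ids = set()                                  # distinct partial cognate IDs for "cut"
        for idx in part:
            if part[idx, 'concept'] == 'cut':
                ids.update(part[idx, ref])
        row.append(len(ids))
    results.append(row)
print(results)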
Since "cut" is attested as follows,
ɡәp ɡәp ɡәp kap kap kiap kap kəp kəp kăp kə̆p kakə̆p kep kjap kiap kiap
I think at most two CogIDs are defensible. So 0.7 seems to be the lowest we can set j with the code the way it is now, and increasing i seems to have little effect on the average number of cognates found per concept.
Can we stop the discussion about thresholds in LexStat? I have already tried to explain: with only 100 concepts, LexStat likely does not get enough data to infer reliable sound correspondences. We need to evaluate this against a gold standard, not against one concept whose cognacy you happen to be unhappy about. Please trust my experience here: either test for all concepts, or leave it, since you cannot optimize the code for one concept alone.
Try instead the following:
part = Partial(...)
part.partial_cluster(method='sca', threshold=0.45)
This method works on surface similarities rather than sound correspondences. The crucial difference is that we don't need the get_partial_scorer function.
With this method, you can compare the results. But don't use method='lexstat' on 100-item word lists. If the results look bad, one first has to test the other method.
Furthermore, let us already run a real test: reconstruct "normal cognates" and evaluate them against the gold standard:
part = Partial(...)
part.cluster(method='sca', ref="scacogid", threshold=0.45)
part.get_scorer(runs=10000)
part.cluster(method='lexstat', ref="lexstatcogid", threshold=0.55)
from lingpy.evaluate.acd import bcubes
bcubes(part, "cogid", "scacogid")
bcubes(part, "cogid", "lexstatcogid")
These will evaluate against the gold standard, so you get an impression of how the methods work. You can also loop over all thresholds here to see what happens if you are not happy with the results, but do it in a principled way, testing against the gold standard for all words, instead of doing this for one concept where you have a bad feeling.
Can we stop the discussion about thresholds in LexStat?
Gladly! I thought you wanted us to test them because you said:
Trust me: if you need a threshold of 0.85 to get a result, the result is deceptive and something is wrong. Of course, with T=1.0 all words are cognate, etc. So let's do a test then
and I didn't know there was an alternative. Junsung and I will try to apply this alternate method in our next session.
Usually, when one method doesn't work, it's hard for us to discover the alternate methods, because the class inheritance structure of the algorithms tends to go back a few steps through different packages, and the documentation has very few examples.
Usually, when one method doesn't work, it's hard for us to discover the alternate methods, because the class inheritance structure of the algorithms tends to go back a few steps through different packages, and the documentation has very few examples.
Well, we have several tutorials now where the methods are discussed; the SCA method is also described in my dissertation and in our PLOS paper from 2017 (along with three other methods), as is the evaluation with B-Cubes. This is exactly what I wanted to show you so that you know how to evaluate, which is also why I said: remind me if you want to know how to do this. Since there was no reminder, I did not follow up.
Anyway, now you have the method (all of this is also described at lingpy.org) and can test it with the original data as the gold standard.
When we run this code, we get the following error:
Traceback (most recent call last):
File "cognate_detection_new.py", line 36, in
I tried learning how to fix it by reading the corresponding files and their documentation, but couldn't figure it out.
In [6]: columns=('concept_name', 'language_id',
...: 'value', 'form', 'segments', 'language_glottocode', 'cogid_cognateset_id'
...: )
...: namespace=(('concept_name', 'concept'), ('language_id',
...: 'doculect'), ('segments', 'tokens'), ('language_glottocode',
...: 'glottolog'), ('concept_concepticon_id', 'concepticon'),
...: ('language_latitude', 'latitude'), ('language_longitude',
...: 'longitude'), ('cognacy', 'cognacy'),
...: ('cogid_cognateset_id', 'cog'))
In [7]: part = Partial.from_cldf(Dataset().cldf_dir.joinpath('cldf-metadata.json'), columns=columns, namespace=namespace)
In [8]: part.renumber('cog')
In [10]: from lingpy.evaluate.acd import bcubes
In [12]: part.partial_cluster(method='sca', threshold=0.45, ref='scaids')
In [13]: part.add_cognate_ids('scaids', 'scaid', idtype='strict')
In [15]: bcubes(part, 'cogid', 'scaid')
*************************
* B-Cubed-Scores *
* --------------------- *
* Precision: 0.8623 *
* Recall: 0.8409 *
* F-Scores: 0.8514 *
*************************
In [16]: part.add_cognate_ids('scaids', 'scalooseid', idtype='loose')
In [17]: bcubes(part, 'cogid', 'scalooseid')
*************************
* B-Cubed-Scores *
* --------------------- *
* Precision: 0.7687 *
* Recall: 0.9884 *
* F-Scores: 0.8648 *
*************************
Out[17]: (0.7687142160328286, 0.9883752276361237, 0.8648143564872715)
In [25]: part = Partial.from_cldf(Dataset().cldf_dir.joinpath('cldf-metadata.json'), columns=columns, namespace=namespace)
In [26]: part.renumber("cog")
In [27]: for i in range(20):
...: t = 0.05 * i
...: ts = 't_'+str(i)
...: part.partial_cluster(method='sca', threshold=t, ref=ts)
...: part.add_cognate_ids(ts, ts+'id', idtype='strict')
...: p, r, f = bcubes(part, 'cogid', ts+'id', pprint=False)
...: print('{0:.2f} {1:.4} {2:.4f} {3:.2f}'.format(t, p, r, f))
...:
0.00 0.9989 0.5525 0.71
0.05 0.9813 0.6577 0.79
0.10 0.9813 0.6937 0.81
0.15 0.9804 0.6972 0.81
0.20 0.9767 0.7091 0.82
0.25 0.9324 0.7433 0.83
0.30 0.9027 0.7630 0.83
0.35 0.9009 0.7725 0.83
0.40 0.8849 0.8323 0.86
0.45 0.8623 0.8409 0.85
0.50 0.8607 0.8435 0.85
0.55 0.8512 0.8528 0.85
0.60 0.8314 0.8530 0.84
0.65 0.8271 0.8539 0.84
0.70 0.816 0.8743 0.84
0.75 0.8068 0.8780 0.84
0.80 0.7937 0.8812 0.84
0.85 0.7874 0.8767 0.83
0.90 0.7866 0.8763 0.83
0.95 0.7866 0.8763 0.83
In [30]: for i in range(20):
...: t = 0.05 * i
...: ts = 't_'+str(i)
...: part.partial_cluster(method='sca', threshold=t, ref=ts)
...: part.add_cognate_ids(ts, ts+'id', idtype='loose')
...: p, r, f = bcubes(part, 'cogid', ts+'id', pprint=False)
...: print('{0:.2f} {1:.4} {2:.4f} {3:.2f}'.format(t, p, r, f))
...:
0.00 0.9336 0.7257 0.82
0.05 0.9079 0.8331 0.87
0.10 0.8981 0.8543 0.88
0.15 0.8972 0.8570 0.88
0.20 0.8934 0.8692 0.88
0.25 0.8488 0.8842 0.87
0.30 0.8286 0.9099 0.87
0.35 0.8263 0.9209 0.87
0.40 0.8037 0.9844 0.88
0.45 0.7687 0.9884 0.86
0.50 0.761 0.9884 0.86
0.55 0.7517 0.9894 0.85
0.60 0.7371 0.9894 0.84
0.65 0.732 0.9906 0.84
0.70 0.724 1.0000 0.84
0.75 0.7197 1.0000 0.84
0.80 0.7185 1.0000 0.84
0.85 0.716 1.0000 0.83
0.90 0.716 1.0000 0.83
0.95 0.716 1.0000 0.83
In [31]: part = Partial.from_cldf(Dataset().cldf_dir.joinpath('cldf-metadata.json'), columns=columns, namespace=namespace)
In [32]: part.renumber("cog")
In [33]: part.get_partial_scorer(runs=10000)
In [34]: for i in range(20):
...: t = 0.05 * i
...: ts = 't_'+str(i)
...: part.partial_cluster(method='lexstat', threshold=t, ref=ts)
...: part.add_cognate_ids(ts, ts+'id', idtype='strict')
...: p, r, f = bcubes(part, 'cogid', ts+'id', pprint=False)
...: print('{0:.2f} {1:.4} {2:.4f} {3:.2f}'.format(t, p, r, f))
...:
0.00 1.0 0.1669 0.29
0.05 0.999 0.3475 0.52
0.10 0.9989 0.5350 0.70
0.15 0.9961 0.5875 0.74
0.20 0.9859 0.6560 0.79
0.25 0.9653 0.6989 0.81
0.30 0.9549 0.7519 0.84
0.35 0.9371 0.7780 0.85
0.40 0.9267 0.7915 0.85
0.45 0.9004 0.8060 0.85
0.50 0.885 0.8143 0.85
0.55 0.8813 0.8299 0.85
0.60 0.8717 0.8384 0.85
0.65 0.8614 0.8487 0.86
0.70 0.8529 0.8519 0.85
0.75 0.8462 0.8540 0.85
0.80 0.843 0.8589 0.85
0.85 0.8333 0.8651 0.85
0.90 0.826 0.8678 0.85
0.95 0.8226 0.8660 0.84
In [36]: for i in range(20):
...: t = 0.05 * i
...: ts = 't2_'+str(i)
...: part.partial_cluster(method='lexstat', threshold=t, ref=ts)
...: part.add_cognate_ids(ts, ts+'id', idtype='loose')
...: p, r, f = bcubes(part, 'cogid', ts+'id', pprint=False)
...: print('{0:.2f} {1:.4} {2:.4f} {3:.2f}'.format(t, p, r, f))
...:
0.00 0.9945 0.1866 0.31
0.05 0.9732 0.4466 0.61
0.10 0.9515 0.6896 0.80
0.15 0.9291 0.7521 0.83
0.20 0.9203 0.8146 0.86
0.25 0.8848 0.8618 0.87
0.30 0.8697 0.9172 0.89
0.35 0.855 0.9339 0.89
0.40 0.841 0.9521 0.89
0.45 0.8172 0.9588 0.88
0.50 0.802 0.9646 0.88
0.55 0.7917 0.9777 0.87
0.60 0.7858 0.9818 0.87
0.65 0.772 0.9883 0.87
0.70 0.7629 0.9925 0.86
0.75 0.75 0.9937 0.85
0.80 0.7476 0.9937 0.85
0.85 0.7386 0.9965 0.85
0.90 0.7346 1.0000 0.85
0.95 0.729 1.0000 0.84
Dear @danbriggs, dear @Juunlee, please check the results. As you can nicely see here, the best threshold for cognates clusters around 0.4-0.6 for LexStat, depending on how you convert the partial cognates to non-partial cognates and compare them with the original data.
This proves my point that it is best to stick with 0.55 for LexStat (which has been explored on many datasets) and 0.45 for SCA.
Thank you for the detailed code and analysis. Junsung and I have copied and run a large portion of this code, and it works on our systems as well.
Reviewing the 100 concepts by hand, it seems like both the old method and the new method do a good job on almost all the concepts, with the new method doing slightly better. The old method was likely thrown off only by words with reduplication in them, such as BanPaw kakə̆p "cut" and NamhSan r̥әŋr̥әŋ "long." Both methods seem to be determined to give distinct ids to the different halves of the word, but the old method may have generated several more cogids for identical forms in other doculects because of this.
I am not sure how to pipe the output, which can be found at
https://github.com/lexibank/deepadungpalaung/blob/master/output/deepadung-wordlist-new.tsv
through to a phylogeny tree; the code we're trying to use can be found at
https://github.com/lexibank/deepadungpalaung/blob/master/scripts/sca_crosssemantic.py
but I have also copied it here:
alms = Alignments(fname+'deepadung-wordlist-new.tsv', ref='cogids')
print('[i] search for bad internal alignments')
find_bad_internal_alignments(alms)
print('[i] search for colexified alignments')
find_colexified_alignments(
    alms,
    cognates='cogids',
    segments='tokens',
    ref='crossids'
)
The error we get is as follows:
Traceback (most recent call last):
File "C:\Users\User\palaung4\lib\site-packages\lingpy\align\sca.py", line 663, in add_alignments
modify_ref=modify_ref)
File "C:\Users\User\palaung4\lib\site-packages\lingpy\basic\wordlist.py", line 468, in get_etymdict
cogIdx = self._header[ref]
KeyError: 'cogids'
If I change the first ref='cogids' to ref='scaids' I get the following error:
Traceback (most recent call last):
File "C:\Users\User\palaung4\lib\site-packages\lingpy\align\sca.py", line 663, in add_alignments
modify_ref=modify_ref)
File "C:\Users\User\palaung4\lib\site-packages\lingpy\basic\wordlist.py", line 480, in get_etymdict
cogid = f(cog)
ValueError: invalid literal for int() with base 10: '36 37'
If I change the first ref='cogids' to ref='scaid', that line goes through, but
find_bad_internal_alignments(alms)
doesn't, even if I add ref='scaid' or ref='scaids' to it.
The last two scripts we'll be trying to adapt to the results from the SCA method can be found at
https://github.com/lexibank/deepadungpalaung/blob/master/scripts/cd_correspondence.py
and
https://github.com/lexibank/deepadungpalaung/blob/master/scripts/cd_phylogeny.py
and we will call them sca_correspondence and sca_phylogeny when we do.
I think there is a misunderstanding: if you want to do phylogenies, ignore the alignments and the correspondence patterns, which is essentially what you are trying to do here. Just type:
part.calculate('tree', ref='cogid') # or scaid or what you like
print(part.tree.asciiArt())
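If you also want to pass the tree on to other software, it could be written out in Newick format; the following is an untested sketch that assumes the PyCogent-style tree API bundled with LingPy (getNewick()), with a placeholder output path:

part.calculate('tree', ref='scaid')  # or 'cogid', as above
with open('output/deepadung.nwk', 'w') as f:
    f.write(part.tree.getNewick())   # writes the tree topology as a Newick string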
Ah! I see. Since the files were named 1, 2, 3, 4, 5 and 6 in the tutorial accompanying Wu et al., and each of 2 through 5 was dependent on the previous one, I thought that 6_phylogeny was dependent on 5_correspondence or at least 4_crosssemantic, especially since it used crossids.tsv and mentioned the 'crossid' column a couple times.
But we are really looking at two distinct results both based on the results of 2_partial.py, one being the phylogenetic tree, and the other being the sound correspondences.
So far, Junsung and I have found five versions of the phylogenetic tree:
(1) using the method outlined in the tutorial, his computer and mine got slightly different results;
(2) using the snippet above with 'cogids', we got two more results;
(3) using 'scaids', I got one more result;
but the last three trees are very similar: the dialect clusters mentioned in Deepadung et al. show up as constituents; Ta-ang is the furthest from the other dialects, as mentioned in Deepadung et al.; and ChaYeQing moves around the most, and its position has a question mark in Deepadung et al.
We haven't analyzed the sound correspondences generated by 5_correspondence yet; we will do that soon.
I am guessing that applying an 'imnc' template to the data would require a fairly involved reworking of the code, because the data lack tones? Would it make sense for Junsung and me to just add 'tone 1 1' to all the morphemes in the data set and apply the method outlined in the tutorial accompanying Wu et al.?
Please forget about the sound correspondence patterns for now. They are also not needed to get the "crossids". They are an independent analysis that makes sense with more than 300 concepts, but we only have 100 here. The best thing is to turn the partial cognates into "normal cognates", as I have already shown you how to do, and then calculate a tree with some algorithm.
So, just to make this clear again: correspondence patterns are not needed for phylogenies, so you do not need them. We also make this clear in our workflow paper. They are needed for linguists who are interested in sound change.
This also means you have your workflow now; there is no need to do anything else. Alternatively, you can also check our alternative workflow for Polynesian, where we export the data to Nexus and can then analyze it with other software (see here).
If you want to inspect correspondence patterns, please open a new issue, and we will handle it from there.
To test the orthography profiles, we need to make sure that there are no empty morphemes in the data, which may result if a segmented word starts with a "+", ends with one, or has two "+" markers one after the other. This can be done directly by loading the data into lingpy:
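A minimal sketch of such a check (untested; the wordlist path is a placeholder, and it assumes a standard LingPy wordlist with DOCULECT, CONCEPT, and TOKENS columns):

from lingpy import Wordlist

wl = Wordlist('output/deepadung-wordlist.tsv')  # placeholder path
for idx in wl:
    tokens = wl[idx, 'tokens']
    # an empty morpheme shows up as a leading, trailing, or doubled "+" marker
    if not tokens or tokens[0] == '+' or tokens[-1] == '+' or any(
            a == '+' and b == '+' for a, b in zip(tokens, tokens[1:])):
        print(idx, wl[idx, 'doculect'], wl[idx, 'concept'], ' '.join(tokens))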
Ideally, this should not yield anything.
Another test is just to run a cognate detection analysis.
This can be done as follows (I did not test this, so @Juunlee, please test and fix errors if there are any):
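A sketch of what such a script could look like (untested; the input path is a placeholder, and the thresholds follow the values discussed earlier in this thread):

from lingpy.compare.partial import Partial

part = Partial('raw/deepadung-input.tsv')  # placeholder path to the segmented wordlist
part.get_partial_scorer(runs=1000, threshold=1.0)  # use runs=100 while debugging
part.partial_cluster(method='lexstat', threshold=0.55,
                     ref='cogids', cluster_method='infomap')
# the 'infomap' cluster method requires python-igraph
part.output('tsv', filename='output/deepadung-wordlist', prettify=False)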
The resulting file
deepadung-wordlist.tsv
can be inspected in EDICTOR. Please let me know by replying to this issue how well this works and what you think about the detected cognates (they are partial; Deepadung's cognates are full cognates, so we don't have the same clustering). The column is "COGIDS" (automated) vs. "COGID" (by the original author). You need to make sure to install
python-igraph
for the clustering.