LinguList closed this issue 7 months ago.
I have changed my scripts in such a way that correspondence patterns are not repeated according to frequency, and only correspondence patterns with at least 3 occurrences are used. The results for Bayesian phylogenetic inference are now:
| Row | method          | gqd      |
|-----|-----------------|----------|
| 1   | correspondences | 0.386842 |
| 2   | combined        | 0.251401 |
| 3   | cognates        | 0.252252 |
So - correspondence patterns are still no good, but the combined method is now about as good as cognate classes alone.
Interesting. So they do less harm now. From a theoretical perspective -- notwithstanding some artifacts of the methods -- this is much safer now.
I also repeated the ML experiments with the new datasets (without frequency information). The results look rather interesting:
Without frequency information:

| method          | gqd (median) |
|-----------------|--------------|
| correspondences | 0.372970     |
| combined        | 0.243318     |
| cognate         | 0.283831     |
So even though the correspondences lead to trees with the highest distances to the reference, the combined data performs best. With the old MSAs (with frequency information) it looked like this:
With frequency information:

| method          | gqd (median) |
|-----------------|--------------|
| correspondences | 0.3591745    |
| combined        | 0.35156      |
| cognate         | 0.283831     |
That is actually a nice result. So we can conclude (weakly):
The message is nice for linguists, as they are usually sad if you tell them they should ignore sound change / sound correspondences. So we have this conciliatory tone in the story.
@gerhardJaeger, what do you think, can we start writing this up from here? I'd then do a first pass on the text and set everything up.
I'll put in the new numbers, and then the ball is in your court.
Thanks! Then I will go ahead, and later I'll ask @luisevonderwiese to fill in her methods / numbers?
Yes sure I can do that! Just let me know when you are ready
I have some other interesting remarks: In my former experiments, I used MSAs which also do not contain the frequency information, but which contain all patterns, not only those with a frequency > 2. This led to the following results:
| method          | gqd (median) |
|-----------------|--------------|
| correspondences | 0.3591745    |
| combined        | 0.338781     |
| cognate         | 0.283831     |
So this filtering step really makes a difference.
Further, I also considered the difficulty predicted with Pythia (https://github.com/tschuelia/PyPythia):

| method          | difficulty (median) |
|-----------------|---------------------|
| correspondences | 0.29                |
| combined        | 0.14                |
| cognate         | 0.195               |
Combining both data types leads to a dataset with a clearer signal, which is also interesting.
These results are very cool. Such a nice story, also considering the low-frequency patterns: they add noise rather than helping to resolve the trees, so we profit from the filtering, and in combination with cognates this can (slightly) enhance trees in general.
I have replaced the numbers and reformulated the text in the latex file (both on github and in overleaf). Mattis, the paper is all yours. :-)
With respect to the filtering (i.e., selecting only sound sites with a frequency > 2), it would be good if we had an actual criterion for sound site selection, as we computer scientists hate ad hoc thresholds and cut-offs. One simple criterion would be some sort of search algorithm that tries to select sound sites such that the difficulty is minimized.
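A minimal sketch of what such a search could look like, assuming we have pattern frequencies and some black-box difficulty score. The `toy_difficulty` function below is an invented stand-in; in practice one would call Pythia on the alignment obtained after filtering:

```python
from collections import Counter

def sweep_thresholds(pattern_frequencies, predict_difficulty, thresholds=range(1, 11)):
    """Try each frequency cut-off and keep the one whose retained
    patterns yield the lowest predicted difficulty."""
    best = None
    for t in thresholds:
        kept = [p for p, f in pattern_frequencies.items() if f >= t]
        if not kept:
            continue
        score = predict_difficulty(kept)
        if best is None or score < best[1]:
            best = (t, score)
    return best  # (threshold, difficulty)

# Toy stand-in: pretend rare patterns add noise, so difficulty drops
# once patterns with frequency < 3 are removed.
freqs = Counter({"p1": 12, "p2": 7, "p3": 3, "p4": 1, "p5": 1})
toy_difficulty = lambda kept: 0.1 + 0.05 * sum(1 for p in kept if freqs[p] < 3)
print(sweep_thresholds(freqs, toy_difficulty))  # → (2, 0.1)
```

This is just a grid search; a greedy per-site selection would be a natural refinement, at the cost of many more difficulty evaluations.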
I understand your feeling, but it is important to add that we have studied the patterns for a long time now, and while it is not optimal to use such a hard cut-off, linking the criterion to phylogenies or other criteria is much more difficult than it seems. For the purpose of the current study, we have run many experiments with varying thresholds in a previous paper that I'd cite for now.
But of course: running additional studies -- if this is not too time-consuming -- with varying frequency cut-offs is probably very useful. Additionally, one could check the influence of considering only vowels or only consonants, as they are also marked in the files (column Structure, where we distinguish c(onsonant) and v(owel)).
I understand that you understand the data :-) Luise and I had a meeting with Gerhard today and we agreed that we will do some additional experiments on this, but not necessarily include them in the paper as the timeline is pretty tight already.
I am happy if you do that. I have been trying for a long time to figure out how to handle these things. If you find some independent way to tell us where best to set the threshold T, that would be the holy grail for this question!
I ran my analyses using the Gamma model for rate heterogeneity. This yields a shape parameter alpha ranging from 0 to 100. The lower alpha, the higher the rate heterogeneity in the dataset. Alpha is distributed in a rather unusual way: it is either very high or very low for most of the datasets:
| ds_id              | cognate_classes | correspondences | combined  |
|--------------------|-----------------|-----------------|-----------|
| walworthpolynesian | 1.332551        | 4.233181        | 1.624137  |
| constenlachibchan  | 0.592429        | 99.870825       | 4.178381  |
| crossandean        | 1.242856        | 6.333836        | 1.153854  |
| robinsonap         | 99.868657       | 15.268781       | 3.485770  |
| zhivlovobugrian    | 99.850351       | 4.244159        | 3.133747  |
| hattorijaponic     | 99.847976       | 99.897067       | 99.889967 |
| felekesemitic      | 1.061707        | 7.430331        | 2.692842  |
| houchinese         | 2.357385        | 6.120207        | 4.195315  |
| dravlex            | 0.701598        | 4.300915        | 2.233731  |
| leekoreanic        | 8.316283        | 8.420485        | 3.284162  |
(When I have a look at the alpha values for other cognate datasets from Lexibank, the distribution looks similar.) So far I have not been able to find other properties of the data that would explain alpha being high or low. This is why I would like to ask you whether you have any ideas about what the datasets with high/low alpha have in common, as you are more familiar with the data and its background than I am. Thank you!
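To illustrate what these alpha values mean in practice, here is a small sketch (not part of the analysis pipeline) that discretises the Gamma rate distribution into categories the way phylogenetic software commonly does, using SciPy. With a low alpha the per-site rates spread far apart; with a very high alpha they all collapse towards 1, i.e., near rate homogeneity:

```python
import numpy as np
from scipy.stats import gamma

def discrete_gamma_rates(alpha, k=4):
    """Median-based discretisation of a Gamma(alpha, 1/alpha) rate
    distribution into k rate categories (an approximation in the
    spirit of Yang's 1994 scheme)."""
    quantiles = np.array([(2 * i + 1) / (2 * k) for i in range(k)])
    rates = gamma.ppf(quantiles, a=alpha, scale=1 / alpha)
    return rates / rates.mean()  # normalise so the mean rate is 1

print(discrete_gamma_rates(0.6))   # low alpha: strongly unequal rates
print(discrete_gamma_rates(99.9))  # high alpha: rates all close to 1
```

The bimodal alpha values in the table thus mean that some datasets behave as if sites evolve at wildly different rates, while others look almost rate-homogeneous.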
Just for your reference, this is how alpha is distributed on a huge collection of empirical molecular datasets:
https://github.com/angtft/RAxMLGroveScripts/blob/main/figures/test_ALPHA.png
This is from this paper here:
https://academic.oup.com/bioinformatics/article/38/6/1741/6486526?login=false
So the extreme distribution and the outliers in terms of alpha values that we observed for the language datasets are very weird.
Yes, very weird. I don't see a pattern. Should we compute some basic characteristics of all datasets? I can offer average phonetic distances between words and between cognates, number of languages, word lengths, etc., and we see if we find ANY that explains the data? Or is this unscientific, as we explore and will eventually find something?
Yes, I would be interested in the characteristics you mention. Can you provide them for all cognate datasets in Lexibank, or only for the ones selected in this repository here? Thanks a lot!
My hypothesis was that this could somehow be associated with the subjective way by which linguists select sites for their datasets, maybe we should also check the authors of the datasets and see if there is some pattern.
The authors are all different people, I think.
Let us start with the params for the data in our repo here, but we can then extend to lexibank later.
| Dataset            | Words | Concepts | Languages | Diversity | Distances | SoundsTotal | SoundsAverage | WordLength | WordLengthAverage | WordsPerLanguage |
|--------------------|-------|----------|-----------|-----------|-----------|-------------|---------------|------------|-------------------|------------------|
| hattorijaponic     | 1710  | 197      | 10        | 0.03      | 0.09      | 61          | 34.60         | 4.10       | 4.08              | 171.00           |
| houchinese         | 1816  | 139      | 15        | 0.05      | 0.17      | 113         | 42.53         | 5.57       | 5.53              | 121.07           |
| felekesemitic      | 2412  | 150      | 19        | 0.05      | 0.28      | 76          | 44.68         | 4.67       | 4.65              | 126.95           |
| constenlachibchan  | 1214  | 106      | 24        | 0.10      | 0.42      | 65          | 20.88         | 3.10       | 3.08              | 50.58            |
| zhivlovobugrian    | 1879  | 110      | 20        | 0.04      | 0.20      | 66          | 32.55         | 3.44       | 3.45              | 93.95            |
| dravlex            | 1341  | 100      | 20        | 0.06      | 0.24      | 68          | 35.90         | 3.93       | 3.82              | 67.05            |
| walworthpolynesian | 6113  | 207      | 31        | 0.05      | 0.23      | 44          | 20.65         | 4.07       | 4.07              | 197.19           |
| robinsonap         | 1424  | 216      | 13        | 0.03      | 0.12      | 39          | 23.69         | 3.77       | 3.76              | 109.54           |
| leekoreanic        | 1960  | 205      | 14        | 0.01      | 0.06      | 40          | 36.71         | 3.91       | 3.91              | 140.00           |
| crossandean        | 2637  | 150      | 19        | 0.03      | 0.15      | 65          | 28.89         | 4.22       | 4.22              | 138.79           |
The Python script info.py computes these values from the trimmed data now. For Lexibank, it would have to be adjusted.
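For readers without access to the repo, here is a rough sketch of how a few of these characteristics could be computed from a `{language: [segmented words]}` mapping. This is my own illustration, not the actual code of info.py, and the toy wordlist is invented:

```python
from statistics import mean

def characteristics(wordlist):
    """Compute a few of the summary statistics discussed above.
    `wordlist` maps each language to a list of words, where each
    word is a tuple of sound segments."""
    all_words = [w for words in wordlist.values() for w in words]
    sounds = {s for w in all_words for s in w}  # distinct sounds overall
    return {
        "Languages": len(wordlist),
        "SoundsTotal": len(sounds),
        "WordLengthAverage": mean(len(w) for w in all_words),
        "WordsPerLanguage": mean(len(words) for words in wordlist.values()),
    }

toy = {
    "lang_a": [("k", "a", "t"), ("h", "u", "s")],
    "lang_b": [("k", "a", "t", "a"), ("h", "u")],
}
print(characteristics(toy))
```

The distance-based columns (Diversity, Distances) would additionally need a phonetic alignment or edit-distance measure between word pairs, which is omitted here.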
Thanks a lot! I checked it out and calculated the Pearson correlation between the alpha values of the different data types and the provided properties:

| alpha                 | property          | Pearson correlation | p-value |
|-----------------------|-------------------|---------------------|---------|
| alpha_cognate_classes | Concepts          | 0.27                | 0.45    |
| alpha_cognate_classes | Languages         | -0.50               | 0.14    |
| alpha_cognate_classes | Diversity         | -0.41               | 0.24    |
| alpha_cognate_classes | Distances         | -0.42               | 0.23    |
| alpha_cognate_classes | SoundsTotal       | -0.28               | 0.43    |
| alpha_cognate_classes | SoundsAverage     | -0.14               | 0.71    |
| alpha_cognate_classes | WordLength        | -0.31               | 0.38    |
| alpha_cognate_classes | WordLengthAverage | -0.30               | 0.40    |
| alpha_cognate_classes | WordsPerLanguage  | 0.06                | 0.87    |
| alpha_correspondences | Concepts          | -0.03               | 0.93    |
| alpha_correspondences | Languages         | -0.17               | 0.63    |
| alpha_correspondences | Diversity         | 0.33                | 0.36    |
| alpha_correspondences | Distances         | 0.26                | 0.47    |
| alpha_correspondences | SoundsTotal       | -0.05               | 0.89    |
| alpha_correspondences | SoundsAverage     | -0.29               | 0.42    |
| alpha_correspondences | WordLength        | -0.38               | 0.28    |
| alpha_correspondences | WordLengthAverage | -0.38               | 0.28    |
| alpha_correspondences | WordsPerLanguage  | -0.13               | 0.72    |
| alpha_combined        | Concepts          | 0.30                | 0.40    |
| alpha_combined        | Languages         | -0.51               | 0.13    |
| alpha_combined        | Diversity         | -0.28               | 0.43    |
| alpha_combined        | Distances         | -0.37               | 0.30    |
| alpha_combined        | SoundsTotal       | -0.03               | 0.93    |
| alpha_combined        | SoundsAverage     | 0.11                | 0.77    |
| alpha_combined        | WordLength        | 0.01                | 0.98    |
| alpha_combined        | WordLengthAverage | 0.01                | 0.97    |
| alpha_combined        | WordsPerLanguage  | 0.38                | 0.29    |
To me it seems as if there is nothing really significant, unfortunately.
(@LinguList I had to remove the `raise` in l. 20 of your info.py script in order to get it to run.)
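For reference, such correlations can be computed with `scipy.stats.pearsonr`. A tiny sketch with purely illustrative toy numbers (not values from the study):

```python
from scipy.stats import pearsonr

# Toy data: alpha values of five hypothetical datasets and one
# dataset property; the real analysis loops over all properties
# and all three data types.
alpha = [1.3, 0.6, 99.9, 2.4, 8.3]
prop = [10, 24, 13, 15, 14]

r, p = pearsonr(alpha, prop)
print(f"r = {r:.2f}, p = {p:.2f}")
```

Note that with only 10 datasets, even a correlation of |r| around 0.5 does not reach significance, which matches the high p-values in the table.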
Yes, sorry, I used that first, for testing, then forgot to push the up-to-date version.
It shows that there is not really an easy solution to all of this, but that several random factors seem to contribute. We could check against general scores such as the delta score from splits networks, or something similar; I don't know if it is reticulation that is causing these problems.
@luisevonderwiese and @gerhardJaeger, as discussed via email, we should now use each correspondence pattern only once, without the frequency information, but please filter out irregular patterns by only considering those patterns that occur more than n times. So you check the frequency in the files and take, e.g., only those appearing more than 2 times (3 occurrences is a good number in my opinion), and then run your experiments. In this way, we account for the idea of regularity here.
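A minimal sketch of this filtering step, assuming patterns are given as a flat list of occurrences; the string representation of a pattern is hypothetical, the point is only the counting and thresholding:

```python
from collections import Counter

def regular_patterns(patterns, min_occurrences=3):
    """Keep each correspondence pattern exactly once, but only if it
    occurs at least `min_occurrences` times in the data; a rough
    sketch of the filtering step described above."""
    counts = Counter(patterns)
    return sorted({p for p, n in counts.items() if n >= min_occurrences})

# Toy occurrences of three sound-correspondence patterns.
sites = ["p:p:f", "p:p:f", "p:p:f", "t:t:θ", "t:t:θ", "k:k:x"]
print(regular_patterns(sites))  # → ['p:p:f'] (only it occurs ≥ 3 times)
```

With `min_occurrences=3` this implements exactly the "frequency > 2" cut-off discussed in the thread, while each surviving pattern enters the alignment once rather than once per occurrence.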