LinguList closed this issue 7 months ago.
I have changed my scripts in such a way that correspondence patterns are not repeated according to frequency, and only correspondence patterns with at least 3 occurrences are used. The results for Bayesian phylogenetic inference are now:
| Row | method          | gqd      |
|-----|-----------------|----------|
| 1   | correspondences | 0.386842 |
| 2   | combined        | 0.251401 |
| 3   | cognates        | 0.252252 |
So - correspondence patterns are still no good, but the combined method is now about as good as cognate classes alone.
Interesting. So they do less harm now. From a theoretical perspective -- notwithstanding some artifacts of the methods -- this is much safer now.
I also repeated the ML experiments with the new datasets (without frequency information). The results look rather interesting:
Without frequency information:

| method          | gqd (median) |
|-----------------|--------------|
| correspondences | 0.372970     |
| combined        | 0.243318     |
| cognate         | 0.283831     |
So even though the correspondences lead to trees with the highest distances to the reference, the combined data performs best. With the old MSAs (with frequency information) it looked like this:
With frequency information:

| method          | gqd (median) |
|-----------------|--------------|
| correspondences | 0.3591745    |
| combined        | 0.35156      |
| cognate         | 0.283831     |
That is actually a nice result. So we can conclude (weakly):
The message is nice for linguists, as they are usually sad if you tell them they should ignore sound change / sound correspondences. So we have this conciliatory tone in the story.
@gerhardJaeger, what do you think, can we start writing this up from here? I'd then do a first pass on the text and set everything up.
I'll put in the new numbers, and then the ball is in your court.
Thanks! Then I will go ahead, and later I'll ask @luisevonderwiese to fill in her methods / numbers?
Yes sure I can do that! Just let me know when you are ready
I have some other interesting remarks: In my former experiments, I used MSAs which also do not contain the frequency information, but which contain all patterns, not only those with a frequency > 2. This led to the following results:
| method          | gqd (median) |
|-----------------|--------------|
| correspondences | 0.3591745    |
| combined        | 0.338781     |
| cognate         | 0.283831     |
So this filtering step really makes a difference.
Further, I also considered the difficulty predicted with Pythia (https://github.com/tschuelia/PyPythia):

| method          | difficulty (median) |
|-----------------|---------------------|
| correspondences | 0.29                |
| combined        | 0.14                |
| cognate         | 0.195               |
Combining both data types leads to a dataset with a clearer signal, which is also interesting.
These results are very cool. Such a nice story, also considering the low-frequency patterns: they add noise rather than helping to resolve the trees, so we profit from the filtering, and in combination with cognates this can (slightly) enhance trees in general.
I have replaced the numbers and reformulated the text in the latex file (both on github and in overleaf). Mattis, the paper is all yours. :-)
With respect to the filtering (i.e., selecting only sound sites with a frequency > 2), it would be good if we had an actual criterion for sound site selection, as we computer scientists hate ad hoc thresholds and cut-offs. One simple criterion would be some sort of search algorithm that tries to select sound sites such that the difficulty is minimized.
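A minimal sketch of what such a search could look like, assuming we have pattern frequencies and some black-box difficulty score. The `toy_difficulty` function below is an invented stand-in; in practice one would call Pythia on the alignment obtained after filtering:

```python
from collections import Counter

def sweep_thresholds(pattern_frequencies, predict_difficulty, thresholds=range(1, 11)):
    """Try each frequency cut-off and keep the one whose retained
    patterns yield the lowest predicted difficulty."""
    best = None
    for t in thresholds:
        kept = [p for p, f in pattern_frequencies.items() if f >= t]
        if not kept:
            continue
        score = predict_difficulty(kept)
        if best is None or score < best[1]:
            best = (t, score)
    return best  # (threshold, difficulty)

# Toy stand-in: pretend rare patterns add noise, so difficulty drops
# once patterns with frequency < 3 are removed.
freqs = Counter({"p1": 12, "p2": 7, "p3": 3, "p4": 1, "p5": 1})
toy_difficulty = lambda kept: 0.1 + 0.05 * sum(1 for p in kept if freqs[p] < 3)
print(sweep_thresholds(freqs, toy_difficulty))  # → (2, 0.1)
```

This is just a grid search; a greedy per-site selection would be a natural refinement, at the cost of many more difficulty evaluations.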
I understand your feeling, but it is important to add that we have studied the patterns for a long time now, and while it is not optimal to use such a hard cut-off, linking the criterion to phylogenies or other criteria is much more difficult than it seems. For the purpose of the current study, we have run many experiments with varying thresholds in a previous paper that I'd cite for now.
But of course: running additional studies -- if this is not too time-consuming -- with varying frequency cut-offs is probably very useful. Additionally, one could check the influence of considering only vowels or only consonants, as they are also marked in the files (column Structure, where we distinguish c(onsonant) and v(owel)).
I understand that you understand the data :-) Luise and I had a meeting with Gerhard today and we agreed that we will do some additional experiments on this, but not necessarily include them in the paper as the timeline is pretty tight already.
I am happy if you do that. I have been trying for a long time to figure out how to handle these things. If you find some independent way to tell us where best to set the threshold T, that would be the holy grail for this question!
I ran my analyses using the Gamma model for rate heterogeneity. This yields a shape parameter alpha ranging from 0 to 100. The lower alpha, the higher the rate heterogeneity in the dataset. Alpha is distributed in a rather unusual way: it is either very high or very low for most of the datasets:
| ds_id              | cognate_classes | correspondences | combined  |
|--------------------|-----------------|-----------------|-----------|
| walworthpolynesian | 1.332551        | 4.233181        | 1.624137  |
| constenlachibchan  | 0.592429        | 99.870825       | 4.178381  |
| crossandean        | 1.242856        | 6.333836        | 1.153854  |
| robinsonap         | 99.868657       | 15.268781       | 3.485770  |
| zhivlovobugrian    | 99.850351       | 4.244159        | 3.133747  |
| hattorijaponic     | 99.847976       | 99.897067       | 99.889967 |
| felekesemitic      | 1.061707        | 7.430331        | 2.692842  |
| houchinese         | 2.357385        | 6.120207        | 4.195315  |
| dravlex            | 0.701598        | 4.300915        | 2.233731  |
| leekoreanic        | 8.316283        | 8.420485        | 3.284162  |
(When I have a look at the alpha values for other cognate datasets from Lexibank, the distribution looks similar.) So far I have not been able to find other properties of the data that would explain alpha being high or low. This is why I would like to ask you whether you have any ideas about what the datasets with high/low alpha have in common, as you are more familiar with the data and its background than I am. Thank you!
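To illustrate what these alpha values mean in practice, here is a small sketch (not part of the analysis pipeline) that discretises the Gamma rate distribution into categories the way phylogenetic software commonly does, using SciPy. With a low alpha the per-site rates spread far apart; with a very high alpha they all collapse towards 1, i.e., near rate homogeneity:

```python
import numpy as np
from scipy.stats import gamma

def discrete_gamma_rates(alpha, k=4):
    """Median-based discretisation of a Gamma(alpha, 1/alpha) rate
    distribution into k rate categories (an approximation in the
    spirit of Yang's 1994 scheme)."""
    quantiles = np.array([(2 * i + 1) / (2 * k) for i in range(k)])
    rates = gamma.ppf(quantiles, a=alpha, scale=1 / alpha)
    return rates / rates.mean()  # normalise so the mean rate is 1

print(discrete_gamma_rates(0.6))   # low alpha: strongly unequal rates
print(discrete_gamma_rates(99.9))  # high alpha: rates all close to 1
```

The bimodal alpha values in the table thus mean that some datasets behave as if sites evolve at wildly different rates, while others look almost rate-homogeneous.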
Just for your reference, this is how alpha is distributed on a huge collection of empirical molecular datasets:
https://github.com/angtft/RAxMLGroveScripts/blob/main/figures/test_ALPHA.png
This is from this paper here:
https://academic.oup.com/bioinformatics/article/38/6/1741/6486526?login=false
So the extreme distribution and the outliers in terms of alpha values that we observed for the language datasets are very weird.
Yes, very weird. I don't see a pattern. Should we compute some basic characteristics of all datasets? I can offer average phonetic distances between words and between cognates, number of languages, word lengths, etc., and we see if we find ANY that explains the data? Or is this unscientific, as we explore and will eventually find something?
Yes, I would be interested in the characteristics you mention. Can you provide them for all cognate datasets in Lexibank, or only for the ones selected in this repository here? Thanks a lot!
My hypothesis was that this could somehow be associated with the subjective way by which linguists select sites for their datasets, maybe we should also check the authors of the datasets and see if there is some pattern.
The authors are all different people, I think.
Let us start with the params for the data in our repo here, but we can then extend to lexibank later.
| Dataset            | Words | Concepts | Languages | Diversity | Distances | SoundsTotal | SoundsAverage | WordLength | WordLengthAverage | WordsPerLanguage |
|--------------------|-------|----------|-----------|-----------|-----------|-------------|---------------|------------|-------------------|------------------|
| hattorijaponic     | 1710  | 197      | 10        | 0.03      | 0.09      | 61          | 34.60         | 4.10       | 4.08              | 171.00           |
| houchinese         | 1816  | 139      | 15        | 0.05      | 0.17      | 113         | 42.53         | 5.57       | 5.53              | 121.07           |
| felekesemitic      | 2412  | 150      | 19        | 0.05      | 0.28      | 76          | 44.68         | 4.67       | 4.65              | 126.95           |
| constenlachibchan  | 1214  | 106      | 24        | 0.10      | 0.42      | 65          | 20.88         | 3.10       | 3.08              | 50.58            |
| zhivlovobugrian    | 1879  | 110      | 20        | 0.04      | 0.20      | 66          | 32.55         | 3.44       | 3.45              | 93.95            |
| dravlex            | 1341  | 100      | 20        | 0.06      | 0.24      | 68          | 35.90         | 3.93       | 3.82              | 67.05            |
| walworthpolynesian | 6113  | 207      | 31        | 0.05      | 0.23      | 44          | 20.65         | 4.07       | 4.07              | 197.19           |
| robinsonap         | 1424  | 216      | 13        | 0.03      | 0.12      | 39          | 23.69         | 3.77       | 3.76              | 109.54           |
| leekoreanic        | 1960  | 205      | 14        | 0.01      | 0.06      | 40          | 36.71         | 3.91       | 3.91              | 140.00           |
| crossandean        | 2637  | 150      | 19        | 0.03      | 0.15      | 65          | 28.89         | 4.22       | 4.22              | 138.79           |
The Python script info.py computes these values from the trimmed data now. For Lexibank, it would have to be adjusted.
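For readers without access to the repo, here is a rough sketch of how a few of these characteristics could be computed from a `{language: [segmented words]}` mapping. This is my own illustration, not the actual code of info.py, and the toy wordlist is invented:

```python
from statistics import mean

def characteristics(wordlist):
    """Compute a few of the summary statistics discussed above.
    `wordlist` maps each language to a list of words, where each
    word is a tuple of sound segments."""
    all_words = [w for words in wordlist.values() for w in words]
    sounds = {s for w in all_words for s in w}  # distinct sounds overall
    return {
        "Languages": len(wordlist),
        "SoundsTotal": len(sounds),
        "WordLengthAverage": mean(len(w) for w in all_words),
        "WordsPerLanguage": mean(len(words) for words in wordlist.values()),
    }

toy = {
    "lang_a": [("k", "a", "t"), ("h", "u", "s")],
    "lang_b": [("k", "a", "t", "a"), ("h", "u")],
}
print(characteristics(toy))
```

The distance-based columns (Diversity, Distances) would additionally need a phonetic alignment or edit-distance measure between word pairs, which is omitted here.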
Thanks a lot! I checked it out and calculated the Pearson correlation between the alpha values of the different data types and the provided properties:

| alpha                 | property          | Pearson correlation | p-value |
|-----------------------|-------------------|---------------------|---------|
| alpha_cognate_classes | Concepts          | 0.27                | 0.45    |
| alpha_cognate_classes | Languages         | -0.50               | 0.14    |
| alpha_cognate_classes | Diversity         | -0.41               | 0.24    |
| alpha_cognate_classes | Distances         | -0.42               | 0.23    |
| alpha_cognate_classes | SoundsTotal       | -0.28               | 0.43    |
| alpha_cognate_classes | SoundsAverage     | -0.14               | 0.71    |
| alpha_cognate_classes | WordLength        | -0.31               | 0.38    |
| alpha_cognate_classes | WordLengthAverage | -0.30               | 0.40    |
| alpha_cognate_classes | WordsPerLanguage  | 0.06                | 0.87    |
| alpha_correspondences | Concepts          | -0.03               | 0.93    |
| alpha_correspondences | Languages         | -0.17               | 0.63    |
| alpha_correspondences | Diversity         | 0.33                | 0.36    |
| alpha_correspondences | Distances         | 0.26                | 0.47    |
| alpha_correspondences | SoundsTotal       | -0.05               | 0.89    |
| alpha_correspondences | SoundsAverage     | -0.29               | 0.42    |
| alpha_correspondences | WordLength        | -0.38               | 0.28    |
| alpha_correspondences | WordLengthAverage | -0.38               | 0.28    |
| alpha_correspondences | WordsPerLanguage  | -0.13               | 0.72    |
| alpha_combined        | Concepts          | 0.30                | 0.40    |
| alpha_combined        | Languages         | -0.51               | 0.13    |
| alpha_combined        | Diversity         | -0.28               | 0.43    |
| alpha_combined        | Distances         | -0.37               | 0.30    |
| alpha_combined        | SoundsTotal       | -0.03               | 0.93    |
| alpha_combined        | SoundsAverage     | 0.11                | 0.77    |
| alpha_combined        | WordLength        | 0.01                | 0.98    |
| alpha_combined        | WordLengthAverage | 0.01                | 0.97    |
| alpha_combined        | WordsPerLanguage  | 0.38                | 0.29    |
To me it seems as if there is nothing really significant, unfortunately.
(@LinguList I had to remove the `raise` in l. 20 of your info.py script in order to get it to run.)
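For reference, such correlations can be computed with `scipy.stats.pearsonr`. A tiny sketch with purely illustrative toy numbers (not values from the study):

```python
from scipy.stats import pearsonr

# Toy data: alpha values of five hypothetical datasets and one
# dataset property; the real analysis loops over all properties
# and all three data types.
alpha = [1.3, 0.6, 99.9, 2.4, 8.3]
prop = [10, 24, 13, 15, 14]

r, p = pearsonr(alpha, prop)
print(f"r = {r:.2f}, p = {p:.2f}")
```

Note that with only 10 datasets, even a correlation of |r| around 0.5 does not reach significance, which matches the high p-values in the table.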
Yes, sorry, I used that first, for testing, then forgot to push the up-to-date version.
It shows that there is not really an easy solution to all of this, but that several random factors seem to contribute. We could check against general scores such as the delta score from splits networks, or something similar; I don't know if it is reticulation that is causing these problems.
@luisevonderwiese and @gerhardJaeger, as discussed via email, we should now use each correspondence pattern only once, without the frequency information, but please filter out irregular patterns by only considering those patterns that occur more than n times. So you check the frequency in the files and take, e.g., only those appearing more than 2 times (3 occurrences is a good number in my opinion), and then run your experiments. In this way, we account for the idea of regularity here.
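A minimal sketch of this filtering step, assuming patterns are given as a flat list of occurrences; the string representation of a pattern is hypothetical, the point is only the counting and thresholding:

```python
from collections import Counter

def regular_patterns(patterns, min_occurrences=3):
    """Keep each correspondence pattern exactly once, but only if it
    occurs at least `min_occurrences` times in the data; a rough
    sketch of the filtering step described above."""
    counts = Counter(patterns)
    return sorted({p for p, n in counts.items() if n >= min_occurrences})

# Toy occurrences of three sound-correspondence patterns.
sites = ["p:p:f", "p:p:f", "p:p:f", "t:t:θ", "t:t:θ", "k:k:x"]
print(regular_patterns(sites))  # → ['p:p:f'] (only it occurs ≥ 3 times)
```

With `min_occurrences=3` this implements exactly the "frequency > 2" cut-off discussed in the thread, while each surviving pattern enters the alignment once rather than once per occurrence.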