Closed tpellard closed 1 week ago
Ah, we have to handle the borrowings in our CLDF conversion, but there's no time to do so now.
If the XML has 3447 sites vs. 3193 cognate sets, could I ask you, @tpellard, to check whether the 16_Eurasia3angle*254.xls file by any chance has 3448 lines? That would mean that my conversion was faulty. The number of distinct cognate sets should, of course, match between the EDICTOR / CLDF version and the spreadsheet.
Checking against the EDICTOR data, we have:
In [1]: from pyedictor import fetch
In [2]: wl = fetch("robbeetsaltaic", to_lingpy=True)
In [3]: etd = wl.get_etymdict(ref="cogid")
In [4]: len(etd)
Out[4]: 3173
BEAST has this strange regulation that says you have to add 000000-lines (only zeros) in your nexus to mark some kind of an analysis, @SimonGreenhill can explain this. If they do concept-wise analyses, it may be possible that they require this empty line to be added for each concept.
I do not know why we are short 20 cognate sets in EDICTOR. It can be that I did not add cells that were empty for some reason. Checking this will require some time I do not have now, unfortunately.
It’s for ascertainment bias.
Yes, each partition in the analysis needs to have a single column of all zeros. This tells BEAST that this type of pattern is never seen (i.e. linguists never collect cognates that do not exist in any of the languages). It is then used to correct the likelihood for this 'ascertainment bias'.
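For reference, the usual form of this correction can be written as below. This is a sketch of the standard conditional likelihood, not copied from the BEAST source: $P(D_i \mid \theta)$ is the likelihood of observed site $i$, $P(\mathbf{0} \mid \theta)$ is the likelihood of the all-zero pattern, and $n$ is the number of observed sites in the partition.

```latex
\log L_{\text{corrected}}
  = \sum_{i=1}^{n} \log P(D_i \mid \theta)
  \;-\; n \,\log\bigl(1 - P(\mathbf{0} \mid \theta)\bigr)
```

The dummy all-zero column is what allows $P(\mathbf{0} \mid \theta)$ to be evaluated with that partition's own parameters.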
However, there is something off here.
The XML file, tea254cov-ucln-fbd-constrained.xml, which is their best fitting model, has 3447 sites in the alignment. The analysis is set up to put each word into its own partition, so there should be N all-zero columns, and we know which ones they should be from the XML (they are the ones marked as in this block):
<data id="orgdata.go" spec="FilteredAlignment" ascertained="true" excludeto="1" filter="-">
<data id="go" spec="FilteredAlignment" data="@tea254" filter="23-31"/>
...
</data>
i.e. this block says that the partition for the word 'go' contains the sites 23-31 from the main alignment called tea254, and that the first character (excludeto="1") is filtered out for ascertainment correction (ascertained="true").
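A minimal Python sketch of how such a filter picks columns, assuming the filter is a 1-based inclusive range and that the first column of each partition is the ascertainment column. The function names and the toy alignment are hypothetical; this is not BEAST code.

```python
def parse_filter(spec):
    """Turn a 1-based inclusive range such as '23-31' into a Python slice."""
    start, end = (int(x) for x in spec.split("-"))
    return slice(start - 1, end)


def partition(alignment, spec):
    """Cut the columns named by `spec` out of every taxon's 0/1 string."""
    cols = parse_filter(spec)
    return {taxon: seq[cols] for taxon, seq in alignment.items()}


# Toy alignment (made-up data): 3 taxa, 6 sites; sites 2-5 form one partition.
toy = {"A": "100101", "B": "100011", "C": "000110"}
part = partition(toy, "2-5")

# The first column of the partition should contain only zeros.
first_column = {seq[0] for seq in part.values()}
```

If `first_column` contains anything other than "0", the partition boundaries and the ascertainment columns are out of step.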
OK, of the 3447 sites in the alignment, 283 are empty. Most of these are tagged as ascertainment correction sites, except for 31 sites: [122, 188, 375, 776, 826, 885, 917, 995, 1052, 1103, 1474, 1514, 1534, 1647, 2107, 2155, 2250, 2292, 2293, 2468, 2495, 2502, 2557, 2625, 2696, 2838, 3045, 3123, 3137, 3189, 3359]
Alignment 1: tea254, 3447 sites, states=0,1
1: Ascertainment Bias Site
9: Ascertainment Bias Site
23: Ascertainment Bias Site
32: Ascertainment Bias Site
37: Ascertainment Bias Site
44: Ascertainment Bias Site
51: Ascertainment Bias Site
57: Ascertainment Bias Site
64: Ascertainment Bias Site
74: Ascertainment Bias Site
95: Ascertainment Bias Site
103: Ascertainment Bias Site
116: Ascertainment Bias Site
122: Empty Site with 98x0, 0x? of 98 sites
129: Ascertainment Bias Site
135: Ascertainment Bias Site
142: Ascertainment Bias Site
153: Ascertainment Bias Site
164: Ascertainment Bias Site
177: Ascertainment Bias Site
188: Empty Site with 98x0, 0x? of 98 sites
209: Ascertainment Bias Site
218: Ascertainment Bias Site
224: Ascertainment Bias Site
233: Ascertainment Bias Site
244: Ascertainment Bias Site
261: Ascertainment Bias Site
273: Ascertainment Bias Site
281: Ascertainment Bias Site
287: Ascertainment Bias Site
304: Ascertainment Bias Site
309: Ascertainment Bias Site
320: Ascertainment Bias Site
343: Ascertainment Bias Site
350: Ascertainment Bias Site
358: Ascertainment Bias Site
371: Ascertainment Bias Site
375: Empty Site with 98x0, 0x? of 98 sites
400: Ascertainment Bias Site
414: Ascertainment Bias Site
425: Ascertainment Bias Site
430: Ascertainment Bias Site
445: Ascertainment Bias Site
456: Ascertainment Bias Site
465: Ascertainment Bias Site
475: Ascertainment Bias Site
487: Ascertainment Bias Site
494: Ascertainment Bias Site
512: Ascertainment Bias Site
533: Ascertainment Bias Site
544: Ascertainment Bias Site
554: Ascertainment Bias Site
559: Ascertainment Bias Site
581: Ascertainment Bias Site
599: Ascertainment Bias Site
607: Ascertainment Bias Site
620: Ascertainment Bias Site
636: Ascertainment Bias Site
642: Ascertainment Bias Site
656: Ascertainment Bias Site
665: Ascertainment Bias Site
681: Ascertainment Bias Site
702: Ascertainment Bias Site
709: Ascertainment Bias Site
721: Ascertainment Bias Site
739: Ascertainment Bias Site
752: Ascertainment Bias Site
759: Ascertainment Bias Site
765: Ascertainment Bias Site
776: Empty Site with 98x0, 0x? of 98 sites
786: Ascertainment Bias Site
801: Ascertainment Bias Site
813: Ascertainment Bias Site
826: Empty Site with 98x0, 0x? of 98 sites
835: Ascertainment Bias Site
852: Ascertainment Bias Site
861: Ascertainment Bias Site
871: Ascertainment Bias Site
885: Empty Site with 98x0, 0x? of 98 sites
891: Ascertainment Bias Site
902: Ascertainment Bias Site
917: Empty Site with 98x0, 0x? of 98 sites
928: Ascertainment Bias Site
947: Ascertainment Bias Site
952: Ascertainment Bias Site
966: Ascertainment Bias Site
973: Ascertainment Bias Site
993: Ascertainment Bias Site
995: Empty Site with 98x0, 0x? of 98 sites
1009: Ascertainment Bias Site
1015: Ascertainment Bias Site
1022: Ascertainment Bias Site
1030: Ascertainment Bias Site
1038: Ascertainment Bias Site
1049: Ascertainment Bias Site
1052: Empty Site with 98x0, 0x? of 98 sites
1070: Ascertainment Bias Site
1081: Ascertainment Bias Site
1099: Ascertainment Bias Site
1103: Empty Site with 98x0, 0x? of 98 sites
1129: Ascertainment Bias Site
1142: Ascertainment Bias Site
1160: Ascertainment Bias Site
1170: Ascertainment Bias Site
1192: Ascertainment Bias Site
1215: Ascertainment Bias Site
1222: Ascertainment Bias Site
1229: Ascertainment Bias Site
1244: Ascertainment Bias Site
1273: Ascertainment Bias Site
1283: Ascertainment Bias Site
1290: Ascertainment Bias Site
1297: Ascertainment Bias Site
1322: Ascertainment Bias Site
1328: Ascertainment Bias Site
1340: Ascertainment Bias Site
1349: Ascertainment Bias Site
1360: Ascertainment Bias Site
1376: Ascertainment Bias Site
1392: Ascertainment Bias Site
1402: Ascertainment Bias Site
1417: Ascertainment Bias Site
1423: Ascertainment Bias Site
1433: Ascertainment Bias Site
1448: Ascertainment Bias Site
1463: Ascertainment Bias Site
1469: Ascertainment Bias Site
1474: Empty Site with 98x0, 0x? of 98 sites
1478: Ascertainment Bias Site
1502: Ascertainment Bias Site
1514: Empty Site with 98x0, 0x? of 98 sites
1524: Ascertainment Bias Site
1534: Empty Site with 98x0, 0x? of 98 sites
1543: Ascertainment Bias Site
1555: Ascertainment Bias Site
1580: Ascertainment Bias Site
1595: Ascertainment Bias Site
1605: Ascertainment Bias Site
1624: Ascertainment Bias Site
1634: Ascertainment Bias Site
1643: Ascertainment Bias Site
1647: Empty Site with 98x0, 0x? of 98 sites
1657: Ascertainment Bias Site
1679: Ascertainment Bias Site
1696: Ascertainment Bias Site
1709: Ascertainment Bias Site
1723: Ascertainment Bias Site
1734: Ascertainment Bias Site
1742: Ascertainment Bias Site
1751: Ascertainment Bias Site
1763: Ascertainment Bias Site
1775: Ascertainment Bias Site
1791: Ascertainment Bias Site
1805: Ascertainment Bias Site
1819: Ascertainment Bias Site
1837: Ascertainment Bias Site
1847: Ascertainment Bias Site
1859: Ascertainment Bias Site
1873: Ascertainment Bias Site
1886: Ascertainment Bias Site
1902: Ascertainment Bias Site
1915: Ascertainment Bias Site
1923: Ascertainment Bias Site
1936: Ascertainment Bias Site
1942: Ascertainment Bias Site
1951: Ascertainment Bias Site
1966: Ascertainment Bias Site
1973: Ascertainment Bias Site
2000: Ascertainment Bias Site
2014: Ascertainment Bias Site
2035: Ascertainment Bias Site
2045: Ascertainment Bias Site
2053: Ascertainment Bias Site
2067: Ascertainment Bias Site
2092: Ascertainment Bias Site
2104: Ascertainment Bias Site
2107: Empty Site with 98x0, 0x? of 98 sites
2117: Ascertainment Bias Site
2134: Ascertainment Bias Site
2137: Ascertainment Bias Site
2155: Empty Site with 98x0, 0x? of 98 sites
2167: Ascertainment Bias Site
2193: Ascertainment Bias Site
2201: Ascertainment Bias Site
2214: Ascertainment Bias Site
2227: Ascertainment Bias Site
2240: Ascertainment Bias Site
2250: Empty Site with 98x0, 0x? of 98 sites
2253: Ascertainment Bias Site
2272: Ascertainment Bias Site
2286: Ascertainment Bias Site
2292: Empty Site with 98x0, 0x? of 98 sites
2293: Empty Site with 98x0, 0x? of 98 sites
2298: Ascertainment Bias Site
2317: Ascertainment Bias Site
2336: Ascertainment Bias Site
2344: Ascertainment Bias Site
2354: Ascertainment Bias Site
2363: Ascertainment Bias Site
2370: Ascertainment Bias Site
2387: Ascertainment Bias Site
2404: Ascertainment Bias Site
2415: Ascertainment Bias Site
2437: Ascertainment Bias Site
2449: Ascertainment Bias Site
2456: Ascertainment Bias Site
2468: Empty Site with 98x0, 0x? of 98 sites
2485: Ascertainment Bias Site
2489: Ascertainment Bias Site
2495: Empty Site with 98x0, 0x? of 98 sites
2502: Empty Site with 98x0, 0x? of 98 sites
2505: Ascertainment Bias Site
2519: Ascertainment Bias Site
2527: Ascertainment Bias Site
2541: Ascertainment Bias Site
2553: Ascertainment Bias Site
2557: Empty Site with 98x0, 0x? of 98 sites
2577: Ascertainment Bias Site
2593: Ascertainment Bias Site
2605: Ascertainment Bias Site
2615: Ascertainment Bias Site
2624: Ascertainment Bias Site
2625: Empty Site with 98x0, 0x? of 98 sites
2654: Ascertainment Bias Site
2677: Ascertainment Bias Site
2690: Ascertainment Bias Site
2696: Empty Site with 98x0, 0x? of 98 sites
2705: Ascertainment Bias Site
2718: Ascertainment Bias Site
2751: Ascertainment Bias Site
2757: Ascertainment Bias Site
2767: Ascertainment Bias Site
2777: Ascertainment Bias Site
2791: Ascertainment Bias Site
2809: Ascertainment Bias Site
2818: Ascertainment Bias Site
2831: Ascertainment Bias Site
2838: Empty Site with 98x0, 0x? of 98 sites
2844: Ascertainment Bias Site
2849: Ascertainment Bias Site
2870: Ascertainment Bias Site
2880: Ascertainment Bias Site
2893: Ascertainment Bias Site
2911: Ascertainment Bias Site
2919: Ascertainment Bias Site
2941: Ascertainment Bias Site
2957: Ascertainment Bias Site
2966: Ascertainment Bias Site
2976: Ascertainment Bias Site
2999: Ascertainment Bias Site
3010: Ascertainment Bias Site
3028: Ascertainment Bias Site
3041: Ascertainment Bias Site
3045: Empty Site with 98x0, 0x? of 98 sites
3058: Ascertainment Bias Site
3077: Ascertainment Bias Site
3092: Ascertainment Bias Site
3115: Ascertainment Bias Site
3123: Empty Site with 98x0, 0x? of 98 sites
3130: Ascertainment Bias Site
3137: Empty Site with 98x0, 0x? of 98 sites
3143: Ascertainment Bias Site
3169: Ascertainment Bias Site
3176: Ascertainment Bias Site
3186: Ascertainment Bias Site
3189: Empty Site with 98x0, 0x? of 98 sites
3206: Ascertainment Bias Site
3226: Ascertainment Bias Site
3238: Ascertainment Bias Site
3257: Ascertainment Bias Site
3274: Ascertainment Bias Site
3294: Ascertainment Bias Site
3303: Ascertainment Bias Site
3319: Ascertainment Bias Site
3343: Ascertainment Bias Site
3358: Ascertainment Bias Site
3359: Empty Site with 98x0, 0x? of 98 sites
3375: Ascertainment Bias Site
3381: Ascertainment Bias Site
3388: Ascertainment Bias Site
3408: Ascertainment Bias Site
3427: Ascertainment Bias Site
3436: Ascertainment Bias Site
Alignment has 31/3447 problematic sites
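A check of this kind can be sketched in a few lines of Python. This is a hypothetical re-implementation, not the script that produced the listing above; it assumes the alignment is a dict of taxon to 0/1/? string and that the ascertainment positions are known as 1-based site numbers.

```python
def problematic_sites(alignment, ascertainment_sites):
    """Return 1-based positions of empty columns not reserved for ascertainment.

    A column counts as empty when no taxon has a '1' in it
    (only '0' or '?' states).
    """
    seqs = list(alignment.values())
    n_sites = len(seqs[0])
    if any(len(seq) != n_sites for seq in seqs):
        raise ValueError("sequences differ in length")
    empty = [
        i + 1
        for i in range(n_sites)
        if all(seq[i] in "0?" for seq in seqs)
    ]
    return [site for site in empty if site not in ascertainment_sites]


# Toy data (made up): two partitions, sites 1-4 and 5-7,
# with ascertainment columns at sites 1 and 5.
toy = {
    "A": "0100010",
    "B": "0010000",
    "C": "0110010",
}
bad = problematic_sites(toy, ascertainment_sites={1, 5})
```

Here sites 4 and 7 are empty but not reserved, so they are the ones reported.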
This looks very much like what happens when someone deletes a language from the analysis, which is why we had to write a correction to the Bouckaert et al. Indo-European paper: removing a language empties any cognate sets that belong only to that language.
This was a problem for Indo-European because we deleted 13/116 languages, leading to 283 empty sites, and this affected the root age, moving it from 8,466 (7,116-10,410) to 7,579 (5,972-9,351) years. Now, R et al. here have fewer sites in this category, but it suggests that the inferred age could be somewhat inflated.
If we add the deleted Korean back in to the XML, then we get some back:
Alignment has 26/3447 problematic sites
i.e. the following have become ok: 3189, 2838, 1103, 1052, 995, still leaving quite a number of all zero sites.
So there is at least one more language, or perhaps 3 or 4, in the nexus or the XML that got deleted.
If @chrzyki is running these analyses, then I can easily cull the empty sites and send him an XML to run to see if it makes a difference.
Sounds good! I've created a number of different variations of the XML (together with Russell); some have finished, some are still running (e.g. particles 4/10). Happy to run another version as well. :)
@LinguList
There are 3193 rows (= cognate sets) in 16_Eurasia3angle_synthesis_SI 1_BV 254.xls. In Edictor, the following cogids are missing:
109 709 1356 1394 1413 1518 1944 2077 2116 2117 2279 2304 2311 2361 2424 2492 2816 2890 2903 3111
I've checked the .xls file, the rows missing from Edictor are those that have only empty values in all language columns.
There are also 7 empty columns in 16_Eurasia3angle_synthesis_SI 1_BV 254.xls between Hachijo and Eastern Evenki. @LinguList, have you noticed that the final 4 columns have a fused cell for their language name? There are 2 columns under the heading "Eastern Evenki" and 2 under "Southern Evenki (Vershina-Tutury, Baikal)".
Ouch. That is terrible. Thanks for checking! It means we need to adjust our procedure of conversion, or at least check it.
And if all cells in a row are empty, I excluded the row, of course, which shows yet another inconsistency in the original data. Since we only show existing words, we cannot show non-existing ones. But we could extend our procedure to report automatically which rows are entirely empty.
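Such a report could look roughly like the following Python sketch. It is not the conversion code of this repository: the column names, the IDs and the placeholder forms are all made up, and the real CSV layout may differ.

```python
import csv
import io


def empty_rows(handle, id_column, language_columns):
    """Yield the ID of every row whose language cells are all blank."""
    for row in csv.DictReader(handle):
        if not any((row.get(col) or "").strip() for col in language_columns):
            yield row[id_column]


# Made-up sample with placeholder forms; row 2 has no form in any language.
sample = io.StringIO(
    "ID,Concept,LangA,LangB,LangC\n"
    "1,go,aaa,bbb,\n"
    "2,go,,,\n"
    "3,count,ccc,,\n"
)
missing = list(empty_rows(sample, "ID", ["LangA", "LangB", "LangC"]))
for row_id in missing:
    print(f"warning: row {row_id} has no forms in any language")
```

Printing the IDs rather than silently skipping the rows makes it easy to compare them with the list of cogids missing from EDICTOR.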
Some borrowings are not marked as "FORM_bor" but as "FORM bor". It seems that they are nevertheless encoded as "1" in the BEAST xml files (I manually checked one example, but I don't know how to do that easily for others).
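To catch both spellings when reading the cells, a tolerant pattern could be used. The sketch below is hypothetical (the function name and the example forms are invented) and assumes the marker always sits at the end of the cell.

```python
import re

# Matches a trailing "_bor" or " bor" marker.
BORROWING = re.compile(r"[ _]bor\s*$")


def split_borrowing(cell):
    """Return (form, is_borrowing) for one spreadsheet cell."""
    cell = cell.strip()
    match = BORROWING.search(cell)
    if match:
        return cell[: match.start()].strip(), True
    return cell, False


# Invented forms, for illustration only.
examples = ["toga_bor", "toga bor", "toga", "tabor"]
results = [split_borrowing(cell) for cell in examples]
```

A genuine form whose last word happens to be "bor" would be misread by this pattern, so the flagged cells are worth a manual look.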
@tpellard, I have converted the data to CSV before reading the data in. The empty columns are accounted for in my code. The strangely merged columns are also accounted for: The second column is displayed as headerless column in the CSV, and I only read in columns with a language. The CSV is online here.
The code seems to do what it is expected to do, all we'd need to add would be a statement to warn if a whole row is completely empty, and maybe an indicator on whether they annotate something as a potential borrowing.
The following are the relevant lines in the iteration.
I think that coding obvious borrowings between the languages under study as cognates is problematic since it introduces noise in the phylogenetic signal and discards important information. What do you think @SimonGreenhill and @RustyGray ? How are such cases usually treated in phylogenetic analyses?
Hi, well, one way of treating known loans is to just exclude them. Another is to count them only after they get transmitted in a subgroup, e.g. for deep loans the initial borrowing doesn't count, but subsequently they become cognates. We have a lengthy discussion of possible coding practices in the supplement of our IE ms that we hope to submit soon. Cheers, Russell.
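The first option (excluding known loans) can be sketched as follows in Python. This is an illustration under stated assumptions, not the coding used in any of the analyses discussed here: the input format, the function name and the toy data are invented, and excluded loans are coded as absent ("0") rather than missing ("?"), which is itself a design choice.

```python
def code_binary(entries, languages, exclude_borrowings=True):
    """Build one 0/1 string per language from (language, cogid, is_borrowing)."""
    members = {}
    for language, cogid, is_borrowing in entries:
        if exclude_borrowings and is_borrowing:
            continue  # a known loan does not count as a member of the set
        members.setdefault(cogid, set()).add(language)
    cogids = sorted(members)  # sets left without members get no column
    matrix = {
        lang: "".join("1" if lang in members[c] else "0" for c in cogids)
        for lang in languages
    }
    return cogids, matrix


# Toy data (invented): set 1 is inherited in M and borrowed into T and G.
entries = [
    ("M", 1, False), ("T", 1, True), ("G", 1, True),
    ("T", 2, False),
    ("G", 3, True),
]
cogids, matrix = code_binary(entries, ["M", "T", "G"])
```

Because sets that lose all their members are dropped, this coding does not create extra all-zero columns of the kind discussed above.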
Are those borrowings singletons, i.e. pertaining to one word form alone, "cognate sets" of size 1? If so, they have effectively been excluded by placing them into their own cognate set, which does not really affect the subgrouping. If not, the coding is problematic, specifically when they recur in more than one family.
Quite a number of these borrowings involve two or three families, for instance Mongolic borrowings into some Turkic and Tungusic languages.
-- Guillaume Jacques
For instance in COGID #2412 'count (v.)' all Tungusic and Turkic forms are marked as borrowings, leaving only Mongolic forms in the cognate set.
Some forms in the spreadsheet 16_Eurasia3angle_synthesis_SI 1_BV 254.xls are marked as borrowings by the authors (forms ending in "_bor"). How are they handled in the phylogenetic analysis?
Looking at the XML files in 39_Eurasia3angle_synthesis_SI 19_XML files_languages.zip, I found the 0/1 sequences for each taxon, but their length is 3447 although there are only 3193 cognate sets. To what do the 254 extra digits correspond? Does it have something to do with the fact that there are 254 concepts? How can I check whether borrowings are assigned a 1 or a 0?