lexibank / robbeetstriangulation

CLDF dataset derived from Robbeets et al.'s "Triangulation of the Transeurasian Languages" from 2021
Creative Commons Attribution 4.0 International

Handling of borrowings in the BEAST files #8

Closed tpellard closed 1 week ago

tpellard commented 3 years ago

Some forms in the spreadsheet 16_Eurasia3angle_synthesis_SI 1_BV 254.xls are marked as borrowings by the authors (forms ending in "_bor"). How are they handled in the phylogenetic analysis?

Looking at the XML files in 39_Eurasia3angle_synthesis_SI 19_XML files_languages.zip, I found the 0/1 sequences for each taxon, but their length is 3447 although there are only 3193 cognate sets. To what do the 254 extra digits correspond? Does it have something to do with the fact that there are 254 concepts? How can I check whether borrowings are assigned a 1 or a 0?

LinguList commented 3 years ago

Ah, we have to handle the borrowings in our CLDF conversion, but there's no time to do so now.

If the XML has 3447, vs. 3193, could I ask you, @tpellard, to have a look at the 16_Eurasia3angle*254.xls file to see if it by any chance has 3448 lines? This would mean that my conversion was faulty. The number of distinct cognate sets should match, of course, in the EDICTOR / CLDF version and the spreadsheet.

LinguList commented 3 years ago

Checking against the EDICTOR data, we have:

In [1]: from pyedictor import fetch

In [2]: wl = fetch("robbeetsaltaic", to_lingpy=True)

In [3]: etd = wl.get_etymdict(ref="cogid")

In [4]: len(etd)
Out[4]: 3173
LinguList commented 3 years ago

BEAST has this strange regulation that says you have to add 000000-lines (only zeros) in your nexus to mark some kind of an analysis, @SimonGreenhill can explain this. If they do concept-wise analyses, it may be possible that they require this empty line to be added for each concept.

LinguList commented 3 years ago

I do not know why we are short 20 cognate sets in EDICTOR. It can be that I did not add cells that were empty for some reason. Checking this will require some time I do not have now, unfortunately.

RustyGray commented 3 years ago

It’s for ascertainment bias.

SimonGreenhill commented 3 years ago

yes, each partition in the analysis needs to have a single column of all zeros. This is used to tell BEAST that this type of pattern is never seen (i.e. linguists never collect cognates that do not exist in any of the languages). This is then used to correct the likelihood for this 'ascertainment bias'.

However, there is something off here.

The XML file, tea254cov-ucln-fbd-constrained.xml, which is their best-fitting model, has 3447 sites in the alignment. The analysis is set up to put each word into its own partition, so there should be N all-zero columns, one per partition, and we know which ones they should be from the XML (they are the ones declared in blocks like this one):

<data id="orgdata.go" spec="FilteredAlignment" ascertained="true" excludeto="1" filter="-">
    <data id="go" spec="FilteredAlignment" data="@tea254" filter="23-31"/>
    ...
</data>

i.e. this block says that the partition for the word 'go' contains sites 23-31 from the main alignment called tea254, and that its first character is filtered out (excludeto=1) for ascertainment correction (ascertained=true).
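
A minimal sketch of how the ascertainment sites could be read off the XML, assuming every partition is declared as in the block above (an outer FilteredAlignment with ascertained="true" wrapping an inner one with a "start-end" filter). The second partition ("hand", 32-36) is invented for illustration, and the function name is hypothetical:

```python
import xml.etree.ElementTree as ET

# Excerpt in the style of the block above. The "hand" partition is made up.
XML = """
<beast>
  <data id="orgdata.go" spec="FilteredAlignment" ascertained="true" excludeto="1" filter="-">
    <data id="go" spec="FilteredAlignment" data="@tea254" filter="23-31"/>
  </data>
  <data id="orgdata.hand" spec="FilteredAlignment" ascertained="true" excludeto="1" filter="-">
    <data id="hand" spec="FilteredAlignment" data="@tea254" filter="32-36"/>
  </data>
</beast>
"""


def ascertainment_sites(xml_text):
    """Map each partition id to the 1-based site used for ascertainment correction."""
    root = ET.fromstring(xml_text)
    sites = {}
    for outer in root.iter("data"):
        # Only the outer wrappers carry ascertained="true".
        if outer.get("ascertained") != "true":
            continue
        for inner in outer.findall("data"):
            start, _end = inner.get("filter").split("-")
            # Assumption: excludeto="1", so the first column of the range is the one excluded.
            sites[inner.get("id")] = int(start)
    return sites


sites = ascertainment_sites(XML)
print(sites)
```

With the real file one would pass the content of the XML instead of the excerpt; the resulting site numbers are what the empty columns in the alignment need to be compared against.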

Ok, of the 3447 sites in the alignment, 283 are empty. Most of these are tagged as ascertainment correction sites, except for 31 sites: [122, 188, 375, 776, 826, 885, 917, 995, 1052, 1103, 1474, 1514, 1534, 1647, 2107, 2155, 2250, 2292, 2293, 2468, 2495, 2502, 2557, 2625, 2696, 2838, 3045, 3123, 3137, 3189, 3359]

Alignment 1: tea254, 3447 sites, states=0,1
    1: Ascertainment Bias Site
    9: Ascertainment Bias Site
   23: Ascertainment Bias Site
   32: Ascertainment Bias Site
   37: Ascertainment Bias Site
   44: Ascertainment Bias Site
   51: Ascertainment Bias Site
   57: Ascertainment Bias Site
   64: Ascertainment Bias Site
   74: Ascertainment Bias Site
   95: Ascertainment Bias Site
  103: Ascertainment Bias Site
  116: Ascertainment Bias Site
  122: Empty Site with  98x0,   0x? of 98 sites
  129: Ascertainment Bias Site
  135: Ascertainment Bias Site
  142: Ascertainment Bias Site
  153: Ascertainment Bias Site
  164: Ascertainment Bias Site
  177: Ascertainment Bias Site
  188: Empty Site with  98x0,   0x? of 98 sites
  209: Ascertainment Bias Site
  218: Ascertainment Bias Site
  224: Ascertainment Bias Site
  233: Ascertainment Bias Site
  244: Ascertainment Bias Site
  261: Ascertainment Bias Site
  273: Ascertainment Bias Site
  281: Ascertainment Bias Site
  287: Ascertainment Bias Site
  304: Ascertainment Bias Site
  309: Ascertainment Bias Site
  320: Ascertainment Bias Site
  343: Ascertainment Bias Site
  350: Ascertainment Bias Site
  358: Ascertainment Bias Site
  371: Ascertainment Bias Site
  375: Empty Site with  98x0,   0x? of 98 sites
  400: Ascertainment Bias Site
  414: Ascertainment Bias Site
  425: Ascertainment Bias Site
  430: Ascertainment Bias Site
  445: Ascertainment Bias Site
  456: Ascertainment Bias Site
  465: Ascertainment Bias Site
  475: Ascertainment Bias Site
  487: Ascertainment Bias Site
  494: Ascertainment Bias Site
  512: Ascertainment Bias Site
  533: Ascertainment Bias Site
  544: Ascertainment Bias Site
  554: Ascertainment Bias Site
  559: Ascertainment Bias Site
  581: Ascertainment Bias Site
  599: Ascertainment Bias Site
  607: Ascertainment Bias Site
  620: Ascertainment Bias Site
  636: Ascertainment Bias Site
  642: Ascertainment Bias Site
  656: Ascertainment Bias Site
  665: Ascertainment Bias Site
  681: Ascertainment Bias Site
  702: Ascertainment Bias Site
  709: Ascertainment Bias Site
  721: Ascertainment Bias Site
  739: Ascertainment Bias Site
  752: Ascertainment Bias Site
  759: Ascertainment Bias Site
  765: Ascertainment Bias Site
  776: Empty Site with  98x0,   0x? of 98 sites
  786: Ascertainment Bias Site
  801: Ascertainment Bias Site
  813: Ascertainment Bias Site
  826: Empty Site with  98x0,   0x? of 98 sites
  835: Ascertainment Bias Site
  852: Ascertainment Bias Site
  861: Ascertainment Bias Site
  871: Ascertainment Bias Site
  885: Empty Site with  98x0,   0x? of 98 sites
  891: Ascertainment Bias Site
  902: Ascertainment Bias Site
  917: Empty Site with  98x0,   0x? of 98 sites
  928: Ascertainment Bias Site
  947: Ascertainment Bias Site
  952: Ascertainment Bias Site
  966: Ascertainment Bias Site
  973: Ascertainment Bias Site
  993: Ascertainment Bias Site
  995: Empty Site with  98x0,   0x? of 98 sites
 1009: Ascertainment Bias Site
 1015: Ascertainment Bias Site
 1022: Ascertainment Bias Site
 1030: Ascertainment Bias Site
 1038: Ascertainment Bias Site
 1049: Ascertainment Bias Site
 1052: Empty Site with  98x0,   0x? of 98 sites
 1070: Ascertainment Bias Site
 1081: Ascertainment Bias Site
 1099: Ascertainment Bias Site
 1103: Empty Site with  98x0,   0x? of 98 sites
 1129: Ascertainment Bias Site
 1142: Ascertainment Bias Site
 1160: Ascertainment Bias Site
 1170: Ascertainment Bias Site
 1192: Ascertainment Bias Site
 1215: Ascertainment Bias Site
 1222: Ascertainment Bias Site
 1229: Ascertainment Bias Site
 1244: Ascertainment Bias Site
 1273: Ascertainment Bias Site
 1283: Ascertainment Bias Site
 1290: Ascertainment Bias Site
 1297: Ascertainment Bias Site
 1322: Ascertainment Bias Site
 1328: Ascertainment Bias Site
 1340: Ascertainment Bias Site
 1349: Ascertainment Bias Site
 1360: Ascertainment Bias Site
 1376: Ascertainment Bias Site
 1392: Ascertainment Bias Site
 1402: Ascertainment Bias Site
 1417: Ascertainment Bias Site
 1423: Ascertainment Bias Site
 1433: Ascertainment Bias Site
 1448: Ascertainment Bias Site
 1463: Ascertainment Bias Site
 1469: Ascertainment Bias Site
 1474: Empty Site with  98x0,   0x? of 98 sites
 1478: Ascertainment Bias Site
 1502: Ascertainment Bias Site
 1514: Empty Site with  98x0,   0x? of 98 sites
 1524: Ascertainment Bias Site
 1534: Empty Site with  98x0,   0x? of 98 sites
 1543: Ascertainment Bias Site
 1555: Ascertainment Bias Site
 1580: Ascertainment Bias Site
 1595: Ascertainment Bias Site
 1605: Ascertainment Bias Site
 1624: Ascertainment Bias Site
 1634: Ascertainment Bias Site
 1643: Ascertainment Bias Site
 1647: Empty Site with  98x0,   0x? of 98 sites
 1657: Ascertainment Bias Site
 1679: Ascertainment Bias Site
 1696: Ascertainment Bias Site
 1709: Ascertainment Bias Site
 1723: Ascertainment Bias Site
 1734: Ascertainment Bias Site
 1742: Ascertainment Bias Site
 1751: Ascertainment Bias Site
 1763: Ascertainment Bias Site
 1775: Ascertainment Bias Site
 1791: Ascertainment Bias Site
 1805: Ascertainment Bias Site
 1819: Ascertainment Bias Site
 1837: Ascertainment Bias Site
 1847: Ascertainment Bias Site
 1859: Ascertainment Bias Site
 1873: Ascertainment Bias Site
 1886: Ascertainment Bias Site
 1902: Ascertainment Bias Site
 1915: Ascertainment Bias Site
 1923: Ascertainment Bias Site
 1936: Ascertainment Bias Site
 1942: Ascertainment Bias Site
 1951: Ascertainment Bias Site
 1966: Ascertainment Bias Site
 1973: Ascertainment Bias Site
 2000: Ascertainment Bias Site
 2014: Ascertainment Bias Site
 2035: Ascertainment Bias Site
 2045: Ascertainment Bias Site
 2053: Ascertainment Bias Site
 2067: Ascertainment Bias Site
 2092: Ascertainment Bias Site
 2104: Ascertainment Bias Site
 2107: Empty Site with  98x0,   0x? of 98 sites
 2117: Ascertainment Bias Site
 2134: Ascertainment Bias Site
 2137: Ascertainment Bias Site
 2155: Empty Site with  98x0,   0x? of 98 sites
 2167: Ascertainment Bias Site
 2193: Ascertainment Bias Site
 2201: Ascertainment Bias Site
 2214: Ascertainment Bias Site
 2227: Ascertainment Bias Site
 2240: Ascertainment Bias Site
 2250: Empty Site with  98x0,   0x? of 98 sites
 2253: Ascertainment Bias Site
 2272: Ascertainment Bias Site
 2286: Ascertainment Bias Site
 2292: Empty Site with  98x0,   0x? of 98 sites
 2293: Empty Site with  98x0,   0x? of 98 sites
 2298: Ascertainment Bias Site
 2317: Ascertainment Bias Site
 2336: Ascertainment Bias Site
 2344: Ascertainment Bias Site
 2354: Ascertainment Bias Site
 2363: Ascertainment Bias Site
 2370: Ascertainment Bias Site
 2387: Ascertainment Bias Site
 2404: Ascertainment Bias Site
 2415: Ascertainment Bias Site
 2437: Ascertainment Bias Site
 2449: Ascertainment Bias Site
 2456: Ascertainment Bias Site
 2468: Empty Site with  98x0,   0x? of 98 sites
 2485: Ascertainment Bias Site
 2489: Ascertainment Bias Site
 2495: Empty Site with  98x0,   0x? of 98 sites
 2502: Empty Site with  98x0,   0x? of 98 sites
 2505: Ascertainment Bias Site
 2519: Ascertainment Bias Site
 2527: Ascertainment Bias Site
 2541: Ascertainment Bias Site
 2553: Ascertainment Bias Site
 2557: Empty Site with  98x0,   0x? of 98 sites
 2577: Ascertainment Bias Site
 2593: Ascertainment Bias Site
 2605: Ascertainment Bias Site
 2615: Ascertainment Bias Site
 2624: Ascertainment Bias Site
 2625: Empty Site with  98x0,   0x? of 98 sites
 2654: Ascertainment Bias Site
 2677: Ascertainment Bias Site
 2690: Ascertainment Bias Site
 2696: Empty Site with  98x0,   0x? of 98 sites
 2705: Ascertainment Bias Site
 2718: Ascertainment Bias Site
 2751: Ascertainment Bias Site
 2757: Ascertainment Bias Site
 2767: Ascertainment Bias Site
 2777: Ascertainment Bias Site
 2791: Ascertainment Bias Site
 2809: Ascertainment Bias Site
 2818: Ascertainment Bias Site
 2831: Ascertainment Bias Site
 2838: Empty Site with  98x0,   0x? of 98 sites
 2844: Ascertainment Bias Site
 2849: Ascertainment Bias Site
 2870: Ascertainment Bias Site
 2880: Ascertainment Bias Site
 2893: Ascertainment Bias Site
 2911: Ascertainment Bias Site
 2919: Ascertainment Bias Site
 2941: Ascertainment Bias Site
 2957: Ascertainment Bias Site
 2966: Ascertainment Bias Site
 2976: Ascertainment Bias Site
 2999: Ascertainment Bias Site
 3010: Ascertainment Bias Site
 3028: Ascertainment Bias Site
 3041: Ascertainment Bias Site
 3045: Empty Site with  98x0,   0x? of 98 sites
 3058: Ascertainment Bias Site
 3077: Ascertainment Bias Site
 3092: Ascertainment Bias Site
 3115: Ascertainment Bias Site
 3123: Empty Site with  98x0,   0x? of 98 sites
 3130: Ascertainment Bias Site
 3137: Empty Site with  98x0,   0x? of 98 sites
 3143: Ascertainment Bias Site
 3169: Ascertainment Bias Site
 3176: Ascertainment Bias Site
 3186: Ascertainment Bias Site
 3189: Empty Site with  98x0,   0x? of 98 sites
 3206: Ascertainment Bias Site
 3226: Ascertainment Bias Site
 3238: Ascertainment Bias Site
 3257: Ascertainment Bias Site
 3274: Ascertainment Bias Site
 3294: Ascertainment Bias Site
 3303: Ascertainment Bias Site
 3319: Ascertainment Bias Site
 3343: Ascertainment Bias Site
 3358: Ascertainment Bias Site
 3359: Empty Site with  98x0,   0x? of 98 sites
 3375: Ascertainment Bias Site
 3381: Ascertainment Bias Site
 3388: Ascertainment Bias Site
 3408: Ascertainment Bias Site
 3427: Ascertainment Bias Site
 3436: Ascertainment Bias Site
Alignment has 31/3447 problematic sites
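
A check of this kind can be sketched as follows. This is not the script that produced the listing above, only an illustration on made-up data: it assumes the alignment is available as a dict from taxon to a 0/1 string and that the ascertainment sites are known as 1-based numbers.

```python
def classify_empty_sites(alignment, ascertainment):
    """Split the all-zero sites into (tagged, problematic), both 1-based.

    alignment: dict mapping taxon -> string over '0', '1', '?'
    ascertainment: set of 1-based sites declared for ascertainment correction
    """
    rows = list(alignment.values())
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all sequences must have the same length")
    tagged, problematic = [], []
    for index in range(length):
        if all(row[index] == "0" for row in rows):
            site = index + 1
            if site in ascertainment:
                tagged.append(site)
            else:
                problematic.append(site)
    return tagged, problematic


# Toy data: sites 1 and 4 open a partition, site 6 is empty but not tagged.
toy = {
    "A": "0110100",
    "B": "0100001",
    "C": "0010101",
}
tagged, problematic = classify_empty_sites(toy, ascertainment={1, 4})
print(f"Alignment has {len(problematic)}/{len(toy['A'])} problematic sites")
```
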
SimonGreenhill commented 3 years ago

This looks very much like what happens when someone deletes a language from the analysis, which is why we had to write a correction to the Bouckaert et al. Indo-European paper. i.e. removing a language removes any cognate sets that only belong to that language.

This was a problem for Indo-European because we deleted 13/116 languages, leading to 283 empty sites, and this affected the root age, moving it from 8,466 (7,116-10,410) to 7,579 (5,972-9,351) years. Now, Robbeets et al. have fewer sites in this category here, but it suggests that the inferred age could be inflated a bit.

If we add the deleted Korean back in to the XML, then we get some back:

Alignment has 26/3447 problematic sites

i.e. the following have become ok: 3189, 2838, 1103, 1052, 995, still leaving quite a number of all zero sites.

So there is at least one more language, or perhaps 3 or 4, in the nexus or the XML that got deleted.
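
The mechanism can be illustrated with a toy alignment (invented data and hypothetical function names): a cognate set attested in a single language turns into an all-zero column as soon as that language is dropped.

```python
def empty_columns(alignment):
    """Return the set of 1-based sites where every taxon has state '0'."""
    rows = list(alignment.values())
    return {
        index + 1
        for index in range(len(rows[0]))
        if all(row[index] == "0" for row in rows)
    }


def emptied_by_removal(alignment, taxon):
    """Sites that become all-zero only once `taxon` is removed."""
    reduced = {name: seq for name, seq in alignment.items() if name != taxon}
    return sorted(empty_columns(reduced) - empty_columns(alignment))


# Site 1 is an ascertainment column; site 3 is attested only in Lang3,
# site 4 only in Lang1.
toy = {
    "Lang1": "0101",
    "Lang2": "0100",
    "Lang3": "0010",
}
print(emptied_by_removal(toy, "Lang3"))
```

Running the same comparison in the other direction (adding a candidate language back and seeing which empty sites disappear) is the test described above for Korean.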

SimonGreenhill commented 3 years ago

If @chrzyki is running these analyses, then I can easily cull the empty sites and send him an XML to run to see if it makes a difference.

chrzyki commented 3 years ago

Sounds good! I've created a number of different variations of the XML (together with Russell) and some have finished, some are still running (e.g. particles 4/10). Happy to run another version as well. :)

tpellard commented 3 years ago

@LinguList There are 3193 rows (= cognate sets) in 16_Eurasia3angle_synthesis_SI 1_BV 254.xls. In Edictor, the following cogids are missing: 109 709 1356 1394 1413 1518 1944 2077 2116 2117 2279 2304 2311 2361 2424 2492 2816 2890 2903 3111. I've checked the .xls file: the rows missing from Edictor are those that have only empty values in all language columns.
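
A check like this can be automated on a CSV export of the spreadsheet. The sketch below uses invented data; the column names COGID and CONCEPT, the language names and the placeholder forms are all assumptions, not the layout of the actual file.

```python
import csv
import io

# Made-up excerpt with placeholder forms.
SAMPLE = """COGID,CONCEPT,LangA,LangB,LangC
1,go,aaa,bbb,
2,go,,,
3,hand,ccc,ddd,eee
"""


def empty_rows(handle, meta_columns=("COGID", "CONCEPT")):
    """Return the COGIDs of rows whose language columns are all empty."""
    missing = []
    for row in csv.DictReader(handle):
        forms = [value for key, value in row.items() if key not in meta_columns]
        if not any((value or "").strip() for value in forms):
            missing.append(row["COGID"])
    return missing


print(empty_rows(io.StringIO(SAMPLE)))
```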

tpellard commented 3 years ago

There are also 7 empty columns in 16_Eurasia3angle_synthesis_SI 1_BV 254.xls between Hachijo and Eastern Evenki. @LinguList, have you noticed that the final 4 columns have a fused cell for their language name? There are 2 columns under the heading "Eastern Evenki" and 2 under "Southern Evenki (Vershina-Tutury, Baikal)".

LinguList commented 3 years ago

Ouch. That is terrible. Thanks for checking! It means we need to adjust our procedure of conversion, or at least check it.

And if all cells for a row are empty, I excluded that row, of course, which shows yet another inconsistency in the original data. Since we only show existing words, we could by no means show non-existing ones. But we could extend our procedure to report automatically which rows are entirely empty.

tpellard commented 3 years ago

Some borrowings are not marked as "FORM_bor" but as "FORM bor". It seems that they are nevertheless encoded as "1" in the BEAST xml files (I manually checked one example, but I don't know how to do that easily for others).
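
Both spellings of the marker can be caught with one pattern. A minimal sketch, assuming the marker always comes at the end of the cell and is separated from the form by either an underscore or a space (the function name is hypothetical):

```python
import re

# "_bor" or " bor" at the very end of the cell.
BORROWING = re.compile(r"[_ ]bor$")


def split_borrowing(cell):
    """Return (form, is_borrowed) for a raw spreadsheet cell."""
    cell = cell.strip()
    match = BORROWING.search(cell)
    if match:
        return cell[: match.start()].strip(), True
    return cell, False


for raw in ["FORM_bor", "FORM bor", "FORM", "labor"]:
    print(raw, "->", split_borrowing(raw))
```

A form that merely ends in the letters "bor" is left alone, because the pattern requires the separator. The resulting flag can then be compared with the 0/1 state of the same language at the same site in the XML.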

LinguList commented 3 years ago

@tpellard, I have converted the data to CSV before reading the data in. The empty columns are accounted for in my code. The strangely merged columns are also accounted for: The second column is displayed as headerless column in the CSV, and I only read in columns with a language. The CSV is online here.

The code seems to do what it is expected to do, all we'd need to add would be a statement to warn if a whole row is completely empty, and maybe an indicator on whether they annotate something as a potential borrowing.

The following are the relevant lines in the iteration.

https://github.com/lexibank/robbeetsaltaic/blob/1096cdcc1ee4d63c320947a78c1de558891cf20e/lexibank_robbeetsaltaic.py#L80-L97

tpellard commented 3 years ago

I think that coding obvious borrowings between the languages under study as cognates is problematic since it introduces noise in the phylogenetic signal and discards important information. What do you think @SimonGreenhill and @RustyGray ? How are such cases usually treated in phylogenetic analyses?

RustyGray commented 3 years ago

Hi, well, one way of treating known loans is to just exclude them. Another is to count them only after they get transmitted in a subgroup, e.g. for deep loans the initial borrowing doesn't count, but subsequently they become cognates. We have a lengthy discussion of possible coding practices in the supplement of our IE ms that we hope to submit soon. Cheers, Russell.
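
The first option (exclusion) could be sketched like this on invented data. Whether an excluded loan should be coded as absent ("0") or as missing ("?") is a separate decision, so the state is left as a parameter here; the function and its arguments are hypothetical.

```python
def binary_matrix(entries, languages, exclude_borrowed=True, borrowed_state="0"):
    """Build one 0/1 string per language, one character per cognate set.

    entries: list of (language, cogid, is_borrowed) tuples
    borrowed_state: state written for an excluded loan ("0" or "?")
    """
    cogids = sorted({cogid for _, cogid, _ in entries})
    status = {(lang, cogid): borrowed for lang, cogid, borrowed in entries}
    matrix = {}
    for lang in languages:
        states = []
        for cogid in cogids:
            if (lang, cogid) not in status:
                states.append("0")
            elif status[(lang, cogid)] and exclude_borrowed:
                states.append(borrowed_state)
            else:
                states.append("1")
        matrix[lang] = "".join(states)
    return matrix


# Language B has a loan in cognate set 1.
entries = [("A", 1, False), ("B", 1, True), ("C", 2, False)]
print(binary_matrix(entries, ["A", "B", "C"]))
print(binary_matrix(entries, ["A", "B", "C"], exclude_borrowed=False))
```

The second option (counting a loan as cognate only after it is inherited within a subgroup) needs information about where the borrowing happened and is not covered by this sketch.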

LinguList commented 3 years ago

Are those borrowings singletons, i.e. "cognate sets" of size 1 that pertain to one word form alone? If so, they excluded them by placing each into its own cognate set, which does not really do anything for the subgrouping. If not, the coding is problematic, specifically when a borrowing recurs in more than one family.

rgyalrong commented 3 years ago

Quite a number of these borrowings involve two or three families, for instance Mongolic borrowings into some Turkic and Tungusic languages.

-- Guillaume Jacques

tpellard commented 3 years ago

For instance in COGID #2412 'count (v.)' all Tungusic and Turkic forms are marked as borrowings, leaving only Mongolic forms in the cognate set.
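
Cases like this one can be listed automatically once every form carries a borrowing flag. A sketch on toy data: the entry for 2412 mirrors the pattern just described, cognate set 7 is invented, and the tuple layout (cogid, family, is_borrowed) is an assumption.

```python
from collections import defaultdict


def borrowing_summary(entries):
    """Group families per cognate set by whether their forms are marked as loans."""
    stats = defaultdict(lambda: {"inherited": set(), "borrowed": set()})
    for cogid, family, is_borrowed in entries:
        key = "borrowed" if is_borrowed else "inherited"
        stats[cogid][key].add(family)
    return dict(stats)


def cross_family_loans(entries):
    """Cognate sets where marked loans bring in a family that has no unmarked form."""
    summary = borrowing_summary(entries)
    return sorted(
        cogid
        for cogid, groups in summary.items()
        if groups["borrowed"] - groups["inherited"]
    )


entries = [
    (2412, "Mongolic", False),
    (2412, "Turkic", True),
    (2412, "Tungusic", True),
    (7, "Turkic", False),
    (7, "Turkic", False),
]
print(cross_family_loans(entries))
```

If such sets are coded as "1" across all three families, the shared state reflects contact rather than inheritance, which is the concern raised above.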