phlorest / birchall_et_al2016

Phlorest phylogeny derived from Birchall et al. 2016 'A combined comparative and phylogenetic analysis of the Chapacuran language family'
https://doi.org/10.1086/687383
Creative Commons Attribution 4.0 International

Nexus data file #1

Closed xrotwang closed 2 years ago

xrotwang commented 3 years ago

I'm struggling with understanding the usefulness of the Nexus data file.

As a record of the exact data that was used in the analysis, it should be part of the supplemental materials of the paper - I guess. If it isn't there, we could keep it here in raw/.

But to re-run the analysis, one could also just re-create a Nexus data file from the lexibank dataset - correct?

xrotwang commented 3 years ago

Btw.: There's a slight difference between the lexibank data and the Nexus data file, see More (135 vs. 134), Tapakura (84 vs. 85) and Kitemoka (79 vs. 80):

$ csvstat --freq-count 10 -c Language_ID lexibank-analysed/raw/birchallchapacuran/cldf/forms.csv 
  3. "Language_ID"

    Type of data:          Text
    Contains null values:  False
    Unique values:         10
    Longest value:         8 characters
    Most common values:    wari (141x)
                           more (135x)
                           orowin (135x)
                           wanyam (117x)
                           cojubim (103x)
                           tora (98x)
                           tapakura (84x)
                           jaru (84x)
                           kitemoka (79x)
                           urupa (61x)

Row count: 1037

vs.

[Taxon Diagnostics:                                                    ]
[Cojubim = 103                                                         ]
[Jaru = 84                                                             ]
[Kitemoka = 80                                                         ]
[More = 134                                                            ]
[OroWin = 135                                                          ]
[Tapakura = 85                                                         ]
[Tora = 98                                                             ]
[Urupa = 61                                                            ]
[Wanyam = 117                                                          ]
[Wari = 141                                                            ]
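
The comparison of the two listings can also be done mechanically. The following is only a minimal sketch: the counts are copied by hand from the csvstat output and the "Taxon Diagnostics" comment above, and the helper `parse_diagnostics` is a hypothetical name, not part of any existing package.

```python
import re

# Forms per Language_ID, copied from the csvstat output above.
lexibank_counts = {
    "wari": 141, "more": 135, "orowin": 135, "wanyam": 117, "cojubim": 103,
    "tora": 98, "tapakura": 84, "jaru": 84, "kitemoka": 79, "urupa": 61,
}

# The "Taxon Diagnostics" comment block, copied from the Nexus data file.
diagnostics = """
[Taxon Diagnostics: ]
[Cojubim = 103 ]
[Jaru = 84 ]
[Kitemoka = 80 ]
[More = 134 ]
[OroWin = 135 ]
[Tapakura = 85 ]
[Tora = 98 ]
[Urupa = 61 ]
[Wanyam = 117 ]
[Wari = 141 ]
"""


def parse_diagnostics(text):
    """Return {lowercased taxon: count} from '[Name = N ]' comment lines."""
    pattern = re.compile(r"\[\s*(\w+)\s*=\s*(\d+)\s*\]")
    return {name.lower(): int(count) for name, count in pattern.findall(text)}


nexus_counts = parse_diagnostics(diagnostics)

# Map each taxon whose counts disagree to (lexibank count, Nexus count).
mismatches = {
    taxon: (lexibank_counts[taxon], nexus_counts.get(taxon))
    for taxon in sorted(lexibank_counts)
    if lexibank_counts[taxon] != nexus_counts.get(taxon)
}
print(mismatches)
```

Taxon names are lowercased before comparing because the Nexus file uses labels such as OroWin, while the lexibank Language_ID values are lowercase.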

SimonGreenhill commented 3 years ago

Hmm, nexus is one of the 2 or 3 main phylogenetic file formats. With data.nex and the summary/posterior trees, I can load the data or trees directly into any other analysis program that's out there (SplitsTree would be a common option, but there are many others). Having to regenerate the nexus file would be an extra, annoying hurdle.

Second, it's not the case that the datasets we have contain enough information to (re)generate the nexus file. There are specific decisions about e.g. how to code cognates that will not be evident in the cldf/lexibank datasets (even if we have that information -- this repo is an outlier that has pretty much everything, others have lots of gaps in the chain). Your question here about the taxon count differences is, I think, the result of this coding mismatch -- perhaps unique cognates were not included in the nexus (and hence the phylogenies).

xrotwang commented 3 years ago

Ok, I see. Ideally, I would have hoped that at some point there'd be a package implementing various "binarization" strategies for CLDF data - "specific decisions about how to code cognates" - filling the gap between lexibank data and data.nex. And considering that the actual forms are still missing in data.nex, adding a CLDF Wordlist created from data.nex doesn't seem to make much sense.
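
To make "binarization strategy" concrete, here is a minimal sketch of one such strategy, not the coding actually used for this dataset. The function name `binarize`, the `drop_singletons` option and the toy data are all assumptions for illustration. Each cognate set becomes one presence/absence character, a language without any form for a concept gets `?`, and singleton sets can optionally be dropped.

```python
from collections import defaultdict


def binarize(cognates, drop_singletons=False):
    """Turn (language, concept, cognate_set_id) triples into a binary matrix.

    Returns (characters, matrix), where characters is a sorted list of
    (concept, cognate_set_id) pairs and matrix maps language -> "01?" string.
    """
    members = defaultdict(set)   # (concept, cogid) -> languages in that set
    attested = defaultdict(set)  # language -> concepts it has any form for
    for language, concept, cogid in cognates:
        members[(concept, cogid)].add(language)
        attested[language].add(concept)

    # One character per cognate set; optionally skip sets with one member.
    characters = sorted(
        key for key, langs in members.items()
        if not (drop_singletons and len(langs) == 1)
    )

    matrix = {}
    for language in sorted(attested):
        row = []
        for concept, cogid in characters:
            if language in members[(concept, cogid)]:
                row.append("1")   # language has a form in this cognate set
            elif concept in attested[language]:
                row.append("0")   # concept attested, but in another set
            else:
                row.append("?")   # no form for this concept at all
        matrix[language] = "".join(row)
    return characters, matrix


# Hypothetical toy data, not taken from the Chapacuran dataset.
toy = [
    ("A", "hand", 1), ("B", "hand", 1), ("C", "hand", 2),
    ("A", "eye", 3), ("B", "eye", 3),
]
all_chars, all_matrix = binarize(toy)
kept_chars, kept_matrix = binarize(toy, drop_singletons=True)
print(all_matrix)
print(kept_matrix)
```

With `drop_singletons=True`, language C loses its only `1`, which is the kind of per-taxon count shift that a coding decision like this can produce.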

So, I guess I'm ok with adding data.nex to the "official" data of the phlorest phylogeny - i.e. add it to cldf/ - but would want to do some validation, e.g. at least read the file with NexusReader and make sure languages are specified with LanguageTable.ID everywhere.
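
The kind of check meant here could look roughly like the following. This sketch deliberately uses a small stdlib-only stand-in instead of NexusReader so that it runs without extra packages. It assumes a simple, non-interleaved MATRIX without comments inside it, and `matrix_taxa`, `unknown_taxa` and the sample file content are hypothetical.

```python
import re


def matrix_taxa(nexus_text):
    """Return the taxon labels found in the MATRIX of a simple Nexus file."""
    match = re.search(
        r"\bmatrix\b(.*?);", nexus_text, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return []
    taxa = []
    for line in match.group(1).splitlines():
        parts = line.split()
        if len(parts) == 2:  # "<taxon label> <character states>"
            taxa.append(parts[0])
    return taxa


def unknown_taxa(nexus_text, language_ids):
    """Return matrix labels that are not valid LanguageTable IDs."""
    return sorted(set(matrix_taxa(nexus_text)) - set(language_ids))


# Hypothetical file content and IDs, for illustration only.
sample = """#NEXUS
begin data;
  dimensions ntax=3 nchar=4;
  format datatype=standard gap=- missing=? symbols="01";
  matrix
  wari    0101
  more    1?01
  OroWin  0011
  ;
end;
"""
language_ids = {"wari", "more", "orowin"}
print(unknown_taxa(sample, language_ids))
```

The comparison is case-sensitive on purpose, so a label like OroWin is reported when the LanguageTable ID is orowin.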

xrotwang commented 3 years ago

So what I'd do is

This would make sure

Looking at the diff, this seems like a reasonable approach:

$ diff -i -w test.nex raw/Chapacuran_Swadesh207-2019-labelled.nex 
0a1
> 
2a4,14
> [Taxon Diagnostics:                                                    ]
> [Cojubim = 103                                                         ]
> [Jaru = 84                                                             ]
> [Kitemoka = 80                                                         ]
> [More = 134                                                            ]
> [OroWin = 135                                                          ]
> [Tapakura = 85                                                         ]
> [Tora = 98                                                             ]
> [Urupa = 61                                                            ]
> [Wanyam = 117                                                          ]
> [Wari = 141                                                            ]
5c17
<   format datatype=STANDARD gap=- missing=? symbols="01";
---
>     FORMAT DATATYPE=STANDARD MISSING=? GAP=-  SYMBOLS="01";
305a318,319
> 
> 

SimonGreenhill commented 3 years ago

ok, sounds good. that way the format will be standardised. We will however lose the comments \[.*\] which python-nexus doesn't handle (because I couldn't figure out a good way to do that), and we might come across nexus blocks that python-nexus doesn't handle (e.g. distances blocks are not uncommon but I can't spot any in our collection)
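
For reference, dropping the comments amounts to something like the sketch below. It is not how python-nexus works internally. The function name `strip_comments` is an assumption, and the pattern only handles non-nested `[...]` comments.

```python
import re


def strip_comments(nexus_text):
    """Remove non-nested [ ... ] comments and drop lines left empty."""
    without = re.sub(r"\[[^\[\]]*\]", "", nexus_text)
    return "\n".join(line for line in without.splitlines() if line.strip())


raw = (
    "#NEXUS\n"
    "[Taxon Diagnostics: ]\n"
    "[Cojubim = 103 ]\n"
    "begin data;\n"
    "end;\n"
)
print(strip_comments(raw))
```

Note that tree files also use square brackets for annotations such as `[&R]`, so a blanket removal like this is only reasonable for the data file.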

SimonGreenhill commented 3 years ago

> Ok, I see. Ideally, I would have hoped that at some point there'd be a package implementing various "binarization" strategies for CLDF data - "specific decisions about how to code cognates" - filling the gap between lexibank data and data.nex.

yep -- I've talked a bit about this with @LinguList, and it's on my long list of things to do :)

xrotwang commented 3 years ago

If we lose stuff that python-nexus can't handle, that seems somewhat intended - keeping with the CLDF philosophy that everything in CLDF must have specified analytic uses (i.e. at least well-known ways to handle it). In such cases we'd either extend python-nexus, or be ok with losing stuff.

xrotwang commented 3 years ago

And while the cldfbench setup isn't as compact/portable as the Makefile, I think it

import gzip
import pathlib
import shutil

import nexus
import phlorest


class Dataset(phlorest.Dataset):
    dir = pathlib.Path(__file__).parent
    id = "birchall_et_al2016"

    def cmd_makecldf(self, args):
        self.add_schema(args)
        shutil.copy(self.raw_dir / 'source.bib', self.cldf_dir / 'sources.bib')
        lids = self.add_taxa(args)

        with phlorest.NexusFile(self.cldf_dir / 'summary.nex') as nex:
            f = nexus.NexusReader(self.raw_dir / 'relaxed-binary-simple.time.mcct.trees')
            f.trees.detranslate()
            assert len(f.trees.trees) == 1
            self.add_tree(args, f.trees.trees[0], nex, 'summary', 'summary', 'Birchall_et_al2016')

        with gzip.open(self.raw_dir / 'relaxed-binary-simple.time.trees.gz') as f:
            posterior = nexus.NexusReader.from_string(
                self.sample(self.remove_burnin(f.read().decode('utf8'), 10000), detranslate=True))

        with phlorest.NexusFile(self.cldf_dir / 'posterior.nex') as nex:
            for i, tree in enumerate(posterior.trees.trees, start=1):
                self.add_tree(
                    args, tree, nex, 'posterior-{}'.format(i), 'sample', 'Birchall_et_al2016')

        self.add_data(args, self.raw_dir / 'Chapacuran_Swadesh207-2019-labelled.nex')
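
The burn-in and sampling step in the posterior handling can be pictured as follows. This is only a sketch of the general idea, not the phlorest implementation: it works on a plain list of trees rather than on Nexus text. Whether the 10000 above counts trees or MCMC states is not stated here. The sample size and seed are assumptions.

```python
import random


def remove_burnin(trees, burnin):
    """Drop the first `burnin` trees of a posterior sample."""
    return trees[burnin:]


def sample(trees, n, seed=12345):
    """Return n trees drawn without replacement, keeping their original order.

    A fixed seed keeps the selection reproducible across runs.
    """
    if len(trees) <= n:
        return list(trees)
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(trees)), n))
    return [trees[i] for i in keep]


# Hypothetical stand-ins for Newick strings.
trees = ["tree{}".format(i) for i in range(100)]
posterior = remove_burnin(trees, 10)
subset = sample(posterior, 5)
print(subset)
```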

SimonGreenhill commented 3 years ago

sounds good (see my comments on the other issue, as I think a self.add_tree method would be useful).

xrotwang commented 3 years ago

Ok, we already have add_tree, but I'd see two options:

I'd lean towards the latter, because we'd have to have a detranslate flag, too - which then would only apply to the NexusReader input - which makes the API complex.

xrotwang commented 3 years ago

Oh, and where should the code live? I think a separate pyphlorest package would be more appropriate than making it a pydplace sub-package.

SimonGreenhill commented 2 years ago

did I reply to this? if not: yes, let's keep pyphlorest separate for now