Closed: xrotwang closed this issue 2 years ago
Btw.: There's a slight difference between the lexibank data and the Nexus data file, see More, Tapakura and Kitemoka:
$ csvstat --freq-count 10 -c Language_ID lexibank-analysed/raw/birchallchapacuran/cldf/forms.csv
3. "Language_ID"
Type of data: Text
Contains null values: False
Unique values: 10
Longest value: 8 characters
Most common values: wari (141x)
more (135x)
orowin (135x)
wanyam (117x)
cojubim (103x)
tora (98x)
tapakura (84x)
jaru (84x)
kitemoka (79x)
urupa (61x)
Row count: 1037
vs.
[Taxon Diagnostics: ]
[Cojubim = 103 ]
[Jaru = 84 ]
[Kitemoka = 80 ]
[More = 134 ]
[OroWin = 135 ]
[Tapakura = 85 ]
[Tora = 98 ]
[Urupa = 61 ]
[Wanyam = 117 ]
[Wari = 141 ]
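The mismatch flagged above can be checked mechanically; a minimal Python sketch, with the counts hard-coded from the two listings above:

```python
from collections import Counter

# Per-language form counts from the lexibank forms.csv (csvstat output above).
lexibank = Counter({
    'wari': 141, 'more': 135, 'orowin': 135, 'wanyam': 117, 'cojubim': 103,
    'tora': 98, 'tapakura': 84, 'jaru': 84, 'kitemoka': 79, 'urupa': 61})

# Counts from the nexus file's "Taxon Diagnostics" comment (taxon names lowercased).
nexus = Counter({
    'cojubim': 103, 'jaru': 84, 'kitemoka': 80, 'more': 134, 'orowin': 135,
    'tapakura': 85, 'tora': 98, 'urupa': 61, 'wanyam': 117, 'wari': 141})

# Report every taxon whose two counts disagree.
for taxon in sorted(lexibank):
    if lexibank[taxon] != nexus[taxon]:
        print(taxon, lexibank[taxon], nexus[taxon])
```

This prints exactly the three taxa mentioned at the top: kitemoka, more, and tapakura.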
Hmm, nexus is one of the 2 or 3 main phylogenetic file formats. With data.nex and summary/posterior trees, I can load the data or trees directly into any other analysis program out there (SplitsTree would be a common option, but there are many others). Having to regenerate the nexus file would be an extra, annoying hurdle.
Second, it's not the case that the datasets we have contain enough information to (re)generate the nexus file. There are specific decisions about e.g. how to code cognates that will not be evident in the cldf/lexibank datasets (even if we have that information -- this repo is an outlier that has pretty much everything; others have lots of gaps in the chain). Your question here about the taxon count differences is, I think, the result of this coding mismatch -- perhaps unique cognates not included in the nexus file (and hence the phylogenies).
Ok, I see. Ideally, I would have hoped that at some point there'd be a package implementing various "binarization" strategies for CLDF data - "specific decisions about how to code cognates" - filling the gap between lexibank data and data.nex. And considering that the actual forms are still missing in data.nex, adding a CLDF Wordlist created from data.nex doesn't seem to make much sense.
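For illustration, one common binarization strategy turns each cognate set into one binary character, scored 1 for a language that has a form in that set. The data and set IDs below are made up, and real strategies also have to decide about missing concepts, singleton cognate sets, ascertainment correction, etc.:

```python
cognates = [          # (language, cognate set id) -- invented toy data
    ('wari', 'hand-1'), ('more', 'hand-1'), ('orowin', 'hand-2'),
    ('wari', 'water-1'), ('orowin', 'water-1'),
]
pairs = set(cognates)
languages = sorted({lg for lg, _ in cognates})
charsets = sorted({cs for _, cs in cognates})

# One binary character per cognate set, in a fixed character order.
matrix = {
    lg: ''.join('1' if (lg, cs) in pairs else '0' for cs in charsets)
    for lg in languages
}
for lg, row in matrix.items():
    print(lg, row)
```

Every choice hidden in this sketch (what counts as a character, how absence is scored) is exactly the kind of decision that is not recoverable from the nexus file alone.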
So, I guess I'm ok with adding data.nex to the "official" data of the phlorest phylogeny - i.e. adding it to cldf/ - but I would want to do some validation, e.g. at least read the file with NexusReader and make sure languages are specified with LanguageTable.ID everywhere.

So what I'd do is read data.nex with NexusReader and write it back out with NexusReader.write_to_file. This would make sure data.nex is somewhat standardized (as "written with python-nexus").

Looking at the diff, this seems like a reasonable approach:
$ diff -i -w test.nex raw/Chapacuran_Swadesh207-2019-labelled.nex
0a1
>
2a4,14
> [Taxon Diagnostics: ]
> [Cojubim = 103 ]
> [Jaru = 84 ]
> [Kitemoka = 80 ]
> [More = 134 ]
> [OroWin = 135 ]
> [Tapakura = 85 ]
> [Tora = 98 ]
> [Urupa = 61 ]
> [Wanyam = 117 ]
> [Wari = 141 ]
5c17
< format datatype=STANDARD gap=- missing=? symbols="01";
---
> FORMAT DATATYPE=STANDARD MISSING=? GAP=- SYMBOLS="01";
305a318,319
>
>
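The two kinds of differences in this diff - dropped [...] comments and a re-cased format line - can be illustrated with a rough stdlib sketch (this is not python-nexus itself, just an approximation of the round-trip's effect):

```python
import re

raw = """#NEXUS
[Taxon Diagnostics: ]
[Wari = 141 ]
BEGIN DATA;
FORMAT DATATYPE=STANDARD MISSING=? GAP=- SYMBOLS="01";
END;
"""

# Drop bracketed comments, which the round-trip does not preserve.
no_comments = re.sub(r'\[[^\]]*\]\n?', '', raw)
# The rewritten file uses a lowercased format line; emulate that re-casing.
normalized = re.sub(
    r'(?im)^(format.*)$', lambda m: m.group(1).lower(), no_comments)
print(normalized)
```

The real library also reorders keywords (gap before missing, as in the diff); the sketch only shows that the information content of the data block survives while the comments do not.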
ok, sounds good - that way the format will be standardised. We will, however, lose the comments (\[.*\]), which python-nexus doesn't handle (because I couldn't figure out a good way to do that), and we might come across nexus blocks that python-nexus doesn't handle (e.g. distances blocks are not uncommon, but I can't spot any in our collection).
Ok, I see. Ideally, I would have hoped that at some point there'd be a package implementing various "binarization" strategies for CLDF data - "specific decisions about how to code cognates" - filling the gap between lexibank data and data.nex.
yep -- I've talked a bit about this with @LinguList, and it's on my long list of things to do :)
If we lose stuff that python-nexus can't handle, that seems somewhat intended - in keeping with the CLDF philosophy that everything in CLDF must have specified analytic uses (i.e. at least well-known ways to handle it). In such cases we'd either extend python-nexus, or be ok with losing stuff.
And while the cldfbench setup isn't as compact/portable as the Makefile, I think it works well enough:
import gzip
import pathlib
import shutil

import nexus
import phlorest


class Dataset(phlorest.Dataset):
    dir = pathlib.Path(__file__).parent
    id = "birchall_et_al2016"

    def cmd_makecldf(self, args):
        self.add_schema(args)
        shutil.copy(self.raw_dir / 'source.bib', self.cldf_dir / 'sources.bib')
        lids = self.add_taxa(args)
        # Summary tree: detranslate numeric tip labels, expect exactly one tree.
        with phlorest.NexusFile(self.cldf_dir / 'summary.nex') as nex:
            f = nexus.NexusReader(self.raw_dir / 'relaxed-binary-simple.time.mcct.trees')
            f.trees.detranslate()
            assert len(f.trees.trees) == 1
            self.add_tree(args, f.trees.trees[0], nex, 'summary', 'summary', 'Birchall_et_al2016')
        # Posterior sample: strip burn-in, subsample, then add each tree.
        with gzip.open(self.raw_dir / 'relaxed-binary-simple.time.trees.gz') as f:
            posterior = nexus.NexusReader.from_string(
                self.sample(self.remove_burnin(f.read().decode('utf8'), 10000), detranslate=True))
        with phlorest.NexusFile(self.cldf_dir / 'posterior.nex') as nex:
            for i, tree in enumerate(posterior.trees.trees, start=1):
                self.add_tree(
                    args, tree, nex, 'posterior-{}'.format(i), 'sample', 'Birchall_et_al2016')
        self.add_data(args, self.raw_dir / 'Chapacuran_Swadesh207-2019-labelled.nex')
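For context, f.trees.detranslate() in the code above resolves the numeric tip labels in the trees block to taxon names via the file's TRANSLATE block; roughly (a toy stdlib sketch, not the python-nexus implementation, with made-up taxa and a tree without branch lengths, which a real detranslation would have to leave alone):

```python
import re

# Mapping taken from a hypothetical TRANSLATE block.
translate = {'1': 'Wari', '2': 'More', '3': 'OroWin'}
tree = '((1,2),3);'

# Replace each standalone number with its taxon name, if known.
detranslated = re.sub(
    r'\b(\d+)\b',
    lambda m: translate.get(m.group(1), m.group(1)),
    tree)
print(detranslated)  # ((Wari,More),OroWin);
```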
sounds good (see my comments on the other issue), as I think a self.add_tree method would be useful.
Ok, we already have add_tree, but I'd see two options:
1. have add_tree accept a NexusReader as input for tree, or
2. add a separate add_tree_from_nexus method.
I'd lean towards the latter, because we'd also have to have a detranslate flag - which would only apply to NexusReader input - and that makes the API complex.
Oh, and where should the code live? I think a separate pyphlorest package would be more appropriate than making it a pydplace sub-package.
Did I reply to this? If not: yes, let's keep pyphlorest separate for now.
I'm struggling to understand the usefulness of the Nexus data file. As a record of the exact data that was used in the analysis, it should be part of the supplemental materials of the paper - I guess. If it isn't there, we could keep it here in raw/. But to re-run the analysis, one could also just re-create a Nexus data file from the lexibank dataset - correct?
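For what it's worth, the mechanical part of re-creating a Nexus data file from a binary matrix is simple - the hard part is producing the matrix (the coding decisions discussed in this thread). A minimal sketch with a made-up matrix:

```python
# Invented toy matrix: taxon -> binary character string.
matrix = {'Wari': '101', 'More': '100', 'OroWin': '011'}
nchar = len(next(iter(matrix.values())))

lines = [
    '#NEXUS',
    'BEGIN DATA;',
    '  DIMENSIONS NTAX={} NCHAR={};'.format(len(matrix), nchar),
    '  FORMAT DATATYPE=STANDARD MISSING=? GAP=- SYMBOLS="01";',
    '  MATRIX',
]
for taxon, row in sorted(matrix.items()):
    lines.append('    {} {}'.format(taxon, row))
lines += ['  ;', 'END;']
nex = '\n'.join(lines)
print(nex)
```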