lmaurits / BEASTling

A linguistics-focussed command line tool for generating BEAST XML files.
BSD 2-Clause "Simplified" License

How to specify non-standard taxon sets? #168

Anaphory closed this issue 6 years ago

Anaphory commented 6 years ago

In principle, we can build an XML that leads to the ancestral reconstruction of any MRCA of a set of taxa, or its parent if we want that. Specifications for taxon sets might also be useful for monophyly specifications (#151/#146), geographical models (#150/#153), and calibration dates.

What do we want the specification to look like on the beastling.conf side? #153 lists some ideas.

[clades]
foo_clade = glot0001, glot0002, glot0003
bar_clade = glot0004, glot0005, glot0006, glot0007, glot0008
[languages]
clade_definitions = /path/to/file

Anaphory commented 6 years ago

What about taxon sets that are glottolog clades? Can they be used in any case? What if they are also tip names? What if tip names are not in Glottolog? etc.

Anaphory commented 6 years ago

Somehow I feel uneasy about unstructured property names in a new section, but in principle I like that syntax. But we do have multi-line properties, so what about the following?

[languages]
taxonsets = foo_clade = glot0001, glot0002, glot0003
    bar_clade = glot0004, glot0005, glot0006, glot0007, glot0008
    glottolog
    /path/to/file

Formally: taxonsets (they are not necessarily clades, just names for collections of tips, which can then be used as clade specifications, but also for other things) is a multi-line value (every line is indented deeper than the taxonsets = … line). As in the example above, every line is either a named list of taxa (name = taxon, taxon, …), the keyword glottolog, or a path to a file.

Every single taxon is a taxonset of size 1. This has three consequences: if we introduce a way to specify parent-reconstruction, a single taxon serves as “the parent of this tip” without additional logic; the function parsing lines into taxonsets does not need special handling for nested taxonsets; and we show BEAST, which specifies that every TaxonSet is a Taxon, what the logic actually is.
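The line-by-line reading described above can be sketched roughly as follows. This is only an illustration, not BEASTling code: the function name `parse_group_lines` is hypothetical, and treating every line without an equals sign as an external source (such as `glottolog` or a file path) is an assumption drawn from the example.

```python
def parse_group_lines(value):
    """Split a multi-line 'taxonsets' value into named groups and sources.

    Hypothetical helper: lines containing '=' are read as
    'name = member, member, ...'; all other non-empty lines are
    collected as external sources (an assumption, see lead-in).
    """
    groups = {}
    sources = []
    for line in value.splitlines():
        line = line.strip()
        if not line:
            continue
        if "=" in line:
            # Split on the first '=' only, so the name stays intact.
            name, members = line.split("=", 1)
            groups[name.strip()] = [
                m.strip() for m in members.split(",") if m.strip()
            ]
        else:
            sources.append(line)
    return groups, sources


EXAMPLE = """foo_clade = glot0001, glot0002, glot0003
bar_clade = glot0004, glot0005, glot0006, glot0007, glot0008
glottolog
/path/to/file"""

groups, sources = parse_group_lines(EXAMPLE)
print(groups["foo_clade"])  # ['glot0001', 'glot0002', 'glot0003']
print(sources)              # ['glottolog', '/path/to/file']
```

Because members are kept as plain names, a member may itself be another group name or a single tip; nothing in this step needs to distinguish the two.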

Anaphory commented 6 years ago

Taxonsets is a very biological term. That can change, but I don't want it to imply these are already monophyletic.

lmaurits commented 6 years ago

Does the Python standard library's ConfigParser permit you to have equals signs in the value of a parameter?

When you say "we do have multi-line properties", do you mean that they are supported by the parser or that BEASTling is already making use of them?
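Both questions can be checked quickly against the standard library's `configparser` module. The snippet below assumes that this is the parser in question; whether BEASTling's own configuration layer behaves identically is not confirmed here. The configuration text is the proposal from earlier in the thread, shortened.

```python
import configparser

CONF = """\
[languages]
taxonsets = foo_clade = glot0001, glot0002, glot0003
    bar_clade = glot0004, glot0005
    glottolog
    /path/to/file
"""

parser = configparser.ConfigParser()
parser.read_string(CONF)

# The option name ends at the first '='; later '=' signs stay in the value.
# Lines indented deeper than the option are joined on as continuation lines.
value = parser["languages"]["taxonsets"]
lines = value.splitlines()
for line in lines:
    print(repr(line))
```

With the default settings, the value comes back as one string with embedded newlines, and each continuation line has its leading indentation stripped.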

Agreed, "taxonsets" is very biological and actually that is something I have tried to avoid and which the BEASTling paper mentioned, so if we can avoid exposing that term to BEASTling users it would be good. Just like we have tried to avoid terms like alignment, sequence, mutation, etc.

Anaphory commented 6 years ago

I'm trying an example out right now.

I meant “The config parser supports it”, not that we have things like that already.

Maybe “groups” or something like that?

xrotwang commented 6 years ago

Glottolog uses multi-line properties (as a way to have lists as options) often, so they are supported by clldutils.inifile.INI: https://github.com/clld/clldutils/blob/68322265af7cc7161141cdea7c3b763a1f1a25db/clldutils/inifile.py#L45-L46

lmaurits commented 6 years ago

Am I the only one who has an instinctive "eww, gross" reaction to the idea of multi-line properties?

xrotwang commented 6 years ago

@lmaurits I guess so :) They are pretty common, I see them in pyramid config files all the time. One shouldn't break the expectation, though, that you turn a multiline property into a list by running v.split() - i.e. items of a list may be separated by newlines, but may also be whitespace separated in the same line.
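To make the `v.split()` expectation concrete, here is a small sketch with made-up values. It shows that whitespace splitting works for a plain list of items, but would tear apart the `name = a, b, c` lines proposed earlier in the thread, which would need line-based splitting instead.

```python
# A plain multi-line list: newlines and spaces are interchangeable.
plain = "glot0001 glot0002\nglot0003"
whitespace_items = plain.split()
print(whitespace_items)  # ['glot0001', 'glot0002', 'glot0003']

# The proposed named-list syntax does not survive whitespace splitting.
named = "foo_clade = glot0001, glot0002\nbar_clade = glot0004"
broken = named.split()
print(broken)
# ['foo_clade', '=', 'glot0001,', 'glot0002', 'bar_clade', '=', 'glot0004']

# Splitting on newlines keeps each definition together.
per_line = named.splitlines()
print(per_line)  # ['foo_clade = glot0001, glot0002', 'bar_clade = glot0004']
```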

lmaurits commented 6 years ago

> What about taxon sets that are glottolog clades? Can they be used in any case? What if they are also tip names? What if tip names are not in Glottolog? etc.

Taxonsets corresponding to the leaves of Glottolog clades should just work anywhere right now, i.e. for calibrations, geographic sampling or prior specification, etc. If the tip names are not Glottolog nothing will work, so this functionality is basically only of value for folks using non-Glottolog tip names. Which I hope is only people whose taxa aren't in Glottolog, e.g. dialectologists.

Anaphory commented 6 years ago

@lmaurits I don't like them, but I consider them the lesser evil compared to introducing a new section with “anything goes” property names and bigger magic and other sections necessary to include external sources (glottolog, files).

@xrotwang Beastling generally assumes that lists are comma-separated. What we are thinking about here is a list-of-(named-or-external-)lists, which is a data type we haven't used yet.

@lmaurits Or – my use case here – people who want to investigate the truth of a particular glottolog clade. Which is why I need language groups, not clades – I'm investigating whether a thing is paraphyletic or not.

lmaurits commented 6 years ago

> Somehow I feel uneasy about unstructured property names in a new section

Do you know which part makes you uneasy? We already have unstructured property names in [calibrations] and [geo_priors]. Is it the introduction of a new section which seems to contain only content which logically belongs in [languages], purely for the sake of nicer syntax?

Anaphory commented 6 years ago

Quite likely! I guess I can get over it, considering it's only that.

Anaphory commented 6 years ago

We could think of doing things like the following:

[languages]
language_group_sources = glottolog-macroareas, glottolog-clades, path/language_metadata/universal_groups.ini, path/Wordlist-metadata.json:LanguageTable:Region
[language_groups]
on-papua = kbt, abg
macronesian = on-papua, mala1545

What I mean by glottolog-clades could be the default for language_group_sources, and then all of MRCA-ASR, geo_priors, calibrations, monophyletic restrictions, etc. could use the same back-end for assembling the taxon sets they operate with, without changing the fact that the generic case is “use glottolog clades”?
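A shared back-end of this kind would need to expand nested groups such as `macronesian = on-papua, mala1545` into tips. A minimal sketch, with assumptions flagged: `resolve_group` is a hypothetical name, and any member that is not a defined group is treated as a single tip. A real implementation would presumably also consult the configured sources (for example, expanding a Glottolog clade code), which is left out here.

```python
def resolve_group(name, groups, _path=frozenset()):
    """Expand a group name into a set of tip names.

    Hypothetical helper. Names not defined in `groups` are treated as
    single tips (an assumption); circular definitions raise ValueError.
    """
    if name not in groups:
        return {name}
    if name in _path:
        raise ValueError(f"circular group definition involving {name!r}")
    tips = set()
    for member in groups[name]:
        tips |= resolve_group(member, groups, _path | {name})
    return tips


# Group definitions taken from the [language_groups] example above.
groups = {
    "on-papua": ["kbt", "abg"],
    "macronesian": ["on-papua", "mala1545"],
}

print(sorted(resolve_group("macronesian", groups)))
# ['abg', 'kbt', 'mala1545']
```

Treating an undefined name as a one-element set mirrors the earlier suggestion in this thread that every single taxon is a group of size 1, so nested and flat definitions go through the same code path.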

(That unification would be a thing for 1.5/2.0 depending on whether it can be made backwards-compatible or not.)

Anaphory commented 6 years ago

The pull request #171 needs a merge conflict resolved. Did anyone have other issues with it?

Anaphory commented 6 years ago

> (That unification would be a thing for 1.5/2.0 depending on whether it can be made backwards-compatible or not.)

That is the main reason this is not closed, btw., but it could be made a separate issue if we want.

Anaphory commented 6 years ago

I did that now as #181.