Closed Anaphory closed 6 years ago
What about taxon sets that are glottolog clades? Can they be used in any case? What if they are also tip names? What if tip names are not in Glottolog? etc.
Somehow I feel uneasy about unstructured property names in a new section, but in principle I like that syntax. But we do have multi-line properties, so what about the following?
[languages]
taxonsets = foo_clade = glot0001, glot0002, glot0003
    bar_clade = glot0004, glot0005, glot0006, glot0007, glot0008
    glottolog
    /path/to/file
Formally: taxonsets (they are not necessarily clades, just names for collections of tips, which can then be used as clade specifications, but also for other things) is a multi-line value (every line is indented deeper than the taxonsets = … line). Every line is either
- glottolog, for importing all higher clade names from Glottolog (semantics to be specified), or
- <name> = <name>(, <name>)*
Every single taxon is a taxonset of size 1. This has three consequences: if we introduce a way to specify parent-reconstruction, this serves as “the parent of this tip” without additional logic; the function parsing lines into taxonsets does not need special handling for nested taxonsets; and we show BEAST, which specifies that every TaxonSet is a Taxon, what the logic actually is.
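A minimal sketch of how such a multi-line value could be parsed, assuming the value has already been read from the config as one string with one entry per line. The function names `parse_taxonsets` and `members_of` are hypothetical and not part of BEASTling; lines without an equals sign (glottolog, a file path) are only collected, since their semantics are still to be specified.

```python
def parse_taxonsets(value):
    """Split a multi-line taxonsets value into named sets and external sources."""
    named = {}    # set name -> member names exactly as written
    sources = []  # e.g. "glottolog" or a file path; handled elsewhere
    for line in value.splitlines():
        line = line.strip()
        if not line:
            continue
        if "=" in line:
            name, members = line.split("=", 1)
            named[name.strip()] = [m.strip() for m in members.split(",") if m.strip()]
        else:
            sources.append(line)
    return named, sources


def members_of(name, named, _stack=()):
    """Return the tips of a taxonset; any unknown name is a taxonset of size 1."""
    if name not in named:
        return {name}
    if name in _stack:
        raise ValueError("taxonset %r is defined in terms of itself" % name)
    tips = set()
    for member in named[name]:
        # Members are looked up like any other taxonset, so nesting needs no special case.
        tips |= members_of(member, named, _stack + (name,))
    return tips


value = """foo_clade = glot0001, glot0002, glot0003
bar_clade = glot0004, glot0005, glot0006, glot0007, glot0008
glottolog
/path/to/file"""
named, sources = parse_taxonsets(value)
```

Because a plain tip name resolves to a one-element set, `members_of("glot0004", named)` and `members_of("bar_clade", named)` go through the same code path.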
Taxonsets is a very biological term. That can change, but I don't want it to imply these are already monophyletic.
Does the Python standard library's ConfigParser permit you to have equals signs in the value of a parameter?
When you say "we do have multi-line properties", do you mean they are supported by the parser or BEASTling is already making use of them?
Agreed, "taxonsets" is very biological and actually that is something I have tried to avoid and which the BEASTling paper mentioned, so if we can avoid exposing that term to BEASTling users it would be good. Just like we have tried to avoid terms like alignment, sequence, mutation, etc.
I'm trying an example out right now.
I meant “The config parser supports it”, not that we have things like that already.
Maybe “groups” or something like that?
Glottolog often uses multi-line properties (as a way to have lists as options), so they are supported by clldutils.inifile.INI: https://github.com/clld/clldutils/blob/68322265af7cc7161141cdea7c3b763a1f1a25db/clldutils/inifile.py#L45-L46
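A quick check of both questions (equals signs inside a value, and multi-line values), sketched with the standard library's `configparser` rather than `clldutils.inifile.INI` itself. That the two behave the same here is an assumption; the config text is the example from earlier in this thread.

```python
import configparser

CONF = """\
[languages]
taxonsets = foo_clade = glot0001, glot0002, glot0003
    bar_clade = glot0004, glot0005, glot0006, glot0007, glot0008
    glottolog
    /path/to/file
"""

parser = configparser.ConfigParser()
parser.read_string(CONF)

# The key ends at the first delimiter, so later equals signs stay in the value.
# Indented continuation lines are appended to the same value, separated by newlines.
value = parser.get("languages", "taxonsets")
lines = value.splitlines()
for line in lines:
    print(repr(line))
```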
Am I the only one who has an instinctive "eww, gross" reaction to the idea of multi-line properties?
@lmaurits I guess so :) They are pretty common, I see them in pyramid config files all the time. One shouldn't break the expectation, though, that you turn a multi-line property into a list by running v.split(), i.e. items of a list may be separated by newlines, but may also be whitespace-separated on the same line.
What about taxon sets that are glottolog clades? Can they be used in any case? What if they are also tip names? What if tip names are not in Glottolog? etc.
Taxonsets corresponding to the leaves of Glottolog clades should just work anywhere right now, i.e. for calibrations, geographic sampling, prior specification, etc. If the tip names are not in Glottolog, nothing will work, so this functionality is basically only of value for folks using non-Glottolog tip names. Which I hope is only people whose taxa aren't in Glottolog, e.g. dialectologists.
@lmaurits I don't like them, but I consider them the lesser evil compared to introducing a new section with “anything goes” property names and bigger magic and other sections necessary to include external sources (glottolog, files).
@xrotwang Beastling generally assumes that lists are comma-separated. What we are thinking about here is a list-of-(named-or-external-)lists, which is a data type we haven't used yet.
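The two list conventions mentioned here (whitespace or newline separated via `v.split()`, and comma-separated) can clash. The sketch below shows the clash and one tolerant splitter. `split_items` is a hypothetical helper, and it assumes that item names never contain whitespace or commas.

```python
import re


def split_items(value):
    """Split a config value on commas and/or any whitespace, including newlines."""
    return [item for item in re.split(r"[,\s]+", value) if item]


comma_list = "glot0001, glot0002, glot0003"

# Plain split() keeps the commas attached to the items.
naive = comma_list.split()

# The tolerant splitter handles commas, spaces and newlines alike.
tolerant = split_items("glot0001, glot0002\n    glot0003 glot0004")
```

This only covers flat lists. A list of named lists, as proposed for taxonsets, still needs one entry per line so that each name can be tied to its members.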
@lmaurits Or – my use case here – people who want to investigate the truth of a particular glottolog clade. Which is why I need language groups, not clades – I'm investigating whether a thing is paraphyletic or not.
Somehow I feel uneasy about unstructured property names in a new section
Do you know which part makes you uneasy? We already have unstructured property names in [calibrations] and [geo_priors]. Is it the introduction of a new section which seems to contain only content which logically belongs in [languages], purely for the sake of nicer syntax?
Quite likely! I guess I can get over it, considering it's only that.
We could think of doing things like the following:
[languages]
language_group_sources = glottolog-macroareas, glottolog-clades, path/language_metadata/universal_groups.ini, path/Wordlist-metadata.json:LanguageTable:Region
[language_groups]
on-papua = kbt, abg
macronesian = on-papua, mala1545
The thing I mean with glottolog-clades could be the default for language_group_sources, and then all of MRCA-ASR, geo_priors, calibrations, monophyletic restrictions, etc. could use the same back-end for assembling the taxon sets they operate with, without changing the fact that the generic case is “use glottolog clades”?
(That unification would be a thing for 1.5/2.0 depending on whether it can be made backwards-compatible or not.)
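A rough sketch of how a shared back-end might read the two sections proposed above, using the standard library's `configparser`. Everything here is an assumption for illustration: the helper names `split_list` and `classify_source` are hypothetical, as is the reading of `path:Table:Column` as a file path followed by selectors. The `glottolog-clades` fallback follows the default suggested above.

```python
import configparser

CONF = """
[languages]
language_group_sources = glottolog-macroareas, glottolog-clades, path/language_metadata/universal_groups.ini, path/Wordlist-metadata.json:LanguageTable:Region

[language_groups]
on-papua = kbt, abg
macronesian = on-papua, mala1545
"""


def split_list(value):
    """Split a comma-separated config value into stripped, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def classify_source(spec):
    """Hypothetical classification of one language_group_sources entry."""
    if spec.startswith("glottolog-"):
        return ("glottolog", spec[len("glottolog-"):])
    # Assumed convention: anything after the first colon selects a table and column.
    path, _, selector = spec.partition(":")
    return ("file", path, selector.split(":") if selector else [])


parser = configparser.ConfigParser()
parser.read_string(CONF)

raw_sources = parser.get("languages", "language_group_sources",
                         fallback="glottolog-clades")
sources = [classify_source(spec) for spec in split_list(raw_sources)]

# Explicit groups; members may be tips or other group names (e.g. on-papua).
groups = {name: split_list(value)
          for name, value in parser["language_groups"].items()}
```

Calibrations, geo_priors and the other features could then all ask this one structure for a group's members, rather than each resolving Glottolog clades on its own.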
Pull request #171 needs a merge conflict resolved. Did anyone have other issues with it?
(That unification would be a thing for 1.5/2.0 depending on whether it can be made backwards-compatible or not.)
That is the main reason this is not closed, btw., but it could be made a separate issue if we want.
I did that now as #181.
In principle, we can build an XML that leads to the ancestral reconstruction of any MRCA of a set of taxa, or its parent if we want that. Specifications for taxon sets might also be useful for monophyly specifications (#151/#146), geographical models (#150/#153), calibration dates.
What do we want the specification to look like on the beastling.conf side? #153 lists some ideas.