Support language subsampling

lmaurits commented 7 years ago

It would be nice to be able to subsample a dataset (after all the other language filtering features BEASTling provides, such as by family or by macroarea, have been applied) to a specified number of languages (e.g. 100). This would let one quickly test smaller versions of large analyses, among other things. Suggested syntax is:

[languages]
subsample_size = 100

I'm thinking of rushing this in for 1.4.0, because I need it. You would ask for 100 (or whatever) languages and BEASTling would randomly subsample exactly that many, with every language having an equal probability of inclusion, because this is quick and easy to do with random.sample. The trick will be seeding the PRNG with, say, the whole list of languages, so that you get the same random subsample each time and can e.g. do comparisons between different substitution models.
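A minimal sketch of the seeded approach described above. This is an illustration, not BEASTling's actual code: the function name `subsample_languages` is hypothetical, and hashing the sorted language list with `hashlib` is just one assumed way of turning "the whole list of languages" into a seed (Python's built-in `hash()` on strings varies between processes, so it would not give repeatable results).

```python
import hashlib
import random


def subsample_languages(languages, subsample_size):
    """Return a reproducible, uniformly random subsample of `languages`.

    Hypothetical helper for illustration only.
    """
    # Sort and de-duplicate so the result does not depend on input order.
    languages = sorted(set(languages))
    if subsample_size >= len(languages):
        return languages
    # Derive the seed from the language list itself: the same list always
    # yields the same subsample, a different list yields a different one.
    digest = hashlib.sha256(",".join(languages).encode("utf-8")).hexdigest()
    rng = random.Random(int(digest, 16))
    # random.sample draws without replacement, each language equally likely.
    return sorted(rng.sample(languages, subsample_size))
```

Because the seed depends only on the language set, two configurations that differ only in, say, the substitution model would end up with the same 100 languages.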

But in future releases this could be extended a lot, e.g. careful subsampling so that different Glottolog (sub)families or macroareas are equally represented, to the best extent possible with the dataset. I imagine augmenting subsample_size with subsampling_strategy, e.g.:

[languages]
subsample_size = 100
subsampling_strategy = family_balanced

Possible subsampling strategies are family balancing, macroarea balancing, phylogenetic diversity maximising (e.g. sample as many Glottolog families as possible, and if sampling multiple languages from each family is required, sample them from as many different subfamilies as possible, etc.).
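As a rough sketch of what a family-balanced strategy could look like: the function below takes one language per family in repeated passes until the requested size is reached. Everything here is an assumption for illustration (the name `family_balanced_subsample`, the plain dict mapping language to family, the fixed default seed); it ignores subfamilies and macroareas and is not a proposal for the final algorithm.

```python
import random
from collections import defaultdict


def family_balanced_subsample(lang_to_family, subsample_size, seed=0):
    """Pick languages so that families are represented as evenly as possible.

    Hypothetical helper; `lang_to_family` maps language ID -> family name.
    """
    by_family = defaultdict(list)
    for lang, family in sorted(lang_to_family.items()):
        by_family[family].append(lang)

    rng = random.Random(seed)
    # Randomise which members of each family get picked first.
    for members in by_family.values():
        rng.shuffle(members)

    remaining = sorted(by_family)
    chosen = []
    while remaining and len(chosen) < subsample_size:
        # Shuffle family order each pass so a partial final pass
        # does not always favour the same families.
        rng.shuffle(remaining)
        for family in list(remaining):
            if len(chosen) == subsample_size:
                break
            chosen.append(by_family[family].pop())
            if not by_family[family]:
                # Family exhausted; later passes skip it.
                remaining.remove(family)
    return sorted(chosen)
```

Small families are used up after a pass or two, so large families only contribute extra languages once every smaller family has given what it can, which is the "to the best extent possible with the dataset" behaviour.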

Defining theoretically solid subsampling strategies for phylogenetics is actually quite an important issue for the field, so supporting and even defining best practices in an easy-to-use way in BEASTling would be excellent.

xrotwang commented 7 years ago

Sounds good. We could add keywords for known samples, like WALS 100 and WALS 200.

Anaphory commented 7 years ago

We can read lists of languages from external files, can't we? – That should already do what @xrotwang suggests, maybe after removing some sanity checks which we have at the moment.

Wouldn't it be useful to make that sampling logic explicitly available as a tiny separate helper script that reads a list of languages and produces a subsampled list of languages?

lmaurits commented 7 years ago

You really like your external scripts, huh? :stuck_out_tongue:

lmaurits commented 7 years ago

I mean, I get it. The Unix philosophy advocate in me understands why this is helpful: it would be good if people could use the same functionality to get a list of languages for use with non-BEASTling software. But I'm also wary of making things harder than we already do for users who are not command line wizards.

We could, of course, write this in such a way that just this little bit of functionality can be used standalone by a Python programmer, e.g. from beastling import subsample or whatever. Then the helper script becomes trivial.

Anaphory commented 7 years ago

Hm, I think this can become quite deep magic, so people would want to have a look at the intermediate result anyway to see what's going on. That's another reason to make the separation of these steps explicit.

xrotwang commented 7 years ago

It seems difficult to figure out the perfect place for such functionality. A helper script wouldn't be much different from using R, which I assume supports all sorts of sampling. So I'd be inclined to integrate this tightly into beastling, because loose integration does not gain much.

Anaphory commented 7 years ago

[… Part of me calls for allowing command line scripts inside configuration files, so reading from files is reduced to languages = $(cat that_file.txt) and subsampling would be languages = $(python subsample.py languages.csv) to solve all these problems. NO! BAD IDEA! ]

Anaphory commented 7 years ago

@xrotwang Well, if there is a 3-line R script that does this – I think that's about the length of the minimal Python script for @lmaurits' random subsampling example – it could easily go into that part of the tutorial/docs, to show “this is how to combine this with the tools you might already know from somewhere else”.

But I think this functionality has the potential for becoming useful and complicated – various scorer functions, specifying random subsampling on levels below the family level, which would need to be configured, geographical sampling, etc. – so I would give it a separate entry_point for a separate module, which will live inside the beastling package or at least the beastling repository.

lmaurits commented 7 years ago

I'm just going to put it in Configuration for now to get it done, but I am totally happy to leave the option of breaking it out into a script on the table for 1.5, when we are doing drastic rearrangements anyway. We should think about this in a general way, rather than considering just this use case, i.e. do we want BEASTling to be a tightly integrated bundle, or a loose connection of scripts for power users with a nice UI tying them all together seamlessly for most users, etc. Do we want to distribute helper scripts (e.g. I have some code I've been meaning to tidy up that takes a BEASTling-generated GeoJSON file and outputs a map using BaseMaps), if not within BEASTling then in a BEASTling wiki or something?

lmaurits commented 7 years ago

Perkele! This is not as straightforward as one might imagine, due to somewhat awkward information passing between the Configuration object and the Model objects. Configuration can't do the subsampling until after taking the union or intersection of all the individual Models' language sets, i.e. after the Models have already been asked by Configuration to throw out languages based on family, macroarea, etc. However, the Models rely on their own opinion as to which languages are still in the analysis to generate XML, so after Configuration does the subsampling it has to tell all the Models to, once again, throw some languages out. This is not remotely difficult, just profusely ugly, with this multi-stage back-and-forth message passing. Will come back to this tomorrow.
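To make the back-and-forth concrete, here is a toy sketch of the four stages described above. `ToyModel` and `ToyConfiguration` are hypothetical stand-ins written for this illustration; they do not mirror BEASTling's real classes or method names, and the sketch assumes at least one model.

```python
import random


class ToyModel:
    """Stand-in for a model that tracks which languages it still covers."""

    def __init__(self, name, languages):
        self.name = name
        self.languages = set(languages)

    def restrict_to(self, keep):
        # Throw out every language not in `keep`.
        self.languages &= set(keep)


class ToyConfiguration:
    """Stand-in for the object coordinating all models."""

    def __init__(self, models, subsample_size, overlap="union"):
        self.models = models
        self.subsample_size = subsample_size
        self.overlap = overlap

    def process(self, allowed):
        # Stage 1: each model applies the ordinary filters (family, macroarea, ...).
        for model in self.models:
            model.restrict_to(allowed)
        # Stage 2: combine the per-model language sets.
        sets = [model.languages for model in self.models]
        if self.overlap == "union":
            combined = set.union(*sets)
        else:
            combined = set.intersection(*sets)
        # Stage 3: subsample, seeding from the combined list for repeatability.
        pool = sorted(combined)
        if self.subsample_size < len(pool):
            rng = random.Random(",".join(pool))
            pool = rng.sample(pool, self.subsample_size)
        # Stage 4: tell every model, once again, to throw languages out.
        for model in self.models:
            model.restrict_to(pool)
        return sorted(pool)
```

Stage 4 is the second round trip: without it, each model would still generate XML for languages that the subsample (or the intersection) has already excluded.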

Anaphory commented 7 years ago

Is there a lesson learnt from this about how to structure things better if we ever find the time for that?

lmaurits commented 7 years ago

Yes, I think so. This morning I sat down with pen and paper and traced out all this backing and forthing. In the end it was pretty clear how to make subsampling work with a quick but ugly hack, and I've done that. However, it was also pretty clear that the way things are done is sub-optimal and, in fact, there are surely other bugs lurking here*. This should definitely be cleaned up for 1.5 if not sooner. I'll create an issue for this.

* E.g. if you have multiple datasets with non-matching language sets and you do an "intersection" analysis as opposed to a "union", the data for those languages not present in all datasets still winds up in the XML, although I don't think it influences the likelihood, so it's not a terrible bug. Actual terrible bugs that I haven't thought of yet may, of course, exist.

lmaurits commented 7 years ago

Okay, the basic implementation of this is now documented and tested. I've opened issue #173 to make sure we fix up this ugly bit of the code for 1.5, so I'm closing this issue now.

lmaurits / BEASTling

Support language subsampling #170