Open jameshadfield opened 3 years ago
After our recent conversations internally and with @dpark01 about reducing the complexity of the ncov workflow and improving the portability of the existing workflow with other workflow languages and/or platforms, I'm bumping this here as a higher priority issue and moving it from the "backlog" to the "next up".
Here is my current hack--would love to replace all that with augur subsample
It would be nice if a command like this could emit, as output, a numeric count of the selected samples in each deme.
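To make the request above concrete, here is a minimal sketch of such a per-deme count. It assumes the metadata is a TSV with a "strain" column and that the deme is a single column such as "region". The function name, column names, and sample data are all hypothetical; this is not an existing augur feature.

```python
import csv
import io
from collections import Counter


def count_per_deme(metadata_tsv, selected_strains, deme_column="region"):
    """Count selected strains per deme.

    metadata_tsv: an open text handle on a tab-separated metadata file
    selected_strains: set of strain names kept by subsampling
    deme_column: metadata column that defines the deme (assumed name)
    """
    counts = Counter()
    for row in csv.DictReader(metadata_tsv, delimiter="\t"):
        if row["strain"] in selected_strains:
            counts[row[deme_column]] += 1
    return dict(counts)


# Illustrative data only; a real run would open the metadata file instead.
metadata = "strain\tregion\nA\tEurope\nB\tEurope\nC\tAsia\nD\tAfrica\n"
selected = {"A", "B", "C"}
deme_counts = count_per_deme(io.StringIO(metadata), selected)
for deme, n in sorted(deme_counts.items()):
    print(f"{deme}\t{n}")
```

Demes with no selected samples are simply absent from the result; reporting them as zero would need a first pass over all metadata rows.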
PR #762 begins an implementation of augur subsample
Update: we've had internal discussions considering this again with a different YAML schema and the addition of weighted sampling (#1318).
Tasks
@victorlin to fill this out
Links
augur subsample proposal
Original issue
A common use case is versatile sub-sampling of datasets to suit a particular research question. The current best example of this is the (wonderful) SARS-CoV-2 pipeline, which leverages an augur filter rule, a script to calculate priorities, and snakemake wizardry to allow versatile, declarative subsampling schemes to be simply and intuitively defined.
This allows a simple-to-reason-with YAML file to result in a very bespoke subsampling scheme:
The question arises: how do we do this for a different pathogen?
As the SARS-CoV-2 example leverages snakemake, one solution would be to abstract that logic into an importable snakemake rule. The alternative approach would be a new augur command, augur subsample, which takes a YAML file declaring the desired subsampling settings. Learning from our work on nCoV, this would essentially replace the snakemake-controlled augur filter commands with a single augur subsample command. The YAML file would look similar or identical to the current snakemake implementation. The subcommand would leverage the functions used by augur filter as well as the priorities script from nCoV.

Thoughts?
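As a rough illustration of what "replacing the snakemake-controlled augur filter commands" could mean, the sketch below turns a declarative scheme into one augur filter invocation per named sample. The scheme is written as a Python dict standing in for the parsed YAML. Its keys (group_by, seq_per_group, max_sequences, query) are assumptions loosely modeled on the nCoV workflow, not the proposed schema, and the function name and file names are hypothetical. It only builds argument lists and does not run augur.

```python
def build_filter_command(sample_name, settings, metadata="metadata.tsv"):
    """Translate one sample's settings into an `augur filter` argument list.

    The setting names are illustrative assumptions; the flags are existing
    `augur filter` options.
    """
    args = ["augur", "filter", "--metadata", metadata]
    if "group_by" in settings:
        # e.g. "division year month" becomes three --group-by values
        args += ["--group-by", *settings["group_by"].split()]
    if "seq_per_group" in settings:
        args += ["--sequences-per-group", str(settings["seq_per_group"])]
    if "max_sequences" in settings:
        args += ["--subsample-max-sequences", str(settings["max_sequences"])]
    if "query" in settings:
        args += ["--query", settings["query"]]
    # Each sample writes its own strain list, to be merged in a later step.
    args += ["--output-strains", f"{sample_name}.txt"]
    return args


# Hypothetical scheme with a focal sample and a contextual sample.
scheme = {
    "focal": {
        "group_by": "division year month",
        "max_sequences": 200,
        "query": "country == 'Switzerland'",
    },
    "context": {
        "group_by": "country year month",
        "seq_per_group": 5,
    },
}

commands = [build_filter_command(name, s) for name, s in scheme.items()]
for command in commands:
    print(" ".join(command))
```

A real augur subsample would presumably call the filter functions directly rather than shelling out, and would also need to handle priorities and the final merge of the per-sample strain lists; neither is covered here.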
Examples
subsampling.yaml: