nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0

augur subsample command #635

Open jameshadfield opened 3 years ago

jameshadfield commented 3 years ago

Tasks

@victorlin to fill this out

Links


Original issue

A common use case is versatile subsampling of datasets to suit a particular research question. The current best example of this is the (wonderful) SARS-CoV-2 pipeline, which leverages an augur filter rule, a script to calculate priorities, and snakemake wizardry to allow versatile, declarative subsampling schemes to be simply and intuitively defined.

This lets a YAML file that is simple to reason about define a very bespoke subsampling scheme: [image]

The question arises: how do we do this for a different pathogen?

As the SARS-CoV-2 example leverages snakemake, one solution would be to abstract that logic into an importable snakemake rule. The alternative approach would be a new augur command, augur subsample, which takes a YAML file declaring the desired subsampling settings. Learning from our work on nCoV, this would essentially replace the snakemake-controlled augur filter commands with a single augur subsample command. The YAML file would look similar or identical to the one used by the current snakemake implementation. The subcommand would leverage the functions used by augur filter as well as the priorities script from nCoV.

Thoughts?

Examples

subsampling.yaml:

schemes:
  switzerland:
    # Focal samples for country
    country:
      group_by: "division year month"
      max_sequences: 1500
      exclude: "--exclude-where 'country!={country}'"
    # Contextual samples from country's region
    region:
      group_by: "country year month"
      seq_per_group: 20
      exclude: "--exclude-where 'country={country}' 'region!={region}'"
      priorities:
        type: "proximity"
        focus: "country"
    # Contextual samples from the rest of the world,
    # excluding the current region to avoid resampling.
    global:
      group_by: "country year month"
      seq_per_group: 10
      exclude: "--exclude-where 'region={region}'"
      priorities:
        type: "proximity"
        focus: "country"

augur subsample --include <TXT> --sequences <FASTA> \
    --metadata <TSV> --schemes <YAML> --output <FASTA>
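
A minimal sketch of how one sample definition from the YAML above might be translated into arguments for the existing augur filter command. This is only an illustration of the idea, not a proposed implementation: the function name build_filter_args, the KEY_TO_FLAG mapping, and the context values are hypothetical, and both the priorities handling and the sequence/metadata inputs are left out.

```python
import shlex

# Hypothetical mapping from scheme keys to existing `augur filter` flags.
KEY_TO_FLAG = {
    "seq_per_group": "--sequences-per-group",
    "max_sequences": "--subsample-max-sequences",
}


def build_filter_args(sample, context):
    """Translate one sample definition into `augur filter` arguments."""
    args = ["augur", "filter"]
    if "group_by" in sample:
        # "country year month" becomes: --group-by country year month
        args += ["--group-by", *sample["group_by"].split()]
    for key, flag in KEY_TO_FLAG.items():
        if key in sample:
            args += [flag, str(sample[key])]
    if "exclude" in sample:
        # Fill in placeholders such as {country}, then split shell-style.
        args += shlex.split(sample["exclude"].format(**context))
    return args


# The "region" sample from the scheme above, without its priorities.
region_sample = {
    "group_by": "country year month",
    "seq_per_group": 20,
    "exclude": "--exclude-where 'country={country}' 'region!={region}'",
}
# Assumed placeholder values for illustration only.
context = {"country": "Switzerland", "region": "Europe"}

command = build_filter_args(region_sample, context)
print(shlex.join(command))
```

Returning a list of arguments rather than a single string means the result could be passed to subprocess.run without further quoting.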

huddlej commented 3 years ago

After our recent conversations internally and with @dpark01 about reducing the complexity of the ncov workflow and improving its portability to other workflow languages and/or platforms, I'm bumping this here as a higher-priority issue and moving it from the "backlog" to the "next up".

dpark01 commented 3 years ago

Here is my current hack--would love to replace all that with augur subsample

It would be nice if a command like this could emit as output a numeric count of selected samples in each deme.
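
A rough sketch of the kind of per-deme count meant here, assuming a metadata TSV with a "strain" column and treating one metadata column (here "country") as the deme. The function name count_per_deme and the toy data are hypothetical.

```python
import csv
import io
from collections import Counter


def count_per_deme(metadata_tsv, selected_strains, deme_column):
    """Count selected strains per value of `deme_column` in a metadata TSV."""
    reader = csv.DictReader(metadata_tsv, delimiter="\t")
    return Counter(
        row[deme_column] for row in reader if row["strain"] in selected_strains
    )


# Toy metadata for illustration; real input would be the --metadata TSV.
metadata = io.StringIO(
    "strain\tcountry\n"
    "A\tSwitzerland\n"
    "B\tSwitzerland\n"
    "C\tFrance\n"
    "D\tFrance\n"
)

# Strains A, B and C stand in for the subsampled set.
counts = count_per_deme(metadata, {"A", "B", "C"}, "country")
for deme, n in sorted(counts.items()):
    print(f"{deme}\t{n}")
```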

jameshadfield commented 3 years ago

PR #762 begins an implementation of augur subsample

victorlin commented 5 months ago

Update: we've had internal discussions considering this again with a different YAML schema and the addition of weighted sampling (#1318).