Closed: victorlin closed this issue 1 month ago.
There's been lots of internal discussions on this feature. Contrary to the proposal in the issue description, it does seem reasonable to encode multi-dimensional weights in a CSV/TSV format, though it's likely that this type of file must be generated via a script.
```
country  month    weight
A        2020-01  N
A        2020-02  N
A        2020-03  N
…
B        2020-01  N
B        2020-02  N
B        2020-03  N
…
```
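As noted above, a fully specified weights file like this would likely be generated via a script. Here's a minimal sketch of what such a script could look like, using Python's standard `csv` module; the country/month values and the constant weight are invented placeholders for the `N` values in the table:

```python
import csv
import io

# Hypothetical distinct values taken from the metadata.
countries = ["A", "B"]
months = ["2020-01", "2020-02", "2020-03"]

def weight_for(country, month):
    # Placeholder for the "N" values above: a real script would derive these
    # from case counts, population sizes, etc.
    return 1

# Write one row per country x month combination.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerow(["country", "month", "weight"])
for country in countries:
    for month in months:
        writer.writerow([country, month, weight_for(country, month)])

print(buf.getvalue())
```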
Some more notes:

- The weights file implies both `--group-by` (determined by weights file columns) and `--sequences-per-group` (calculated dynamically using weights).
- The list of weights should partition the data completely. In the example above, all combinations of `country` and `month` present in the data must be provided a weight. Raise a user error if it doesn't.
- In the initial implementation, all cells of the weights file must have a value. In the future, this can be extended to allow partitioning of the data at different resolutions. Here's an example with geographically even sampling at two different resolutions:
```
country  division  weight
A                  <1/n_countries>
B                  <1/n_countries>
C                  <1/n_countries>
…
USA      WA        <1/n_countries * 1/n_divisions>
USA      WA        <1/n_countries * 1/n_divisions>
USA      WA        <1/n_countries * 1/n_divisions>
…
USA      OR        <1/n_countries * 1/n_divisions>
USA      OR        <1/n_countries * 1/n_divisions>
…
```
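The `<…>` placeholder formulas above can be read as follows; a small sketch with invented counts (4 countries total, 2 divisions within the USA), where `n_countries` and `n_divisions` are illustrative names, not part of the proposal:

```python
# Invented counts for illustration.
n_countries = 4
n_divisions = {"USA": 2}

# Country-level rows share weight equally across countries...
country_weight = 1 / n_countries

# ...while division-level rows further split a country's share equally
# across its divisions.
division_weight = country_weight * (1 / n_divisions["USA"])

print(country_weight, division_weight)  # 0.25 0.125
```

With 3 country-level rows (A, B, C) plus the USA's 2 division-level rows, the weights sum to 1, i.e. the file still partitions the data completely.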
Thanks for spelling things out in such detail @victorlin. A couple thoughts:
1. I'd suggest encoding weights independently for column 1 (eg `country`) vs column 2 (eg `month`). The situations where we have an interaction effect between weights seem quite limited (I can't think of an immediate example in existing subsampling routines). I could easily write this YAML file for ncov, while for the fully specified TSV example, I'd need a script that generates a large number of combinations (that I don't actually care about).
Note that you could still encode interaction terms in a YAML file, eg:

```yaml
# Weight countries by population size.
country month:
  A 2020-01: 10
  B 2020-01: 10
  C 2020-01: 3
  D 2020-01: 1
  E 2020-01: 6
  A 2020-02: 10
  B 2020-02: 10
  C 2020-02: 3
  D 2020-02: 1
  E 2020-02: 6
```
Again, I believe that independent columns will cover >90% of use cases and won't force people to write intermediate scripts when they have multiple columns they care about.
2. This would necessitate all possible values of a column to have a weight so that the denominator can be calculated by summing up all the values.

> The list of weights should partition the data completely. In the example above, all combinations of country and month present in the data must be provided a weight. Raise user error if it doesn't.

To avoid enforced verbosity, I could also imagine assuming a weight of `1` for any missing entries, but raising a warning saying that missing values have been assumed to be `1`.
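The fallback behavior suggested here (assume `1` for missing entries, but warn) could be sketched as follows; `resolve_weights` is a hypothetical helper for illustration, not an Augur function:

```python
import warnings

def resolve_weights(groups, weights):
    # Hypothetical helper: return a weight for every group present in the
    # data, assuming 1 for any group missing from the weights mapping and
    # warning about it (instead of raising a hard error).
    missing = sorted(g for g in groups if g not in weights)
    if missing:
        warnings.warn(f"Assuming weight of 1 for groups without an entry: {missing}")
    return {g: weights.get(g, 1) for g in groups}

resolved = resolve_weights(
    groups=[("A", "2020-01"), ("B", "2020-01")],
    weights={("A", "2020-01"): 10},
)
print(resolved)  # {('A', '2020-01'): 10, ('B', '2020-01'): 1}
```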
> I could easily write this YAML file for ncov

My speculative hesitation with YAML is that it'll be hard to translate from a source file, e.g. case counts, which are typically in TSV format (but I haven't actually tried). YAML would definitely be easier for manually defining simple weighting logic such as "2x sequences from region A compared to B".
> I'd need a script that generates a large number of combinations (that I don't actually care about).

Good point. The combinations need to be programmatically generated somewhere along the line. If providing weights as YAML, the subsampling tool would internally generate weights per group, analogous to the TSV.
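That internal expansion could look like this sketch, assuming no interaction between columns so that combined weights are products of the per-column weights; the column names and weight values here are invented:

```python
from itertools import product

# Hypothetical independent per-column weights, as they might appear in YAML.
column_weights = {
    "country": {"A": 2, "B": 1},
    "month": {"2020-01": 1, "2020-02": 3},
}

# Expand into per-group weights analogous to the fully specified TSV:
# one entry per combination, weight = product across columns.
columns = list(column_weights)
per_group = {}
for combo in product(*(column_weights[c] for c in columns)):
    weight = 1
    for col, value in zip(columns, combo):
        weight *= column_weights[col][value]
    per_group[combo] = weight

print(per_group[("A", "2020-02")])  # 2 * 3 = 6
```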
I think it'd be manageable to first implement the underlying logic and allow configuration via both YAML and TSV to get a feel for what works better under different scenarios.
> To avoid enforced verbosity I could also imagine assuming a weight of `1` for any missing entries.

My (again speculative) concern is that there may be few cases in which `1` is a useful default, especially if weights are based on case counts or population size.
This seems like a small behavioral detail where we'll only know the right choice once we have an implementation to test against real-world usage. We could start with errors and revisit if the enforced verbosity turns out to be overkill.
After working on https://github.com/nextstrain/ncov/commit/0fd6861b5b550306160e1ce92c65b0c65085e096, I've realized that in order to reduce the number of samples (i.e. calls to `augur filter`) in the workflow, `augur filter` will need the extended implementation that allows partitioning of the data at different resolutions. I don't see how the initial implementation will simplify the ncov workflow.
Here's an idea: implement weighted subsampling as a part of `augur subsample` and configure it in the new YAML.
Using the currently proposed YAML as-is would look something like:
```yaml
samples:
  north_america_6m:
    size: 4000
    weights:
      # Region weighting: 4:1 for North America to rest of world
      region:
        North America: 4
        # Africa: 1
        # Asia: 1
        # Europe: 1
        # …
      # Time weighting: 4:1 for recent sequences to early sequences (2M threshold)
      month:
        # 2020-01: 1
        # 2020-02: 1
        # …
        2024-02: 4
        2024-03: 4
```
Issues:

- If missing entries are assumed to have a weight of `1`, the weighting will change from 4:1 North America to rest of the world to 4:1 North America to every other region (i.e. 4:6 North America to rest of the world). Time weighting is similarly affected.
- There's no way to specify `--group-by` columns for individual samples.

Here's an alternative which addresses those issues:
```yaml
samples:
  north_america_6m:
    size: 4000
    partitions:
      # Region weighting: 4:1 for North America to rest of world
      region:
        - query: region == 'North America'
          weight: 4
          uniform_sampling: division
        - query: region != 'North America'
          weight: 1
          uniform_sampling: country
      # Time weighting: 4:1 for recent sequences to early sequences (2M threshold)
      month:
        - query: date >= 2M
          weight: 4
          uniform_sampling: week
        - query: date < 2M
          weight: 1
          uniform_sampling: month
```
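One way to read the `partitions` configuration: each partition's `query` selects records, and the target `size` is split across partitions in proportion to `weight`. A rough sketch of that allocation, with plain-Python predicates standing in for the pandas-style query strings:

```python
# Partitions for the region example above: 4:1 North America to rest of world.
partitions = [
    {"match": lambda record: record["region"] == "North America", "weight": 4},
    {"match": lambda record: record["region"] != "North America", "weight": 1},
]

def partition_sizes(total_size, partitions):
    # Split the total target size across partitions proportionally to weight.
    total_weight = sum(p["weight"] for p in partitions)
    return [round(total_size * p["weight"] / total_weight) for p in partitions]

sizes = partition_sizes(4000, partitions)
print(sizes)  # [3200, 800]
```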
After thinking more along the lines of implementing this in `augur subsample`, I've realized there are two types of weighted sampling:

I think these can be implemented separately, where (1) can be YAML-based and (2) can be TSV-based. I've added more detail and examples in the subsampling doc.
Thanks for the thoughts @victorlin. I'll try to pull together a more cohesive thread for how I'd see this working for the ncov example. But broadly, I like the general idea of encoding weights independently between categories (`country` vs `month`, for example) and assuming no interaction between categories. I.e. if you have a weight of 4 for North America and 1 for global context, and a weight of 4 for recent samples and 1 for older samples, then I'd assume a sampling weight of 4×4 = 16 for recent North America, 1×4 = 4 for recent global, 4×1 = 4 for older North America, and 1×1 = 1 for older global.
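That arithmetic in code form (the category labels here are shorthand for illustration, not actual metadata values):

```python
# Independent category weights: region (North America vs global context)
# and time (recent vs older samples).
region_weights = {"North America": 4, "global": 1}
time_weights = {"recent": 4, "older": 1}

# With no interaction between categories, combined weights are products.
combined = {
    (region, period): region_weights[region] * time_weights[period]
    for region in region_weights
    for period in time_weights
}

print(combined[("North America", "recent")])  # 4 * 4 = 16
print(combined[("global", "older")])          # 1 * 1 = 1
```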
`augur frequencies` has a weights interface which was never discussed here:

It is different in that it only allows weighting on a single column (defined by `--weights-attribute`) and the file format is JSON instead of TSV.
I've considered the idea of swapping `--group-by-weights` with `--weights` + `--weights-attribute` for the sake of consistency across Augur. It's definitely possible, but I've decided against it for the following reasons, ordered from most to least important:

1. Multiple weighted columns aren't supported by the `--weights` + `--weights-attribute` interface. Those are already supported by the TSV format and I would rather leave the support built-in than pending a redesign of the `--weights` + `--weights-attribute` interface.
2. `--group-by-weights` pairs nicely with `--group-by`. This is useful because all weighted columns must be passed to `--group-by`, i.e. `--group-by-weights` is an extension of `--group-by`. It wouldn't be so obvious with `--weights-attribute`.
3. `--group-by-weights` in #1454 has already implemented various checks for the TSV format.

I think that `--group-by-weights` is helpfully clear when paired with the familiar `--group-by`. I support your decision to keep the interface as currently implemented.
`augur filter --group-by-weights` was released in Augur 25.3.0.
Context

Currently, `--subsample-max-sequences` effectively calculates a value for `--sequences-per-group` which applies to all groups specified by `--group-by`.

This behavior does not work for all scenarios. Example: there are 5 countries with vastly different population sizes, and 60 sequences are requested for subsampling. The command would be:

This means 12 sequences will be sampled from each country, regardless of population size. One may want the sample to be representative of country population sizes rather than uniform. This limitation is what prompts higher-level subsampling workarounds such as https://github.com/nextstrain/ncov/pull/1074.
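The limitation in numbers, as a sketch; the per-country weights below are invented, chosen only so that the totals match the worked example later in this issue (total weight 3000, with A = 1000 and C = 300):

```python
n_countries = 5
max_sequences = 60

# Current behavior: an even split across groups, regardless of population.
uniform_per_group = max_sequences // n_countries
print(uniform_per_group)  # 12

# Desired behavior: a weighted split, using hypothetical population-like weights.
weights = {"A": 1000, "B": 900, "C": 300, "D": 500, "E": 300}
total = sum(weights.values())  # 3000
weighted = {country: max_sequences * w // total for country, w in weights.items()}
print(weighted["A"], weighted["C"])  # 20 6
```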
Tasks
Rollout
Original proposed solution
Implement an option `--subsample-weights`, which reads a file that specifies weights per `--group-by` column. A simple example, `weights.yaml`:

With this information, a different number of sequences can be calculated per group.
- `A` would have 60*1000/3000 = 20 sequences.
- `C` would have 60*300/3000 = 6 sequences.

The absence of a column from the weights file can imply equal weighting. In other words, the example can be updated to use `--group-by country month` while keeping `weights.yaml` as-is to have weighted `country` sampling for each time bin.

Or, a more complex example where time is also weighted:
`weights.yaml`:

Notes: