Closed: victorlin closed this issue 1 month ago.
There's been lots of internal discussions on this feature. Contrary to the proposal in the issue description, it does seem reasonable to encode multi-dimensional weights in a CSV/TSV format, though it's likely that this type of file must be generated via a script.
```
country  month    weight
A        2020-01  N
A        2020-02  N
A        2020-03  N
…
B        2020-01  N
B        2020-02  N
B        2020-03  N
…
```
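As noted above, a fully specified weights file like this would likely be generated via a script. Here's a minimal sketch of what such a script could look like, using Python's standard `csv` module; the country/month values and the constant weight are invented placeholders for the `N` values in the table:

```python
import csv
import io

# Hypothetical distinct values taken from the metadata.
countries = ["A", "B"]
months = ["2020-01", "2020-02", "2020-03"]

def weight_for(country, month):
    # Placeholder for the "N" values above: a real script would derive these
    # from case counts, population sizes, etc.
    return 1

# Write one row per country x month combination.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerow(["country", "month", "weight"])
for country in countries:
    for month in months:
        writer.writerow([country, month, weight_for(country, month)])

print(buf.getvalue())
```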
Some more notes:

- The weights file implies both `--group-by` (determined by weights file columns) and `--sequences-per-group` (calculated dynamically using weights).
- The list of weights should partition the data completely. In the example above, all combinations of `country` and `month` present in the data must be provided a weight. Raise a user error if it doesn't.
- In the initial implementation, all cells of the weights file must have a value. In the future, this can be extended to allow partitioning of the data at different resolutions. Here's an example with geographically even sampling at two different resolutions:
```
country  division  weight
A                  <1/n_countries>
B                  <1/n_countries>
C                  <1/n_countries>
…
USA      WA        <1/n_countries * 1/n_divisions>
USA      WA        <1/n_countries * 1/n_divisions>
USA      WA        <1/n_countries * 1/n_divisions>
…
USA      OR        <1/n_countries * 1/n_divisions>
USA      OR        <1/n_countries * 1/n_divisions>
…
```
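The `<…>` placeholder formulas above can be read as follows; a small sketch with invented counts (4 countries total, 2 divisions within the USA), where `n_countries` and `n_divisions` are illustrative names, not part of the proposal:

```python
# Invented counts for illustration.
n_countries = 4
n_divisions = {"USA": 2}

# Country-level rows share weight equally across countries...
country_weight = 1 / n_countries

# ...while division-level rows further split a country's share equally
# across its divisions.
division_weight = country_weight * (1 / n_divisions["USA"])

print(country_weight, division_weight)  # 0.25 0.125
```

With 3 country-level rows (A, B, C) plus the USA's 2 division-level rows, the weights sum to 1, i.e. the file still partitions the data completely.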
Thanks for spelling things out in such detail @victorlin. A couple thoughts:
1. I'd suggest encoding weights independently for column 1 (eg `country`) vs column 2 (eg `month`). The situations where we have an interaction effect between weights seem quite limited (I can't think of an immediate example in existing subsampling routines). I could easily write this YAML file for ncov, while for the fully specified TSV example, I'd need a script that generates a large number of combinations (that I don't actually care about).
Note that you could still encode interaction terms in a YAML file, eg:

```yaml
# Weight countries by population size.
country month:
  A 2020-01: 10
  B 2020-01: 10
  C 2020-01: 3
  D 2020-01: 1
  E 2020-01: 6
  A 2020-02: 10
  B 2020-02: 10
  C 2020-02: 3
  D 2020-02: 1
  E 2020-02: 6
```
Again, I believe that independent columns will cover >90% of use cases and won't force people to write intermediate scripts when they have multiple columns they care about.
2. This would necessitate all possible values of a column to have a weight so that the denominator can be calculated by summing up all the values.

> The list of weights should partition the data completely. In the example above, all combinations of country and month present in the data must be provided a weight. Raise user error if it doesn't.

To avoid enforced verbosity, I could also imagine assuming a weight of `1` for any missing entries, but raising a warning saying that missing values have been assumed to be `1`.
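The fallback behavior suggested here (assume `1` for missing entries, but warn) could be sketched as follows; `resolve_weights` is a hypothetical helper for illustration, not an Augur function:

```python
import warnings

def resolve_weights(groups, weights):
    # Hypothetical helper: return a weight for every group present in the
    # data, assuming 1 for any group missing from the weights mapping and
    # warning about it (instead of raising a hard error).
    missing = sorted(g for g in groups if g not in weights)
    if missing:
        warnings.warn(f"Assuming weight of 1 for groups without an entry: {missing}")
    return {g: weights.get(g, 1) for g in groups}

resolved = resolve_weights(
    groups=[("A", "2020-01"), ("B", "2020-01")],
    weights={("A", "2020-01"): 10},
)
print(resolved)  # {('A', '2020-01'): 10, ('B', '2020-01'): 1}
```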
> I could easily write this YAML file for ncov

My speculative hesitation with YAML is that it'll be hard to translate from a source file, e.g. case counts, which are typically in TSV format (but I haven't actually tried). YAML would definitely be easier for manually defining simple weighting logic such as "2x sequences from region A compared to B".
> I'd need a script that generates a large number of combinations (that I don't actually care about).

Good point. The combinations need to be programmatically generated somewhere along the line. If providing weights as YAML, the subsampling tool would internally generate weights per group, analogous to the TSV.
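That internal expansion could look like this sketch, assuming no interaction between columns so that combined weights are products of the per-column weights; the column names and weight values here are invented:

```python
from itertools import product

# Hypothetical independent per-column weights, as they might appear in YAML.
column_weights = {
    "country": {"A": 2, "B": 1},
    "month": {"2020-01": 1, "2020-02": 3},
}

# Expand into per-group weights analogous to the fully specified TSV:
# one entry per combination, weight = product across columns.
columns = list(column_weights)
per_group = {}
for combo in product(*(column_weights[c] for c in columns)):
    weight = 1
    for col, value in zip(columns, combo):
        weight *= column_weights[col][value]
    per_group[combo] = weight

print(per_group[("A", "2020-02")])  # 2 * 3 = 6
```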
I think it'd be manageable to first implement the underlying logic and allow configuration via both YAML and TSV to get a feel for what works better under different scenarios.
> To avoid enforced verbosity I could also imagine assuming a weight of `1` for any missing entries.

My (again speculative) concern is that there may be few cases in which `1` is a useful default, especially if weights are based on case counts or population size.
This seems like a small behavioral detail where we'll only know the right choice once we have an implementation to test against real-world usage. We could start with errors and revisit if the enforced verbosity turns out to be overkill.
After working on https://github.com/nextstrain/ncov/commit/0fd6861b5b550306160e1ce92c65b0c65085e096, I've realized that in order to reduce the number of samples (i.e. calls to `augur filter`) in the workflow, `augur filter` will need the extended implementation that allows partitioning of the data at different resolutions. I don't see how the initial implementation will simplify the ncov workflow.
Here's an idea: implement weighted subsampling as a part of `augur subsample` and configure it in the new YAML.
Using the currently proposed YAML as-is would look something like:
```yaml
samples:
  north_america_6m:
    size: 4000
    weights:
      # Region weighting: 4:1 for North America to rest of world
      region:
        North America: 4
        # Africa: 1
        # Asia: 1
        # Europe: 1
        # …
      # Time weighting: 4:1 for recent sequences to early sequences (2M threshold)
      month:
        # 2020-01: 1
        # 2020-02: 1
        # …
        2024-02: 4
        2024-03: 4
```
Issues:

- If missing entries are assumed to have a weight of `1`, the weighting will change from 4:1 North America to rest of the world to 4:1 North America to every other region (i.e. 4:6 North America to rest of the world). Time weighting is similarly affected.
- There's no way to specify `--group-by` columns for individual samples.

Here's an alternative which addresses those issues:
```yaml
samples:
  north_america_6m:
    size: 4000
    partitions:
      # Region weighting: 4:1 for North America to rest of world
      region:
        - query: region == 'North America'
          weight: 4
          uniform_sampling: division
        - query: region != 'North America'
          weight: 1
          uniform_sampling: country
      # Time weighting: 4:1 for recent sequences to early sequences (2M threshold)
      month:
        - query: date >= 2M
          weight: 4
          uniform_sampling: week
        - query: date < 2M
          weight: 1
          uniform_sampling: month
```
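One way to read the `partitions` configuration: each partition's `query` selects records, and the target `size` is split across partitions in proportion to `weight`. A rough sketch of that allocation, with plain-Python predicates standing in for the pandas-style query strings:

```python
# Partitions for the region example above: 4:1 North America to rest of world.
partitions = [
    {"match": lambda record: record["region"] == "North America", "weight": 4},
    {"match": lambda record: record["region"] != "North America", "weight": 1},
]

def partition_sizes(total_size, partitions):
    # Split the total target size across partitions proportionally to weight.
    total_weight = sum(p["weight"] for p in partitions)
    return [round(total_size * p["weight"] / total_weight) for p in partitions]

sizes = partition_sizes(4000, partitions)
print(sizes)  # [3200, 800]
```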
After thinking more along the lines of implementing this in `augur subsample`, I've realized there are two types of weighted sampling:

I think these can be implemented separately, where (1) can be YAML-based and (2) can be TSV-based. I've added more detail and examples in the subsampling doc.
Thanks for the thoughts @victorlin. I'll try to pull together a more cohesive thread for how I'd see this working for the ncov example. But broadly, I like the general idea of encoding weights independently between categories (`country` vs `month`, for example) and assuming no interaction between categories. I.e. if you have a weight of 4 for North America and 1 for global context, and a weight of 4 for recent samples and 1 for older samples, then I'd assume a sampling weight of 4×4 = 16 for recent North America, 1×4 = 4 for recent global, 4×1 = 4 for older North America, and 1×1 = 1 for older global.
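That arithmetic in code form (the category labels here are shorthand for illustration, not actual metadata values):

```python
# Independent category weights: region (North America vs global context)
# and time (recent vs older samples).
region_weights = {"North America": 4, "global": 1}
time_weights = {"recent": 4, "older": 1}

# With no interaction between categories, combined weights are products.
combined = {
    (region, period): region_weights[region] * time_weights[period]
    for region in region_weights
    for period in time_weights
}

print(combined[("North America", "recent")])  # 4 * 4 = 16
print(combined[("global", "older")])          # 1 * 1 = 1
```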
`augur frequencies` has a weights interface which was never discussed here:

It is different in that it only allows weighting on a single column (defined by `--weights-attribute`) and the file format is JSON instead of TSV.
I've considered the idea of swapping `--group-by-weights` with `--weights` + `--weights-attribute` for the sake of consistency across Augur. It's definitely possible, but I've decided against it for the following reasons, ordered from most to least important:

1. Multiple weighted columns aren't supported by the `--weights` + `--weights-attribute` interface. Those are already supported by the TSV format and I would rather leave the support built-in than pending a redesign of the `--weights` + `--weights-attribute` interface.
2. `--group-by-weights` pairs nicely with `--group-by`. This is useful because all weighted columns must be passed to `--group-by`, i.e. `--group-by-weights` is an extension of `--group-by`. It wouldn't be so obvious with `--weights-attribute`.
3. `--group-by-weights` in #1454 has already implemented various checks for the TSV format.

I think that `--group-by-weights` is helpfully clear when paired with the familiar `--group-by`. I support your decision to keep the interface as currently implemented.
`augur filter --group-by-weights` was released in Augur 25.3.0.
Context

Currently, `--subsample-max-sequences` effectively calculates a value for `--sequences-per-group` which applies to all groups specified by `--group-by`.

This behavior does not work for all scenarios. Example: there are 5 countries with vastly different population sizes, and 60 sequences are requested for subsampling. The command would be:

This means 12 sequences will be sampled from each country, regardless of population size. One may want the sample to be representative of country population sizes rather than uniform. This limitation is what prompts higher-level subsampling workarounds such as https://github.com/nextstrain/ncov/pull/1074.
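The limitation in numbers, as a sketch; the per-country weights below are invented, chosen only so that the totals match the worked example later in this issue (total weight 3000, with A = 1000 and C = 300):

```python
n_countries = 5
max_sequences = 60

# Current behavior: an even split across groups, regardless of population.
uniform_per_group = max_sequences // n_countries
print(uniform_per_group)  # 12

# Desired behavior: a weighted split, using hypothetical population-like weights.
weights = {"A": 1000, "B": 900, "C": 300, "D": 500, "E": 300}
total = sum(weights.values())  # 3000
weighted = {country: max_sequences * w // total for country, w in weights.items()}
print(weighted["A"], weighted["C"])  # 20 6
```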
Tasks
Rollout
Original proposed solution
Implement an option `--subsample-weights`, which reads a file that specifies weights per `--group-by` column. A simple example, `weights.yaml`:

With this information, a different number of sequences can be calculated per group.
- `A` would have 60*1000/3000 = 20 sequences.
- `C` would have 60*300/3000 = 6 sequences.

The absence of a column from the weights file can imply equal weighting. In other words, the example can be updated to use `--group-by country month` while keeping `weights.yaml` as-is to have weighted `country` sampling for each time bin.

Or, a more complex example where time is also weighted:
`weights.yaml`:

Notes: