filter: Reduce over-sampling in partial months with `--group-by month`

victorlin commented 2 years ago

Context

@trvrb from nextstrain/ncov#957:

With timespans this narrow, there is some unavoidable funny interaction with how augur filter subsamples based on --vpm, i.e. viruses per month. We commonly have situations where, if the current date is, say, May 15, we end up with

min date of March 15

desire by augur filter to equally sample viruses from March, April and May categories

so that March and May have 2 weeks for sampling of X viruses and April has 4 weeks for sampling of X viruses. This results in more densely sampled, in terms of viruses per day, months of March and May compared to April.

This effect will be more pronounced in scenarios where current date is, say, May 28, and so X viruses are sampled in 3 days in March and 30 days in April.

To fully address this we'd need to extend augur filter to have the option of per-week sampling categories in addition to per-month sampling categories. Or perhaps some continuous specification. However, I don't think this is too big of an issue in terms of the current PR and it's something we can refine once Augur is updated.

Example

cat > metadata.tsv << ~~
strain  date
SEQ1    2022-03-21
SEQ2    2022-03-22
SEQ3    2022-03-23
SEQ4    2022-04-01
SEQ5    2022-04-02
SEQ6    2022-04-03
SEQ7    2022-05-01
SEQ8    2022-05-02
SEQ9    2022-05-03
SEQ10   2022-05-04
~~

augur filter \
--metadata metadata.tsv \
--min-date 2022-03-15 \
--max-date 2022-05-15 \
--group-by year month \
--subsample-max-sequences 8 \
--subsample-seed 0 \
--output-metadata out.tsv
# Sampling at 2 per group.
# 4 strains were dropped during filtering
#   4 of these were dropped because of subsampling criteria
# 6 strains passed all filters

cat out.tsv | sort -k 2
# SEQ1  2022-03-21
# SEQ2  2022-03-22
# SEQ4  2022-04-01
# SEQ5  2022-04-02
# SEQ7  2022-05-01
# SEQ9  2022-05-03
# strain    date

With --subsample-max-sequences, augur filter samples evenly from the 3 groups 2022-03, 2022-04, and 2022-05. However, --min-date and --max-date restrict the sampling window to roughly half of 2022-03, all of 2022-04, and half of 2022-05. An ideal sampling strategy would sample in proportion to the length of each group's window (e.g. a 2/4/2 split).
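To make the 2/4/2 split concrete, here is a minimal Python sketch of one way such targets could be computed: count the days of the date window that fall in each month, then divide the requested total in proportion, using largest-remainder rounding. This is not how augur filter currently behaves, and the function names (`days_per_month`, `proportional_targets`) are hypothetical, chosen only for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def days_per_month(min_date, max_date):
    """Count the days of the inclusive window that fall in each (year, month)."""
    counts = Counter()
    day = min_date
    while day <= max_date:
        counts[(day.year, day.month)] += 1
        day += timedelta(days=1)
    return dict(counts)


def proportional_targets(min_date, max_date, max_sequences):
    """Split max_sequences across months in proportion to days covered.

    Hypothetical helper: uses largest-remainder rounding so that the
    targets sum to exactly max_sequences.
    """
    days = days_per_month(min_date, max_date)
    total_days = sum(days.values())
    exact = {group: max_sequences * n / total_days for group, n in days.items()}
    targets = {group: int(value) for group, value in exact.items()}
    leftover = max_sequences - sum(targets.values())
    # Hand the remaining slots to the groups with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda g: exact[g] - targets[g], reverse=True)
    for group in by_remainder[:leftover]:
        targets[group] += 1
    return targets


targets = proportional_targets(date(2022, 3, 15), date(2022, 5, 15), 8)
print(targets)  # {(2022, 3): 2, (2022, 4): 4, (2022, 5): 2}
```

For the window in the example (17 days of March, 30 of April, 15 of May), this yields the 2/4/2 split. The toy metadata above only has 3 April sequences, so a real implementation would also need to decide what to do when a group cannot fill its target.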

victorlin commented 2 years ago

have the option of per-week sampling categories in addition to per-month sampling categories.

I don't think --group-by week is right:

It's more difficult to extract that info from YYYY-MM-DD.
There will be the same problem of over-sampling with "partial weeks" if using something like --min-date 1M which translates to 4W and some change.
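As a rough illustration of the first point, Python's standard library can derive an ISO 8601 week from a YYYY-MM-DD string, but the week's year does not always match the calendar year, so a week-based group would need its own year value. The helper name `iso_year_week` is hypothetical and this is only a sketch, not a proposal for how augur should implement it.

```python
from datetime import date


def iso_year_week(date_string):
    """Return (ISO year, ISO week number) for a YYYY-MM-DD string."""
    iso = date.fromisoformat(date_string).isocalendar()
    return iso[0], iso[1]


# 2022-01-01 is a Saturday and belongs to the last ISO week of 2021.
print(iso_year_week("2022-01-01"))  # (2021, 52)

# The example window starts on a Tuesday, so its first week is partial.
print(iso_year_week("2022-03-15"))  # (2022, 11)
print(date.fromisoformat("2022-03-15").isocalendar()[2])  # 2 (Tuesday)
```

The second point still applies: any window that does not start on a Monday or end on a Sunday leaves a partial week at the edge, just as partial months do.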

Or perhaps some continuous specification.

This seems right to me. It is fairly straightforward to enable --group-by day so we can have --group-by ... year month day for the "continuous" approach. Run time might be impacted since this creates ~30x more groups compared to --group-by ... year month. Are there any other drawbacks to this approach?

trvrb commented 2 years ago

I definitely take your point on --group-by week, but there are some funny interactions here. In the current system we're often mashing together geography and time into our sampling categories, so we end up with effectively:

UK Apr 2022
UK May 2022
Spain Apr 2022
Spain May 2022
Africa Apr 2022
Africa May 2022 etc...

for current Europe-focused ncov builds. With a 6-month focus we have 6 months x 46 countries = 276 categories. If this was days, we'd have 180 days x 46 countries = 8280 categories. I believe (but could be confused) that by random picking among the 8280 we'd be biasing towards temporal diversity and away from geographic diversity relative to the 276 category scenario. I.e. with ~3000 tips in the 276 category scenario you'd have ~11 per country and ~2 per month pretty systematically. But in the 8280 category scenario, I'd think that stochastically you might have different counts per country as each category would be picked ~1/3 of the time. (I might be thinking about this wrong, feel like I'd want to test to confirm)
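A quick way to test this intuition is a toy simulation in Python. It is a sketch under strong assumptions that are not taken from augur's actual sampling code: every (country, day) category contains at least one sequence, and the 3000 tips are obtained by picking 3000 of the 8280 categories uniformly at random, one sequence each. All names here are hypothetical.

```python
import random
from collections import Counter

N_COUNTRIES = 46
TARGET = 3000


def mean_per_category(n_time_bins):
    """Average number of tips per (country, time bin) category."""
    return TARGET / (n_time_bins * N_COUNTRIES)


def simulate_daily_counts(seed=0, n_days=180):
    """Pick TARGET (country, day) categories at random; count picks per country."""
    rng = random.Random(seed)
    categories = [(c, d) for c in range(N_COUNTRIES) for d in range(n_days)]
    chosen = rng.sample(categories, TARGET)
    return Counter(country for country, _ in chosen)


print(round(mean_per_category(6), 2))    # 10.87 per country-month category
print(round(mean_per_category(180), 2))  # 0.36, i.e. ~1/3 of day categories picked

counts = simulate_daily_counts()
# Spread of per-country totals under daily grouping (mean is 3000 / 46, about 65).
print(min(counts.values()), max(counts.values()))
```

Under monthly grouping, each country would get roughly the same total by construction, so the gap between the minimum and maximum printed above gives a rough sense of how much extra per-country variation the daily grouping introduces under these assumptions.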

corneliusroemer commented 1 year ago

Group by day is not good, because daily sequencing volume varies a lot whereas weekly volume does not. There's not much collection on Saturdays, Sundays, etc.

Weekly is the right way to go for now - definitely better than just monthly.

Sorry I only see this now.

nextstrain / augur

filter: Reduce over-sampling in partial months with `--group-by month` #960
