Closed victorlin closed 1 year ago
have the option of per-week sampling categories in addition to per-month sampling categories.
I don't think --group-by week is right:
YYYY-MM-DD.
--min-date 1M, which translates to 4W and some change. Or perhaps some continuous specification.
This seems right to me. It is fairly straightforward to enable --group-by day so we can have --group-by ... year month day for the "continuous" approach. Run time might be impacted since this creates ~30x more groups compared to --group-by ... year month. Are there any other drawbacks to this approach?
I definitely take your point on --group-by week, but there are some funny interactions here. In the current system we're often mashing together geography and time into our sampling categories, so we end up with effectively:
for current Europe-focused ncov builds. With a 6-month focus we have 6 months x 46 countries = 276 categories. If this were days, we'd have 180 days x 46 countries = 8280 categories. I believe (but could be confused) that by random picking among the 8280 we'd be biasing towards temporal diversity and away from geographic diversity relative to the 276 category scenario. I.e., with ~3000 tips in the 276 category scenario you'd have ~11 per country and ~2 per month pretty systematically. But in the 8280 category scenario, I'd think that stochastically you might have different counts per country, as each category would be picked ~1/3 of the time. (I might be thinking about this wrong; I feel like I'd want to test to confirm.)
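A quick arithmetic sketch of the figures in this comment. It only reproduces the back-of-the-envelope numbers (46 countries, 6 months, 180 days, ~3000 tips); the variable names are illustrative and nothing here calls augur.

```python
# Back-of-the-envelope check of the category counts quoted in the comment.
# All inputs are taken from the comment; the names are made up for this sketch.
countries = 46
months = 6
days = 180
tips = 3000

monthly_groups = months * countries   # country x month categories
daily_groups = days * countries       # country x day categories

# Expected number of sampled sequences per category under even sampling.
per_monthly_group = tips / monthly_groups
per_daily_group = tips / daily_groups

print(monthly_groups, round(per_monthly_group, 1))  # 276 10.9
print(daily_groups, round(per_daily_group, 2))      # 8280 0.36
```

With roughly 0.36 expected sequences per daily category, most categories are either hit once or not at all, which is where the stochastic per-country variation described above would come from.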
Group by day is not good, because daily sequencing volume varies a lot whereas weekly volume does not. There's not much collection on Saturdays, Sundays, etc.
Weekly is the right way to go for now - definitely better than just monthly.
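To make the weekly idea concrete, here is a minimal sketch of one possible way to derive a week label from a YYYY-MM-DD date, assuming ISO 8601 weeks (Monday to Sunday). This is an assumption for illustration, not a description of how augur defines or implements a week; `week_key` and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import date


def week_key(d: date) -> str:
    """Return an ISO 8601 year-week label such as '2022-W13'."""
    iso_year, iso_week, _ = d.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"


# Hypothetical collection dates: a weekday, a weekend, the following Monday,
# and a date whose ISO week belongs to the previous calendar year.
dates = [
    date(2022, 3, 28),
    date(2022, 4, 2),
    date(2022, 4, 3),
    date(2022, 4, 4),
    date(2021, 1, 3),
]
counts = Counter(week_key(d) for d in dates)
print(counts)
```

Note that weekend dates fall into the same group as the preceding weekdays, which smooths out the low weekend volume, and that an ISO week can span two months or two calendar years (2021-01-03 lands in 2020-W53).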
Sorry I only see this now.
Context
@trvrb from nextstrain/ncov#957:
Example
When requesting --subsample-max-sequences, this will evenly sample from the 3 groups 2022-03, 2022-04, 2022-05. However, note that the --min-date and --max-date make the sampling window half of 2022-03, all of 2022-04, and half of 2022-05. An ideal sampling strategy would sample proportional to the sampling window (e.g. a 2/4/2 split).
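The proportional split can be sketched as follows. This is a hedged illustration rather than augur behavior: `proportional_allocation` is a hypothetical helper, and the dates (2022-03-16 to 2022-05-15) and the total of 8 sequences are assumptions chosen to match the "half / all / half" window and the 2/4/2 split. Each month receives a share proportional to the number of its days inside the window, with largest-remainder rounding so the shares sum to the requested total.

```python
from datetime import date, timedelta


def proportional_allocation(min_date: date, max_date: date, total: int) -> dict:
    """Split `total` across calendar months in proportion to the number of
    days of each month that fall inside [min_date, max_date]."""
    # Count the days of the window that land in each year-month group.
    days_per_month = {}
    d = min_date
    while d <= max_date:
        key = f"{d.year}-{d.month:02d}"
        days_per_month[key] = days_per_month.get(key, 0) + 1
        d += timedelta(days=1)

    window = sum(days_per_month.values())
    exact = {k: total * n / window for k, n in days_per_month.items()}

    # Largest-remainder rounding: floor every share, then hand the leftover
    # sequences to the groups with the largest fractional parts.
    alloc = {k: int(v) for k, v in exact.items()}
    leftover = total - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


result = proportional_allocation(date(2022, 3, 16), date(2022, 5, 15), 8)
print(result)  # {'2022-03': 2, '2022-04': 4, '2022-05': 2}
```

By contrast, an even split of 8 sequences across the three month groups would over-represent the two partial months relative to April.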