nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0
268 stars 129 forks source link

Align fractional sequences per group conversion methods #1614

Open victorlin opened 1 month ago

victorlin commented 1 month ago

Probabilistic sampling is necessary when targeting a fractional number of sequences per group. This is possible in two scenarios:

  1. Uniform sampling when the number of groups exceeds the number of total requested sequences
  2. Weighted sampling when this expression evaluates to a fractional number:

    $$ \frac{\text{weight}}{\sum{\text{weights}}} * \text{total requested sequences} $$

Each scenario implements its own way of converting the fractional number to a whole number that can be used for sampling.

  1. Poisson sampling method: sample from a Poisson distribution with the mean $\lambda$ being the fractional number of sequences per group which is constant across all groups.

    https://github.com/nextstrain/augur/blob/47c83e06422e7a1d1437035e99025ee2a7d746d9/augur/filter/subsample.py#L299

  2. Probabilistic rounding method: round the number probabilistically by adding a random number between [0,1) and truncating the decimal part. The Poisson sampling method does not work for weighted sampling because the fractional number of sequences per group is not guaranteed to be constant across all groups.

    https://github.com/nextstrain/augur/blob/47c83e06422e7a1d1437035e99025ee2a7d746d9/augur/filter/subsample.py#L452

Proposed change

Replace the Poisson sampling method with the probabilistic rounding method. A notable difference: with the Poisson sampling method, there is a slim chance for 2 or more sequences per group. That would not happen with the probabilistic rounding method. I think that is fine and even preferred because it avoids the possibility of under-sampling.

This can be explained through example: --group-by month with 12 months and --subsample-max-sequences 10. This means the fractional number of sequences per group should be¹ $\frac{10}{12} \approx 0.83$. For each of the 12 months we need a whole number of sequences per group.

Probabilistic rounding would be: 0.83 has a 83% chance of rounding to 1 and a 17% chance of rounding to 0.