Each scenario implements its own way of converting the fractional number to a whole number that can be used for sampling.
Poisson sampling method: sample from a Poisson distribution with the mean $\lambda$ being the fractional number of sequences per group which is constant across all groups.
Probabilistic rounding method: round the number probabilistically by adding a random number between [0,1) and truncating the decimal part. The Poisson sampling method does not work for weighted sampling because the fractional number of sequences per group is not guaranteed to be constant across all groups.
Replace the Poisson sampling method with the probabilistic rounding method. A notable difference: with the Poisson sampling method, there is a slim chance for 2 or more sequences per group. That would not happen with the probabilistic rounding method. I think that is fine and even preferred because it avoids the possibility of under-sampling.
This can be explained through example: --group-by month with 12 months and --subsample-max-sequences 10. This means the fractional number of sequences per group should be¹ $\frac{10}{12} \approx 0.83$. For each of the 12 months we need a whole number of sequences per group.
Probabilistic rounding would be: 0.83 has a 83% chance of rounding to 1 and a 17% chance of rounding to 0.
Probabilistic sampling is necessary when targeting a fractional number of sequences per group. This is possible in two scenarios:
Weighted sampling when this expression evaluates to a fractional number:
$$ \frac{\text{weight}}{\sum{\text{weights}}} * \text{total requested sequences} $$
Each scenario implements its own way of converting the fractional number to a whole number that can be used for sampling.
Poisson sampling method: sample from a Poisson distribution with the mean $\lambda$ being the fractional number of sequences per group which is constant across all groups.
https://github.com/nextstrain/augur/blob/47c83e06422e7a1d1437035e99025ee2a7d746d9/augur/filter/subsample.py#L299
Probabilistic rounding method: round the number probabilistically by adding a random number between [0,1) and truncating the decimal part. The Poisson sampling method does not work for weighted sampling because the fractional number of sequences per group is not guaranteed to be constant across all groups.
https://github.com/nextstrain/augur/blob/47c83e06422e7a1d1437035e99025ee2a7d746d9/augur/filter/subsample.py#L452
Proposed change
Replace the Poisson sampling method with the probabilistic rounding method. A notable difference: with the Poisson sampling method, there is a slim chance for 2 or more sequences per group. That would not happen with the probabilistic rounding method. I think that is fine and even preferred because it avoids the possibility of under-sampling.
This can be explained through example:
--group-by month
with 12 months and--subsample-max-sequences 10
. This means the fractional number of sequences per group should be¹ $\frac{10}{12} \approx 0.83$. For each of the 12 months we need a whole number of sequences per group.Probabilistic rounding would be: 0.83 has a 83% chance of rounding to 1 and a 17% chance of rounding to 0.