Use exact fractional sequences per group

nextstrain / augur

Pipeline components for real-time phylodynamic analysis

https://docs.nextstrain.org/projects/augur/

GNU Affero General Public License v3.0

268 stars 128 forks source link

Use exact fractional sequences per group #1599

Closed victorlin closed 2 months ago

victorlin commented 3 months ago

Description of proposed changes

The previous _calculate_fractional_sequences_per_group() was an approximation of this exact value. The approximation could return a fractional value above 1, which would fail the assertion in get_probabilistic_group_sizes().

Related issue(s)

Closes #1588
Closes #1598

Checklist

[x] Automated checks pass
[x] Check if you need to add a changelog message
[x] Check if you need to add tests
[x] Check if you need to update docs

corneliusroemer commented 3 months ago

Thanks! Is the linked issue correct? Shouldn't it close #1598? Rather than #1588

victorlin commented 3 months ago

Thanks, it should close both. I've updated the PR description.

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 70.99%. Comparing base (47c83e0) to head (d73dbee).

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #1599 +/- ## ========================================== - Coverage 71.02% 70.99% -0.04% ========================================== Files 79 79 Lines 8256 8247 -9 Branches 2003 2001 -2 ========================================== - Hits 5864 5855 -9 Misses 2101 2101 Partials 291 291 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

corneliusroemer commented 3 months ago

There's a reason the original calculation is the way it is. I don't think that's a bug or unnecessary.

Imagine there are 90 groups with 1 sequence and 10 groups with 1000. We want to sample 1000 sequences.

The original calculation would pick around 91 sequences per group.

Yours now picks 10 per group.

The original resulted in around 1000 sampled sequences.

Your calculation now in only 190.

victorlin commented 3 months ago

@corneliusroemer that scenario is not affected by the changes here. I've responded in more detail at https://github.com/nextstrain/augur/issues/1588#issuecomment-2307563375

victorlin commented 2 months ago

I meant to release this as part of 25.4.0 but forgot to merge 🤦 it will come in the next release.