nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0

Subsample has uncaught assertion in ncov #1598

Closed by corneliusroemer 1 month ago

corneliusroemer commented 2 months ago

ncov errored last night. It seems to be related to filter/subsample.

Here's the log: https://github.com/nextstrain/ncov/actions/runs/10521511438/job/29152346864#step:5:1

[batch] [2024-08-23T07:36:29+00:00]         Subsample all sequences by 'context_early' scheme for build 'south-america_1m' with the following parameters:
[batch] [2024-08-23T07:36:29+00:00]          - group by: --group-by country year month
[batch] [2024-08-23T07:36:29+00:00]          - sequences per group: 
[batch] [2024-08-23T07:36:29+00:00]          - subsample max sequences: --subsample-max-sequences 160
[batch] [2024-08-23T07:36:29+00:00]          - min-date: 
[batch] [2024-08-23T07:36:29+00:00]          - max-date: --max-date 1M
[batch] [2024-08-23T07:36:29+00:00]          - 
[batch] [2024-08-23T07:36:29+00:00]          - exclude: --exclude-where 'region=South America'
[batch] [2024-08-23T07:36:29+00:00]          - include: 
[batch] [2024-08-23T07:36:29+00:00]          - query: 
[batch] [2024-08-23T07:36:29+00:00]          - priority: 
[batch] [2024-08-23T07:36:29+00:00]         
[batch] [2024-08-23T07:36:29+00:00] Reason: Missing output files: results/south-america_1m/sample-context_early.txt; Input files updated by another job: results/gisaid_21L_metadata.tsv.zst
[batch] [2024-08-23T07:36:29+00:00]         augur filter             --metadata results/gisaid_21L_metadata.tsv.zst             --include nextstrain_profiles/nextstrain-gisaid-21L/include.txt             --exclude defaults/exclude.txt                          --max-date 1M             --exclude-where 'region=South America'                                                                 --group-by country year month                                       --subsample-max-sequences 160                          --output-strains results/south-america_1m/sample-context_early.txt 2>&1 | tee logs/subsample_south-america_1m_context_early.txt
[batch] [2024-08-23T07:36:29+00:00]         
[batch] [2024-08-23T07:36:48+00:00] Sampling at 1 per group.
[batch] [2024-08-23T07:36:55+00:00] WARNING: Asked to provide at most 400 sequences, but there are 406 groups.
[batch] [2024-08-23T07:36:55+00:00] Sampling probabilistically at 1.0254 sequences per group, meaning it is possible to have more than the requested maximum of 400 sequences after filtering.
[batch] [2024-08-23T07:36:55+00:00] Traceback (most recent call last):
[batch] [2024-08-23T07:36:55+00:00]   File "/nextstrain/augur/augur/__init__.py", line 70, in run
[batch] [2024-08-23T07:36:55+00:00]     return args.__command__.run(args)
[batch] [2024-08-23T07:36:55+00:00]   File "/nextstrain/augur/augur/filter/__init__.py", line 135, in run
[batch] [2024-08-23T07:36:55+00:00]     return _run(args)
[batch] [2024-08-23T07:36:55+00:00]   File "/nextstrain/augur/augur/filter/_run.py", line 295, in run
[batch] [2024-08-23T07:36:55+00:00]     group_sizes = get_probabilistic_group_sizes(
[batch] [2024-08-23T07:36:55+00:00]   File "/nextstrain/augur/augur/filter/subsample.py", line 285, in get_probabilistic_group_sizes
[batch] [2024-08-23T07:36:55+00:00]     assert target_group_size < 1.0
[batch] [2024-08-23T07:36:55+00:00] AssertionError
[batch] [2024-08-23T07:36:55+00:00] An error occurred (see above) that has not been properly handled by Augur.
[batch] [2024-08-23T07:36:55+00:00] To report this, please open a new issue including the original command and the error above:
[batch] [2024-08-23T07:36:55+00:00]     <https://github.com/nextstrain/augur/issues/new/choose>
[batch] [2024-08-23T07:36:56+00:00] [Fri Aug 23 07:36:55 2024]
[batch] [2024-08-23T07:36:56+00:00] Error in rule subsample:
[batch] [2024-08-23T07:36:56+00:00]     jobid: 108
[batch] [2024-08-23T07:36:56+00:00]     input: results/gisaid_21L_metadata.tsv.zst, nextstrain_profiles/nextstrain-gisaid-21L/include.txt, nextstrain_profiles/nextstrain-gisaid-21L/include.txt, defaults/exclude.txt
[batch] [2024-08-23T07:36:56+00:00]     output: results/global_2m/sample-north_america_recent.txt
[batch] [2024-08-23T07:36:56+00:00]     log: logs/subsample_global_2m_north_america_recent.txt (check log file(s) for error details)
[batch] [2024-08-23T07:36:56+00:00]     conda-env: /nextstrain/build/.snakemake/conda/ef7f392b0ecf86741cd7c0bee42f4f0e_
[batch] [2024-08-23T07:36:56+00:00]     shell:
[batch] [2024-08-23T07:36:56+00:00]         
[batch] [2024-08-23T07:36:56+00:00]         augur filter             --metadata results/gisaid_21L_metadata.tsv.zst             --include nextstrain_profiles/nextstrain-gisaid-21L/include.txt             --exclude defaults/exclude.txt             --min-date 2M                          --exclude-where 'region!=North America'                                                                 --group-by division week                                       --subsample-max-sequences 400                          --output-strains results/global_2m/sample-north_america_recent.txt 2>&1 | tee logs/subsample_global_2m_north_america_recent.txt
[batch] [2024-08-23T07:36:56+00:00]         
[batch] [2024-08-23T07:36:56+00:00]         (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
[batch] [2024-08-23T07:36:56+00:00] Logfile logs/subsample_global_2m_north_america_recent.txt:
[batch] [2024-08-23T07:36:56+00:00] ================================================================================
[batch] [2024-08-23T07:36:56+00:00] WARNING: Asked to provide at most 400 sequences, but there are 406 groups.
[batch] [2024-08-23T07:36:56+00:00] Sampling probabilistically at 1.0254 sequences per group, meaning it is possible to have more than the requested maximum of 400 sequences after filtering.
[batch] [2024-08-23T07:36:56+00:00] Traceback (most recent call last):
[batch] [2024-08-23T07:36:56+00:00]   File "/nextstrain/augur/augur/__init__.py", line 70, in run
[batch] [2024-08-23T07:36:56+00:00]     return args.__command__.run(args)
[batch] [2024-08-23T07:36:56+00:00]   File "/nextstrain/augur/augur/filter/__init__.py", line 135, in run
[batch] [2024-08-23T07:36:56+00:00]     return _run(args)
[batch] [2024-08-23T07:36:56+00:00]   File "/nextstrain/augur/augur/filter/_run.py", line 295, in run
[batch] [2024-08-23T07:36:56+00:00]     group_sizes = get_probabilistic_group_sizes(
[batch] [2024-08-23T07:36:56+00:00]   File "/nextstrain/augur/augur/filter/subsample.py", line 285, in get_probabilistic_group_sizes
[batch] [2024-08-23T07:36:56+00:00]     assert target_group_size < 1.0
[batch] [2024-08-23T07:36:56+00:00] AssertionError
[batch] [2024-08-23T07:36:56+00:00] An error occurred (see above) that has not been properly handled by Augur.
[batch] [2024-08-23T07:36:56+00:00] To report this, please open a new issue including the original command and the error above:
[batch] [2024-08-23T07:36:56+00:00]     <https://github.com/nextstrain/augur/issues/new/choose>
[batch] [2024-08-23T07:36:56+00:00] 
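
The two log lines just before the traceback explain the setup: with 406 groups and a cap of 400 sequences, the filter switches to probabilistic sampling, where each group contributes a small random number of sequences rather than a fixed whole number per group. A rough, self-contained illustration of why that can exceed the requested maximum, using the numbers from the log (the Poisson draw and the names here are assumptions for this sketch, not Augur's actual code path):

import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
n_groups = 406                  # number of groups reported in the log
rate = 1.0254                   # "sequences per group" reported in the log

# If each group contributes a Poisson(rate) count, the expected total is
# 406 * 1.0254 ≈ 416, i.e. above the requested maximum of 400 sequences.
print(rng.poisson(rate, size=n_groups).sum())
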
victorlin commented 2 months ago

Given the timing, my initial suspicion was yesterday's release of #1454. That PR moved this code around, and I thought the conditions might have gotten mixed up in the change to augur/filter/_run.py.

A deeper inspection shows that it's unrelated; I believe the timing is just a coincidence. The error is an approximation issue that can be fixed by addressing #1588:

import augur.filter.subsample

400 / 406
# 0.9852 <- the "exact" target_group_size is less than 1, which would pass the assertion

augur.filter.subsample._calculate_fractional_sequences_per_group(400, [1] * 406)
# 1.0254 <- the "approximated" target_group_size is greater than 1, which fails the assertion
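
To make the gap concrete: a binary-search style estimate that stops once its bracketing interval is within a coarse relative tolerance can settle above 1.0 even though the exact ratio 400/406 ≈ 0.9852 is below it. The sketch below is a simplified stand-in with illustrative constants (the 1e-5 starting point, the 1.1 tolerance, and the min(mid, size) total are assumptions for this sketch, not a claim about Augur's exact implementation):

def estimate_sequences_per_group(target_max, group_sizes, tolerance=1.1):
    """Binary-search a per-group rate whose expected total stays at or below
    target_max, stopping once hi/lo <= tolerance (illustrative only)."""
    lo, hi = 1e-5, float(target_max)
    while hi / lo > tolerance:
        mid = (lo + hi) / 2
        expected_total = sum(min(mid, size) for size in group_sizes)
        if expected_total <= target_max:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(estimate_sequences_per_group(400, [1] * 406))  # ~1.0254, above 1.0
print(400 / 406)                                     # ~0.9852, below 1.0

With these illustrative choices the estimate lands at about 1.0254, matching the value in the log, while the exact ratio would have satisfied the target_group_size < 1.0 assertion.
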
corneliusroemer commented 2 months ago

Interesting, thanks for the quick investigation!

victorlin commented 2 months ago

Yesterday's scheduled run was successful. I downloaded the relevant log files for the failing run and the flanking successful runs

nextstrain build --aws-batch --attach <batch job id> --download 'logs/subsample_global_2m_north_america_recent.txt' ~/tmp

and confirmed that this is due to the approximation issue: