Subsample has uncaught assertion in ncov

corneliusroemer commented 3 months ago

The ncov run errored last night. It seems to be related to filter/subsample.

Here's the log: https://github.com/nextstrain/ncov/actions/runs/10521511438/job/29152346864#step:5:1

[batch] [2024-08-23T07:36:29+00:00]         Subsample all sequences by 'context_early' scheme for build 'south-america_1m' with the following parameters:
[batch] [2024-08-23T07:36:29+00:00]          - group by: --group-by country year month
[batch] [2024-08-23T07:36:29+00:00]          - sequences per group: 
[batch] [2024-08-23T07:36:29+00:00]          - subsample max sequences: --subsample-max-sequences 160
[batch] [2024-08-23T07:36:29+00:00]          - min-date: 
[batch] [2024-08-23T07:36:29+00:00]          - max-date: --max-date 1M
[batch] [2024-08-23T07:36:29+00:00]          - 
[batch] [2024-08-23T07:36:29+00:00]          - exclude: --exclude-where 'region=South America'
[batch] [2024-08-23T07:36:29+00:00]          - include: 
[batch] [2024-08-23T07:36:29+00:00]          - query: 
[batch] [2024-08-23T07:36:29+00:00]          - priority: 
[batch] [2024-08-23T07:36:29+00:00]         
[batch] [2024-08-23T07:36:29+00:00] Reason: Missing output files: results/south-america_1m/sample-context_early.txt; Input files updated by another job: results/gisaid_21L_metadata.tsv.zst
[batch] [2024-08-23T07:36:29+00:00]         augur filter             --metadata results/gisaid_21L_metadata.tsv.zst             --include nextstrain_profiles/nextstrain-gisaid-21L/include.txt             --exclude defaults/exclude.txt                          --max-date 1M             --exclude-where 'region=South America'                                                                 --group-by country year month                                       --subsample-max-sequences 160                          --output-strains results/south-america_1m/sample-context_early.txt 2>&1 | tee logs/subsample_south-america_1m_context_early.txt
[batch] [2024-08-23T07:36:29+00:00]         
[batch] [2024-08-23T07:36:48+00:00] Sampling at 1 per group.
[batch] [2024-08-23T07:36:55+00:00] WARNING: Asked to provide at most 400 sequences, but there are 406 groups.
[batch] [2024-08-23T07:36:55+00:00] Sampling probabilistically at 1.0254 sequences per group, meaning it is possible to have more than the requested maximum of 400 sequences after filtering.
[batch] [2024-08-23T07:36:55+00:00] Traceback (most recent call last):
[batch] [2024-08-23T07:36:55+00:00]   File "/nextstrain/augur/augur/__init__.py", line 70, in run
[batch] [2024-08-23T07:36:55+00:00]     return args.__command__.run(args)
[batch] [2024-08-23T07:36:55+00:00]   File "/nextstrain/augur/augur/filter/__init__.py", line 135, in run
[batch] [2024-08-23T07:36:55+00:00]     return _run(args)
[batch] [2024-08-23T07:36:55+00:00]   File "/nextstrain/augur/augur/filter/_run.py", line 295, in run
[batch] [2024-08-23T07:36:55+00:00]     group_sizes = get_probabilistic_group_sizes(
[batch] [2024-08-23T07:36:55+00:00]   File "/nextstrain/augur/augur/filter/subsample.py", line 285, in get_probabilistic_group_sizes
[batch] [2024-08-23T07:36:55+00:00]     assert target_group_size < 1.0
[batch] [2024-08-23T07:36:55+00:00] AssertionError
[batch] [2024-08-23T07:36:55+00:00] An error occurred (see above) that has not been properly handled by Augur.
[batch] [2024-08-23T07:36:55+00:00] To report this, please open a new issue including the original command and the error above:
[batch] [2024-08-23T07:36:55+00:00]     <https://github.com/nextstrain/augur/issues/new/choose>
[batch] [2024-08-23T07:36:56+00:00] [Fri Aug 23 07:36:55 2024]
[batch] [2024-08-23T07:36:56+00:00] Error in rule subsample:
[batch] [2024-08-23T07:36:56+00:00]     jobid: 108
[batch] [2024-08-23T07:36:56+00:00]     input: results/gisaid_21L_metadata.tsv.zst, nextstrain_profiles/nextstrain-gisaid-21L/include.txt, nextstrain_profiles/nextstrain-gisaid-21L/include.txt, defaults/exclude.txt
[batch] [2024-08-23T07:36:56+00:00]     output: results/global_2m/sample-north_america_recent.txt
[batch] [2024-08-23T07:36:56+00:00]     log: logs/subsample_global_2m_north_america_recent.txt (check log file(s) for error details)
[batch] [2024-08-23T07:36:56+00:00]     conda-env: /nextstrain/build/.snakemake/conda/ef7f392b0ecf86741cd7c0bee42f4f0e_
[batch] [2024-08-23T07:36:56+00:00]     shell:
[batch] [2024-08-23T07:36:56+00:00]         
[batch] [2024-08-23T07:36:56+00:00]         augur filter             --metadata results/gisaid_21L_metadata.tsv.zst             --include nextstrain_profiles/nextstrain-gisaid-21L/include.txt             --exclude defaults/exclude.txt             --min-date 2M                          --exclude-where 'region!=North America'                                                                 --group-by division week                                       --subsample-max-sequences 400                          --output-strains results/global_2m/sample-north_america_recent.txt 2>&1 | tee logs/subsample_global_2m_north_america_recent.txt
[batch] [2024-08-23T07:36:56+00:00]         
[batch] [2024-08-23T07:36:56+00:00]         (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
[batch] [2024-08-23T07:36:56+00:00] Logfile logs/subsample_global_2m_north_america_recent.txt:
[batch] [2024-08-23T07:36:56+00:00] ================================================================================
[batch] [2024-08-23T07:36:56+00:00] WARNING: Asked to provide at most 400 sequences, but there are 406 groups.
[batch] [2024-08-23T07:36:56+00:00] Sampling probabilistically at 1.0254 sequences per group, meaning it is possible to have more than the requested maximum of 400 sequences after filtering.
[batch] [2024-08-23T07:36:56+00:00] Traceback (most recent call last):
[batch] [2024-08-23T07:36:56+00:00]   File "/nextstrain/augur/augur/__init__.py", line 70, in run
[batch] [2024-08-23T07:36:56+00:00]     return args.__command__.run(args)
[batch] [2024-08-23T07:36:56+00:00]   File "/nextstrain/augur/augur/filter/__init__.py", line 135, in run
[batch] [2024-08-23T07:36:56+00:00]     return _run(args)
[batch] [2024-08-23T07:36:56+00:00]   File "/nextstrain/augur/augur/filter/_run.py", line 295, in run
[batch] [2024-08-23T07:36:56+00:00]     group_sizes = get_probabilistic_group_sizes(
[batch] [2024-08-23T07:36:56+00:00]   File "/nextstrain/augur/augur/filter/subsample.py", line 285, in get_probabilistic_group_sizes
[batch] [2024-08-23T07:36:56+00:00]     assert target_group_size < 1.0
[batch] [2024-08-23T07:36:56+00:00] AssertionError
[batch] [2024-08-23T07:36:56+00:00] An error occurred (see above) that has not been properly handled by Augur.
[batch] [2024-08-23T07:36:56+00:00] To report this, please open a new issue including the original command and the error above:
[batch] [2024-08-23T07:36:56+00:00]     <https://github.com/nextstrain/augur/issues/new/choose>
[batch] [2024-08-23T07:36:56+00:00]

victorlin commented 3 months ago

Given the timing, my initial suspicion was yesterday's release of #1454. That PR moved this code around, and I thought the conditions might have been broken in the change to augur/filter/_run.py.

A deeper inspection shows that it's unrelated, and I believe the timing is just a coincidence. The error is an approximation issue that can be fixed by addressing #1588:

import augur.filter.subsample

400 / 406
# 0.9852 <- "exact" target_group_size is less than 1, which passes the assertion

augur.filter.subsample._calculate_fractional_sequences_per_group(400, [1] * 406)
# 1.0254 <- "approximated" target_group_size is greater than 1, which fails the assertion
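
To illustrate how an approximation can overshoot here, below is a minimal sketch that assumes the value is found by bisection that stops once the upper and lower bounds are within 10% of each other. The function names, the starting bounds, and the 1.1 tolerance are assumptions made for illustration, not a copy of augur's code. Under those assumptions, the midpoint of the final interval can land above the exact ratio:

```python
def total_sequences(per_group: float, group_sizes: list[int]) -> float:
    """Total sequences if each group contributes at most `per_group`."""
    return sum(min(per_group, size) for size in group_sizes)


def approximate_per_group(
    target: int, group_sizes: list[int], tolerance: float = 1.1
) -> float:
    """Hypothetical bisection for sequences per group (assumed parameters)."""
    lo, hi = 1e-5, float(target)
    # Stop as soon as the bounds are within `tolerance` of each other,
    # so the result is only accurate to within that ratio.
    while hi / lo > tolerance:
        mid = (lo + hi) / 2
        if total_sequences(mid, group_sizes) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


groups = [1] * 406  # 406 groups with one sequence each
exact = 400 / len(groups)
approx = approximate_per_group(400, groups)

print(f"exact:  {exact:.4f}")   # 0.9852
print(f"approx: {approx:.4f}")  # 1.0254
```

With these assumed parameters, the sketch gives the same 1.0254 that appears in the log, which is consistent with the tolerance of the search, rather than the inputs, pushing the value over 1.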

corneliusroemer commented 3 months ago

Interesting, thanks for the quick investigation!

victorlin commented 3 months ago

Yesterday's scheduled run was successful. I downloaded the relevant log file for the failing run and for the successful runs on either side of it:

nextstrain build --aws-batch --attach <batch job id> --download 'logs/subsample_global_2m_north_america_recent.txt' ~/tmp

This confirmed that the failure is due to the approximation issue:

✅ 2024-08-20:

WARNING: Asked to provide at most 400 sequences, but there are 412 groups.
Sampling probabilistically at 0.9522 sequences per group, meaning it is possible to have more than the requested maximum of 400 sequences after filtering.

❌ 2024-08-23:

WARNING: Asked to provide at most 400 sequences, but there are 406 groups.
Sampling probabilistically at 1.0254 sequences per group, meaning it is possible to have more than the requested maximum of 400 sequences after filtering.

✅ 2024-08-25:

WARNING: Asked to provide at most 400 sequences, but there are 432 groups.
Sampling probabilistically at 0.9033 sequences per group, meaning it is possible to have more than the requested maximum of 400 sequences after filtering.
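
The three runs above can be compared against the exact ratio with a short sketch. The group counts and logged values are taken from the warnings quoted above; the variable names are illustrative only. It shows that the exact ratio is below 1 in every run, while the logged value crosses 1 only in the failing one:

```python
MAX_SEQUENCES = 400

# (date, number of groups, logged sequences per group, run succeeded)
runs = [
    ("2024-08-20", 412, 0.9522, True),
    ("2024-08-23", 406, 1.0254, False),
    ("2024-08-25", 432, 0.9033, True),
]

for date, n_groups, logged, succeeded in runs:
    exact = MAX_SEQUENCES / n_groups
    # The assertion in the traceback requires target_group_size < 1.0.
    passes_assertion = logged < 1.0
    print(
        f"{date}: exact={exact:.4f} logged={logged:.4f} "
        f"passes_assertion={passes_assertion} succeeded={succeeded}"
    )
```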

nextstrain / augur

Subsample has uncaught assertion in ncov #1598