nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0

filter: Error on sequence duplicates #1613

Closed. victorlin closed this 2 months ago.

victorlin commented 3 months ago

Description of proposed changes

Data is expected to be de-duplicated prior to running augur filter. A similar check is already in place for --metadata.

Reasoning for this change is here: https://github.com/nextstrain/augur/issues/810#issuecomment-2286856280
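The actual check lives in augur filter's source; purely as an illustration of the idea, a duplicate-name check over sequence IDs can be sketched like this (the function name `duplicate_names` is hypothetical, not augur's API):

```python
from collections import Counter

def duplicate_names(names):
    """Return names that appear more than once, preserving first-seen order."""
    counts = Counter(names)
    return [name for name, n in counts.items() if n > 1]

# 'Wuhan/Hu-1/2019' appears twice, so it is reported as a duplicate,
# matching the error message shown later in this thread.
print(duplicate_names(["Wuhan/Hu-1/2019", "USA/CA-1/2020", "Wuhan/Hu-1/2019"]))
```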

Related issue(s)

Closes #1602


codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 71.14%. Comparing base (db54927) to head (da9301c).

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1613      +/-   ##
==========================================
+ Coverage   71.06%   71.14%   +0.07%
==========================================
  Files          79       79
  Lines        8268     8273       +5
  Branches     2010     2011       +1
==========================================
+ Hits         5876     5886      +10
+ Misses       2101     2099       -2
+ Partials      291      288       -3
```


victorlin commented 2 months ago

I will plan to merge this once another breaking change from DEPRECATED.md is also ready to be merged, so they can be released in the same major version.

corneliusroemer commented 1 month ago

Update

I found the reasoning for this change in https://github.com/nextstrain/augur/issues/810#issuecomment-2286856280 (one needs to follow quite a few links to eventually end up there). In the future it might be nice to add this to the PR description; it would have saved me quite some time.

It might have been nice as well to point out in the changelog that, to migrate, one should use augur merge to deduplicate. I feel the breaking change was introduced a little too quickly here. I might have preferred to issue a (deprecation) warning for a while when duplicates are found, which would have let us find out how many workflows would be affected.
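As a generic illustration of what "deduplicate first" means for sequence input (this is not the recommended migration path, which would be augur merge, and a real workflow should use a proper FASTA parser rather than this sketch):

```python
def dedupe_fasta(lines):
    """Keep only the first record for each sequence name in FASTA-formatted lines.

    Illustrative only: keeps the first occurrence of each '>name' record and
    drops later records that reuse the same name.
    """
    seen = set()
    keep = False
    out = []
    for line in lines:
        if line.startswith(">"):
            name = line[1:].split()[0]
            keep = name not in seen
            seen.add(name)
        if keep:
            out.append(line)
    return out

lines = [">Wuhan/Hu-1/2019", "ACGT", ">Wuhan/Hu-1/2019", "ACGA", ">USA/X", "TTTT"]
print(dedupe_fasta(lines))  # the second Wuhan/Hu-1/2019 record is dropped
```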

Original comment

Hmm, I'm just seeing this cause a CI failure in ncov. I'm not sure this is such a great change? It makes filter less usable as an entry point. Why do we need to require deduplication before filter?

The reasoning is given just as:

Data is expected to be de-duplicated prior to running augur filter. A similar check is already in place for --metadata.

Who expects it to be de-duplicated?

Could we potentially allow opting out of the error, to make it easier to upgrade without having to modify workflows? Maybe we should only error if the sequences are different but happen to have the same name? If they are identical, there's no reason to error, really.
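The "only error when same-name sequences actually differ" idea could be sketched by hashing sequence content per name (the function name `conflicting_duplicates` is hypothetical, not augur's API):

```python
import hashlib

def conflicting_duplicates(records):
    """Given (name, sequence) pairs, return names whose duplicates disagree.

    Identical duplicates (same name, same sequence) are tolerated; only
    same-name records with differing content are flagged.
    """
    digests = {}
    conflicts = set()
    for name, seq in records:
        d = hashlib.sha256(seq.encode()).hexdigest()
        if name in digests and digests[name] != d:
            conflicts.add(name)
        digests.setdefault(name, d)  # remember the first-seen digest
    return sorted(conflicts)

records = [("A", "ACGT"), ("A", "ACGT"), ("B", "ACGT"), ("B", "ACGA")]
print(conflicting_duplicates(records))  # only "B" has conflicting duplicates
```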

https://github.com/nextstrain/ncov/actions/runs/11164069511/job/31032539057#step:5:12039

```
[batch] [2024-10-03T16:24:15+00:00] ERROR: The following strains are duplicated in 'results/gisaid_21L_aligned.fasta.zst':
[batch] [2024-10-03T16:24:15+00:00] Wuhan/Hu-1/2019
[batch] [2024-10-03T16:24:15+00:00] ================================================================================
[batch] [2024-10-03T16:24:39+00:00] ERROR: The following strains are duplicated in 'results/gisaid_21L_aligned.fasta.zst':
[batch] [2024-10-03T16:24:39+00:00] Wuhan/Hu-1/2019
[batch] [2024-10-03T16:24:40+00:00] [Thu Oct  3 16:24:39 2024]
[batch] [2024-10-03T16:24:40+00:00] Error in rule combine_samples:
[batch] [2024-10-03T16:24:40+00:00]     jobid: 863
[batch] [2024-10-03T16:24:40+00:00]     input: results/gisaid_21L_aligned.fasta.zst, results/gisaid_21L_metadata.tsv.zst, results/oceania_6m/sample-focal_early.txt, results/oceania_6m/sample-context_early.txt, results/oceania_6m/sample-focal_recent.txt, results/oceania_6m/sample-context_recent.txt
[batch] [2024-10-03T16:24:40+00:00]     output: results/oceania_6m/oceania_6m_subsampled_sequences.fasta.xz, results/oceania_6m/oceania_6m_subsampled_metadata.tsv.xz
[batch] [2024-10-03T16:24:40+00:00]     log: logs/subsample_regions_oceania_6m.txt (check log file(s) for error details)
[batch] [2024-10-03T16:24:40+00:00]     conda-env: /nextstrain/build/.snakemake/conda/ef7f392b0ecf86741cd7c0bee42f4f0e_
[batch] [2024-10-03T16:24:40+00:00]     shell:
[batch] [2024-10-03T16:24:40+00:00]
[batch] [2024-10-03T16:24:40+00:00]         augur filter \
[batch] [2024-10-03T16:24:40+00:00]             --sequences results/gisaid_21L_aligned.fasta.zst \
[batch] [2024-10-03T16:24:40+00:00]             --metadata results/gisaid_21L_metadata.tsv.zst \
[batch] [2024-10-03T16:24:40+00:00]             --exclude-all \
[batch] [2024-10-03T16:24:40+00:00]             --include results/oceania_6m/sample-focal_early.txt results/oceania_6m/sample-context_early.txt results/oceania_6m/sample-focal_recent.txt results/oceania_6m/sample-context_recent.txt \
[batch] [2024-10-03T16:24:40+00:00]             --output-sequences results/oceania_6m/oceania_6m_subsampled_sequences.fasta.xz \
[batch] [2024-10-03T16:24:40+00:00]             --output-metadata results/oceania_6m/oceania_6m_subsampled_metadata.tsv.xz 2>&1 | tee logs/subsample_regions_oceania_6m.txt
[batch] [2024-10-03T16:24:40+00:00]
[batch] [2024-10-03T16:24:40+00:00]         (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
[batch] [2024-10-03T16:24:40+00:00] Logfile logs/subsample_regions_oceania_6m.txt:
[batch] [2024-10-03T16:24:40+00:00] ================================================================================
[batch] [2024-10-03T16:24:40+00:00] ERROR: The following strains are duplicated in 'results/gisaid_21L_aligned.fasta.zst':
[batch] [2024-10-03T16:24:40+00:00] Wuhan/Hu-1/2019
```
victorlin commented 1 month ago

@corneliusroemer thanks for your thoughts here and apologies for making an abrupt and unclear change. This is a good case of a developer (me) being out of touch with actual usage. If I were to do this again, I would treat the previous behavior (allowing sequence duplicates) as a "feature" with deprecation warning instead of a bug that needed to be fixed immediately.

Why do we need to require deduplication before filter?

The reasoning is clearer for metadata input: it is indexed on the ID column, which must be unique. It's less clear for sequences, but consistency between metadata and sequence output would be my reasoning.
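To illustrate why a unique ID index matters for metadata (a minimal sketch assuming pandas, which augur uses for metadata handling; the column values here are made up):

```python
import pandas as pd

# Metadata indexed on the ID column: duplicate IDs make lookups ambiguous.
metadata = pd.DataFrame(
    {"strain": ["A/1", "A/1", "B/2"], "region": ["asia", "europe", "oceania"]}
).set_index("strain")

# With a duplicated index, a single-ID lookup returns two rows instead of one,
# so "the metadata for strain A/1" is no longer well defined.
print(metadata.loc["A/1"])
print(metadata.index.is_unique)  # False
```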

Could we potentially allow opt-out from throwing an error to make it easier to upgrade without having to modify workflows?

It's doable, though it would be specific to sequences: something along the lines of --allow-duplicate-sequences. If the sole purpose is backwards compatibility, we could introduce it temporarily with a deprecation warning?
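The proposed escape hatch could look something like this (note: --allow-duplicate-sequences is a hypothetical flag from this discussion, not part of augur filter's actual CLI):

```python
import argparse
import warnings

# Hypothetical opt-out flag sketched for discussion purposes only.
parser = argparse.ArgumentParser(prog="augur filter")
parser.add_argument(
    "--allow-duplicate-sequences",
    action="store_true",
    help="Tolerate duplicate sequence names (deprecated escape hatch).",
)

args = parser.parse_args(["--allow-duplicate-sequences"])
if args.allow_duplicate_sequences:
    warnings.warn(
        "--allow-duplicate-sequences is a temporary escape hatch for backwards "
        "compatibility and will be removed in a future major release.",
        DeprecationWarning,
    )
```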