Closed victorlin closed 2 months ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 71.14%. Comparing base (
db54927
) to head (da9301c
).
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
I will plan to merge this once another breaking change from DEPRECATED.md is also ready to be merged, so they can be released in the same major version.
I found the reasoning for this change in https://github.com/nextstrain/augur/issues/810#issuecomment-2286856280 (one needs to follow quite a few links to eventually end up there), in the future might be nice to add this to the PR description, it would have saved me quite some time.
It might have been nice as well to point out in the changelog that to migrate one should use augur merge
to deduplicate. I feel the breaking change was maybe introduced a little to quickly here. I might have preferred to issue a (deprecation) warning for a while when there are duplicates found, which would have allowed us to find out how many workflows would be affected.
Hmm, just seeing this result in CI failure in ncov - not sure this is such a great change? It makes filter less usable as an entry point. Why do we need to require deduplication before filter?
The reasoning is given just as:
Data is expected to be de-duplicated prior to running augur filter. A similar check is already in place for --metadata.
Who expects it to be de-duplicated?
Could we potentially allow opt-out from throwing an error to make it easier to upgrade without having to modify workflows? Maybe we should only error if the sequences are different but happen to have the same name? If they are identical there's no reason to error, really?
https://github.com/nextstrain/ncov/actions/runs/11164069511/job/31032539057#step:5:12039
[batch] [2024-10-03T16:24:15+00:00] ERROR: The following strains are duplicated in 'results/gisaid_21L_aligned.fasta.zst':
[batch] [2024-10-03T16:24:15+00:00] Wuhan/Hu-1/2019
[batch] [2024-10-03T16:24:15+00:00] ================================================================================
[batch] [2024-10-03T16:24:39+00:00] ERROR: The following strains are duplicated in 'results/gisaid_21L_aligned.fasta.zst':
[batch] [2024-10-03T16:24:39+00:00] Wuhan/Hu-1/2019
[batch] [2024-10-03T16:24:40+00:00] [Thu Oct 3 16:24:39 2024]
[batch] [2024-10-03T16:24:40+00:00] Error in rule combine_samples:
[batch] [2024-10-03T16:24:40+00:00] jobid: 863
[batch] [2024-10-03T16:24:40+00:00] input: results/gisaid_21L_aligned.fasta.zst, results/gisaid_21L_metadata.tsv.zst, results/oceania_6m/sample-focal_early.txt, results/oceania_6m/sample-context_early.txt, results/oceania_6m/sample-focal_recent.txt, results/oceania_6m/sample-context_recent.txt
[batch] [2024-10-03T16:24:40+00:00] output: results/oceania_6m/oceania_6m_subsampled_sequences.fasta.xz, results/oceania_6m/oceania_6m_subsampled_metadata.tsv.xz
[batch] [2024-10-03T16:24:40+00:00] log: logs/subsample_regions_oceania_6m.txt (check log file(s) for error details)
[batch] [2024-10-03T16:24:40+00:00] conda-env: /nextstrain/build/.snakemake/conda/ef7f392b0ecf86741cd7c0bee42f4f0e_
[batch] [2024-10-03T16:24:40+00:00] shell:
[batch] [2024-10-03T16:24:40+00:00]
[batch] [2024-10-03T16:24:40+00:00] augur filter \
[batch] [2024-10-03T16:24:40+00:00] --sequences results/gisaid_21L_aligned.fasta.zst \
[batch] [2024-10-03T16:24:40+00:00] --metadata results/gisaid_21L_metadata.tsv.zst \
[batch] [2024-10-03T16:24:40+00:00] --exclude-all \
[batch] [2024-10-03T16:24:40+00:00] --include results/oceania_6m/sample-focal_early.txt results/oceania_6m/sample-context_early.txt results/oceania_6m/sample-focal_recent.txt results/oceania_6m/sample-context_recent.txt \
[batch] [2024-10-03T16:24:40+00:00] --output-sequences results/oceania_6m/oceania_6m_subsampled_sequences.fasta.xz \
[batch] [2024-10-03T16:24:40+00:00] --output-metadata results/oceania_6m/oceania_6m_subsampled_metadata.tsv.xz 2>&1 | tee logs/subsample_regions_oceania_6m.txt
[batch] [2024-10-03T16:24:40+00:00]
[batch] [2024-10-03T16:24:40+00:00] (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
[batch] [2024-10-03T16:24:40+00:00] Logfile logs/subsample_regions_oceania_6m.txt:
[batch] [2024-10-03T16:24:40+00:00] ================================================================================
[batch] [2024-10-03T16:24:40+00:00] ERROR: The following strains are duplicated in 'results/gisaid_21L_aligned.fasta.zst':
[batch] [2024-10-03T16:24:40+00:00] Wuhan/Hu-1/2019
@corneliusroemer thanks for your thoughts here and apologies for making an abrupt and unclear change. This is a good case of a developer (me) being out of touch with actual usage. If I were to do this again, I would treat the previous behavior (allowing sequence duplicates) as a "feature" with deprecation warning instead of a bug that needed to be fixed immediately.
Why do we need to require deduplication before filter?
The reasoning is more clear for metadata input: it is indexed on the ID column which must be unique. It's less clear for sequences, but consistency between metadata and sequence output would be my reasoning.
Could we potentially allow opt-out from throwing an error to make it easier to upgrade without having to modify workflows?
It's do-able though it would be specific to sequences, something along the lines of --allow-duplicate-sequences
. If the sole purpose is for backwards compatibility, we could introduce it temporarily with a deprecation warning?
Description of proposed changes
Data is expected to be de-duplicated prior to running augur filter. A similar check is already in place for
--metadata
.Reasoning for this change is here: https://github.com/nextstrain/augur/issues/810#issuecomment-2286856280
Related issue(s)
Closes #1602
Checklist