nextstrain / mpox

Nextstrain build for mpox virus
https://nextstrain.org/mpox
MIT License

ingest: deduplicate sequences using strain names #33

Open joverlee521 opened 2 years ago

joverlee521 commented 2 years ago

Context

Once we've completed #32, we can use strain names to deduplicate sequences. This is necessary in case different groups sequence the same virus or if sequences are generated with different protocols. (NOTE: This is separate from the versioning in GenBank; we already pull in the latest version of GenBank sequences.)

Description

The duplicate sequences should probably be filtered out in a new script (e.g. ingest/bin/deduplicate-records), or potentially with the augur deduplicate command (see https://github.com/nextstrain/augur/issues/919).

We probably want to keep a file with all sequences in case people want the duplicate sequences for any reason. The deduplicated files will be the main ones used for LAPIS and/or our monkeypox builds.
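
As a rough illustration of the idea, here is a minimal Python sketch of deduplicating records by strain name while leaving the full record list intact. Everything in it is an assumption rather than the actual ingest code: the function name `deduplicate_records`, the `strain` and `accession` field names, and the "keep the first record seen" rule are all hypothetical, and the real script may well want a different rule for picking which duplicate to keep.

```python
from typing import Iterable, Iterator


def deduplicate_records(records: Iterable[dict], key: str = "strain") -> Iterator[dict]:
    """Yield the first record seen for each value of `key`.

    Records with a missing or empty key are passed through unchanged,
    because there is nothing to match them against.
    """
    seen = set()
    for record in records:
        name = record.get(key)
        if not name:
            # Cannot deduplicate without a strain name; keep the record.
            yield record
            continue
        if name in seen:
            # A record with this strain name was already emitted; skip it.
            continue
        seen.add(name)
        yield record


# Hypothetical input: field names and the last record are made up for illustration.
all_records = [
    {"accession": "MT903340", "strain": "MPXV-M5312_HM12_Rivers"},
    {"accession": "NC_063383", "strain": "MPXV-M5312_HM12_Rivers"},
    {"accession": "EXAMPLE_1", "strain": "hypothetical_strain"},
]

# `all_records` is left untouched, so it can still be written out in full.
deduplicated = list(deduplicate_records(all_records))

for record in deduplicated:
    print(record["accession"], record["strain"])
```

Because the function only reads from its input, the same list can be written once in full and once deduplicated, which matches the plan of keeping a file with all sequences alongside the deduplicated ones.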

jameshadfield commented 2 years ago

Update: We currently have a duplicate in the hMPX build (MPXV-M5312_HM12_Rivers from accessions MT903340 and NC_063383). It's not a huge problem, as this strain is not part of the current outbreak.