nextstrain / dengue

Nextstrain build for dengue virus
https://nextstrain.org/dengue

Exclude circular synthetic (or chimeric) sequences #29

Closed by j23414 4 months ago

j23414 commented 5 months ago

Description of proposed changes

For now, exclude the circular synthetic sequences flagged in https://github.com/nextstrain/dengue/issues/28 from the phylogenetic build.

Alternatively, we can attempt to trim the plasmid from the ends of the sequences if that proves feasible.

I ran a quick check of the other records in phylogenetic/data/metadata_all.tsv for sequences longer than 15,000 nt and did not find any. Please flag any records I may have missed.
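A length check like the one described above could be sketched as follows. This is a minimal illustration, not the command that was actually run: the column names `accession` and `length` are assumptions about the layout of metadata_all.tsv, and the toy rows are invented.

```python
import csv
import io

MAX_LENGTH = 15000  # threshold from the description above, in nucleotides


def find_long_records(tsv_handle, id_column="accession",
                      length_column="length", max_length=MAX_LENGTH):
    """Return (id, length) pairs for records longer than max_length.

    The column names are hypothetical; adjust them to the real metadata header.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    flagged = []
    for row in reader:
        raw = (row.get(length_column) or "").strip()
        if not raw.isdigit():
            continue  # skip rows with a missing or non-numeric length
        length = int(raw)
        if length > max_length:
            flagged.append((row[id_column], length))
    return flagged


# Invented toy data standing in for metadata_all.tsv
example = "accession\tlength\nAAA001\t10723\nBBB002\t15159\nCCC003\t\n"
print(find_long_records(io.StringIO(example)))
```

To run it against the real file, open phylogenetic/data/metadata_all.tsv and pass the file handle in place of the `io.StringIO` object.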

j23414 commented 5 months ago

re: 4ac53189d785264153dea458be46830e32353eac

Duplicates (identical sequences that may or may not come from distinct samples) were highlighted in the following comment: https://github.com/nextstrain/dengue/issues/28#issuecomment-1951297740. Additional discussion can be found in the thread starting here: https://github.com/nextstrain/dengue/issues/28#issuecomment-1955359193.

Note that some of the excluded duplicates are patent (PAT) records for vaccine candidates, so they are omitted from the phylogenetic analysis of current dengue diversity.

When a set of duplicates included a reference sequence (identified by the NC_ prefix), the reference was retained and the other duplicates were excluded.

In cases where multiple VRL records share the same nucleotide sequence, the first record in alphabetical order was retained and the others were excluded.

The rationale for each exclusion is documented in the corresponding comment in the exclude.txt file.
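The selection rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure used to build exclude.txt (the exclusions in this PR were curated by hand from the linked discussion). It assumes that "alphabetical order" means sorting by accession, that PAT records are kept only when no other record shares the sequence, and that each exclusion is written as an accession followed by a `#` comment. The accessions and sequences are invented.

```python
from collections import defaultdict


def choose_exclusions(records):
    """Pick which duplicate records to exclude.

    records: iterable of (accession, division, sequence) tuples, where
    division is a GenBank division code such as "VRL" or "PAT".
    Returns a list of (accession, reason) pairs.
    """
    groups = defaultdict(list)
    for accession, division, sequence in records:
        groups[sequence.upper()].append((accession, division))

    exclusions = []
    for members in groups.values():
        if len(members) < 2:
            continue  # not a duplicate set
        refs = sorted(a for a, _ in members if a.startswith("NC_"))
        non_pat = sorted(a for a, d in members if d != "PAT")
        if refs:
            keep = refs[0]      # prefer the reference sequence
        elif non_pat:
            keep = non_pat[0]   # assumption: first accession alphabetically
        else:
            keep = sorted(a for a, _ in members)[0]
        for accession, division in sorted(members):
            if accession == keep:
                continue
            label = "patent record (PAT)" if division == "PAT" else "duplicate"
            exclusions.append((accession, f"{label}, identical to {keep}"))
    return exclusions


# Invented records for illustration only
records = [
    ("NC_000001", "VRL", "ACGT"),
    ("AB000002", "VRL", "ACGT"),
    ("AB000001", "PAT", "ACGT"),
    ("CD000002", "VRL", "GGCC"),
    ("CD000001", "VRL", "GGCC"),
    ("EF000001", "VRL", "TTAA"),
]

for accession, reason in choose_exclusions(records):
    print(f"{accession} # {reason}")
```

Keeping the reason next to each accession mirrors the practice of recording the rationale as a comment beside each entry.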

Future work: deduplication guidelines can be established in this issue: https://github.com/nextstrain/dengue/issues/30