robert-koch-institut / SARS-CoV-2-Sequenzdaten_aus_Deutschland

A central component of successful pathogen surveillance is understanding how a pathogen spreads and what its pathogenic properties are. Knowledge of the pathogen genome is an important source of information here. For example, detecting mutations in a pathogen's genome makes it possible to reconstruct relationships between...
https://robert-koch-institut.github.io/SARS-CoV-2-Sequenzdaten_aus_Deutschland/
Creative Commons Attribution 4.0 International
67 stars 7 forks

Sequences from BY-LGL are all frame shifted due to `-` being `N` in S:69/70del #33

Open corneliusroemer opened 1 year ago

corneliusroemer commented 1 year ago

Reviewing latest submissions on GISAID I noticed that a batch of sequences from BY-LGL seems to have a bioinformatics error:

Almost all sequences are frame shifted in S1 after S:70 because they contain 5 deleted nucleotides and 1 N nucleotide instead of 6 deleted nucleotides.

The single N is the problem: it should be a deletion so that there is no frameshift. Thanks!
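The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration, and `shifts_frame` is a hypothetical helper name, not part of any tool mentioned in this thread: a net length change preserves the reading frame only if it is a multiple of 3.

```python
def shifts_frame(deleted_nt: int, inserted_nt: int = 0) -> bool:
    """Return True if the net length change is not a multiple of 3."""
    return (inserted_nt - deleted_nt) % 3 != 0


# Correct call for S:69/70del: 6 nucleotides removed, frame preserved.
print(shifts_frame(6))  # False

# Submitted consensus: only 5 nucleotides removed, the sixth kept as 'N'.
print(shifts_frame(5))  # True
```

Because the `N` still occupies a position in the sequence, everything downstream of it is read one nucleotide out of frame.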

See screenshots:

[image] [image]

MarieLataretu commented 1 year ago

Reminds me of this issue: https://github.com/nanoporetech/medaka/issues/351, and the BAM looks very similar. [image]

icestorm972 commented 1 year ago

As far as I remember, it's not only at S:del69/70 but occurred for other deletions, too (but 69/70 is the most prominent, common and problematic one).

When analyzed by nextclade, the problematic sequences result in (3×n-1) dashes and an extra N.

(@corneliusroemer, I wrote you a question about that on Twitter in July, asking whether that's normal.)

P.S. If it's still the same as 2 months ago: when you look at the unaligned/original FASTA sequence, it seems that instead of a single '-' at deletions to indicate a gap of undetermined length, the erroneous sequence submissions always have an 'N'.
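A rough way to screen aligned sequences for the "(3×n-1) dashes and an extra N" pattern is sketched below. It assumes the alignment marks deleted positions with `-`, and it accepts the `N` on either side of the gap run, since the thread does not say which side it lands on. `suspicious_deletions` and the toy sequences are made up for illustration and are not real Spike sequence.

```python
import re


def suspicious_deletions(aligned_seq: str) -> list[tuple[int, int]]:
    """Return (start, gap_length) for gap runs of length 3n-1 that touch an 'N'."""
    hits = []
    for match in re.finditer(r"-+", aligned_seq):
        start, end = match.start(), match.end()
        gap_len = end - start
        # Characters directly flanking the gap run (empty string at the ends).
        before = aligned_seq[start - 1].upper() if start > 0 else ""
        after = aligned_seq[end].upper() if end < len(aligned_seq) else ""
        # 3n-1 gaps leave a remainder of 2 when divided by 3.
        if gap_len % 3 == 2 and "N" in (before, after):
            hits.append((start, gap_len))
    return hits


# Toy example: 5 gaps preceded by an 'N' (the erroneous pattern).
print(suspicious_deletions("GCTATN-----TCTGGG"))  # [(6, 5)]

# Toy example: a clean 6-nucleotide deletion is not flagged.
print(suspicious_deletions("GCTAT------TCTGGG"))  # []
```

A hit only means the gap looks like an in-frame deletion with one base masked instead of deleted; the read data would still need to be checked by hand.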

MarieLataretu commented 1 year ago

Back in January, it was also S:69/70

> When you look at the unaligned/original FASTA sequence, it seems that instead of a single '-' at deletions to indicate a gap of undetermined length, the erroneous sequence submissions always have an 'N'

medaka variant calls a 5 nt deletion instead of a 6 nt deletion. If there is an 'N' in place of the missing deleted base, it means that the position is masked afterwards due to low coverage 🤔

corneliusroemer commented 1 year ago

Thanks for pointing to the medaka issue @MarieLataretu

So I guess the lab is neither using the suggested workaround ("solved with sup (super-acc) basecalling and respective medaka model") nor fixing the issue manually. Such a frame shift in S is totally unviable.

If you drop the sequences in nextclade.org you will see the issue immediately. Weird that GISAID allowed these frame shifted sequences through - I thought they check for frameshifts.

I see - I probably shouldn't have opened this issue here as the submission didn't go through RKI? Or am I wrong?

MarieLataretu commented 1 year ago

I didn't have time to look at the frame-shifted sequences (and their metadata) in DESH.

There is only one sample sequenced at RKI with this frame shift (at least since the last frame shift wave at the beginning of 2022). However, we use the sup model and the frame shift still appears.

A workaround is to use the nanopolish mode instead of medaka in the ARTIC workflow.