Closed corneliusroemer closed 2 months ago
Reminds me of this issue https://github.com/nanoporetech/medaka/issues/351, and the BAM looks very similar
As far as I remember, it's not only at S:del69/70 but occurred for other deletions, too (but 69/70 is the most prominent, common and problematic one).
When analyzed by nextclade, the problematic sequences result in (3×n-1) dashes and an extra N.
(@corneliusroemer I asked you about that on Twitter in July, whether that's normal.)
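The pattern described above (a gap run of 3×n-1 dashes with an N right next to it) can be searched for in an aligned sequence. The following is only a rough sketch: the function name `suspicious_gap_runs` and the toy sequence are made up for illustration, and it assumes the input is one aligned sequence string as exported from an alignment, with `-` for gaps.

```python
import re

def suspicious_gap_runs(aligned_seq):
    """Return (start, gap_length) for gap runs of length 3n-1 that touch an N."""
    hits = []
    for m in re.finditer(r"-+", aligned_seq):
        length = m.end() - m.start()
        # Characters directly before and after the gap run ("" at the sequence ends).
        before = aligned_seq[m.start() - 1] if m.start() > 0 else ""
        after = aligned_seq[m.end()] if m.end() < len(aligned_seq) else ""
        touches_n = "N" in (before.upper(), after.upper())
        # length % 3 == 2 is the same as length == 3n-1.
        if length % 3 == 2 and touches_n:
            hits.append((m.start(), length))
    return hits

# Hypothetical toy sequence, not real SARS-CoV-2 data: 5 dashes followed by an N.
example = "ATGCT-----NGGTACC"
print(suspicious_gap_runs(example))
```

A hit only says that the sequence looks like the cases discussed here; whether the N really is a mis-called deleted base still has to be checked against the reads.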
P.S. If it's still the same as 2 months ago: When you look at the unaligned/original FASTA sequence, it seems that instead of a single '-' at deletions to indicate a gap of undetermined length, the erroneous sequence submissions always have an 'N'.
Back in January, it was also S:69/70
When you look at the unaligned/original FASTA sequence, it seems that instead of a single '-' at deletions to indicate a gap of undetermined length, the erroneous sequence submissions always have an 'N'.
medaka variant calls a 5 nt deletion instead of a 6 nt one. If there is an 'N' in place of the missing deleted base, it means that the position was masked afterwards due to low coverage 🤔
Thanks for pointing to the medaka issue @MarieLataretu
So I guess the lab is neither using the suggested workaround ("solved with sup (super-acc) basecalling and respective medaka model") nor fixing the issue manually. Such a frameshift in S is totally unviable.
If you drop the sequences into nextclade.org, you will see the issue immediately. Weird that GISAID allowed these frameshifted sequences through - I thought they check for frameshifts.
I see - I probably shouldn't have opened this issue here as the submission didn't go through RKI? Or am I wrong?
I haven't had time to look at the frameshifted sequences (and their metadata) in DESH.
There is only one sample sequenced at RKI with this frameshift (at least since the last frameshift wave at the beginning of 2022). However, we use the sup model and the frameshift still appears.
A workaround is to use the nanopolish mode instead of medaka in the ARTIC workflow.
Reviewing the latest submissions on GISAID, I noticed that a batch of sequences from BY-LGL seems to have a bioinformatics error:
Almost all sequences are frameshifted in S1 after S:70 because there are 5 deleted nucleotides and 1 N nucleotide instead of 6 deleted nucleotides.
The single N is the problem: it should be a deletion so that there is no frameshift. Thanks!
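To illustrate what "it should be a deletion" means on an aligned sequence, here is a minimal sketch that turns an N directly adjacent to a 3×n-1 gap run into a gap, so the deletion length becomes a multiple of 3. The function name `n_to_gap` and the toy sequence are hypothetical, and the sketch assumes the N really is a mis-called deleted base, which cannot be verified from the consensus alone.

```python
import re

def n_to_gap(aligned_seq):
    """Replace an N that directly follows or precedes a 3n-1 gap run with '-'."""
    seq = list(aligned_seq)
    for m in re.finditer(r"-+", aligned_seq):
        # Only touch gap runs that are one base short of a whole number of codons.
        if (m.end() - m.start()) % 3 != 2:
            continue
        if m.end() < len(seq) and seq[m.end()] in "Nn":
            seq[m.end()] = "-"          # N right after the gap run
        elif m.start() > 0 and seq[m.start() - 1] in "Nn":
            seq[m.start() - 1] = "-"    # N right before the gap run
    return "".join(seq)

# Hypothetical toy sequence: 5 deleted nucleotides plus 1 N becomes a 6 nt deletion.
print(n_to_gap("ATGCT-----NGGTACC"))
```

This only patches the symptom in an alignment; the cleaner fix is upstream in the consensus calling, as discussed in the comments.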
See screenshots: