replikation / poreCov

SARS-CoV-2 workflow for nanopore sequence data
https://case-group.github.io/
GNU General Public License v3.0

Frameshift detection #199

Closed hoelzer closed 2 years ago

hoelzer commented 2 years ago

@replikation following up on the discussion on BC: I think it's reasonable to print a terminal warning that Omicron sequences may be missing many Spike mutations, and to suggest sup basecalling and/or switching to Nanopolish as options that might fix that.

Alternatively, we could implement a general check for frameshifts (FS). For example, this could be done via https://gitlab.com/s.fuchs/covsonar in two easy steps:

All reconstructed consensus sequences can be added to a covSonar database:

sonar.py add -f genomes.fasta --db mydb --cpus 8

Then, we can query this database via

sonar.py match --db mydb --only_frameshifts | awk 'BEGIN{FS=","};{print $1}' | grep -v accession > ids-frameshift.txt

which will give back all sequence IDs that have a frameshift.
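If the IDs are needed inside a Python report script rather than in a text file, the awk/grep part can be mirrored with the standard `csv` module. This is only a minimal sketch: it assumes the `sonar.py match` output is comma-separated with a header row whose first field is `accession` (which is what the pipeline above implies), and the function name `frameshift_ids` is made up for illustration.

```python
import csv
import io


def frameshift_ids(csv_text):
    """Return the first-column values of a CSV, skipping the header row.

    Assumption: the header row starts with the literal field 'accession'.
    """
    ids = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # ignore empty lines
        if row[0] == "accession":
            continue  # skip the header
        ids.append(row[0])
    return ids


if __name__ == "__main__":
    # Toy input with invented IDs, not real covSonar output.
    example = "accession,description\nsample_01,foo\nsample_02,bar\n"
    print(frameshift_ids(example))
```

Unlike `grep -v accession`, this only drops rows whose first field is exactly `accession`, so a sequence ID that happens to contain that word would be kept.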

Now, we could additionally mark them in the report and/or print a message that one should be aware of this and maybe try basecalling with a higher-accuracy model or switching to Nanopolish/Medaka, or at least check whether the frameshift is real.

This would also help people with subsequent analyses and, e.g., GISAID upload.
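A rough sketch of what such a terminal message could look like, given a list of affected sequence IDs. The function name `frameshift_warning` and the exact wording are hypothetical and not taken from poreCov.

```python
def frameshift_warning(ids):
    """Build a terminal warning for sequences with frameshifts.

    Returns an empty string when no sequence is affected.
    """
    if not ids:
        return ""
    lines = [f"WARNING: {len(ids)} consensus sequence(s) contain a frameshift:"]
    lines += [f"  - {seq_id}" for seq_id in ids]
    lines.append(
        "Consider basecalling with a higher-accuracy model, switching to "
        "Nanopolish/Medaka, or checking whether the frameshift is real."
    )
    return "\n".join(lines)


if __name__ == "__main__":
    # Invented IDs for illustration only.
    print(frameshift_warning(["sample_01", "sample_02"]))
```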

What do you think?

MarieLataretu commented 2 years ago

(nextclade also reports the frameshifts in the frameShifts column)

hoelzer commented 2 years ago

> (nextclade also reports the frameshifts in the frameShifts column)

ah true - that we already have. ;) would be even easier than introducing another tool.

MarieLataretu commented 2 years ago

For the report I see two options:

MarieLataretu commented 2 years ago

I think an additional column is the way to go. @RaverJay, can you add the frameShifts column to the report if you all agree, @replikation @hoelzer?

It's at the amino acid level, but good enough for now.
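For reference, pulling that column out of the nextclade results could look roughly like this. It is a sketch only: it assumes the tab-separated nextclade output with `seqName` and `frameShifts` columns and comma-separated entries inside `frameShifts`. The function name `frameshifts_by_sample` and the example values are invented, and this is not how the poreCov report is actually built.

```python
import csv
import io


def frameshifts_by_sample(tsv_text):
    """Map sequence name -> list of frameshift entries.

    Only sequences with a non-empty 'frameShifts' field are included.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    result = {}
    for row in reader:
        value = (row.get("frameShifts") or "").strip()
        if value:
            # Assumption: multiple entries are separated by commas.
            result[row["seqName"]] = value.split(",")
    return result


if __name__ == "__main__":
    # Toy table with invented values, not real nextclade output.
    example = (
        "seqName\tclade\tframeShifts\n"
        "sample_01\t21K\tORF1a:10-20\n"
        "sample_02\t21K\t\n"
    )
    print(frameshifts_by_sample(example))
```

Sequences missing from the returned dict have an empty `frameShifts` field, so the report column could simply stay blank for them.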