replikation / poreCov

SARS-CoV-2 workflow for nanopore sequence data
https://case-group.github.io/
GNU General Public License v3.0
40 stars 17 forks source link

Report Insertions as well #82

Closed hoelzer closed 2 years ago

hoelzer commented 3 years ago

It seems Nextclade also provides information on insertions and not only deletions:

> colnames(x)
 [1] "seqName"
 [2] "clade"
 [3] "qc.overallScore"
 [4] "qc.overallStatus"
 [5] "totalGaps"
 [6] "totalInsertions"
 [7] "totalMissing"
 [8] "totalMutations"
 [9] "totalNonACGTNs"
[10] "totalPcrPrimerChanges"
[11] "substitutions"
[12] "deletions"
[13] "insertions"
[14] "missing"
[15] "nonACGTNs"
[16] "pcrPrimerChanges"
[17] "aaSubstitutions"
[18] "totalAminoacidSubstitutions"
[19] "aaDeletions"
[20] "totalAminoacidDeletions"
[21] "alignmentEnd"
[22] "alignmentScore"
[23] "alignmentStart"
[24] "qc.missingData.missingDataThreshold"
[25] "qc.missingData.score"
[26] "qc.missingData.status"
[27] "qc.missingData.totalMissing"
[28] "qc.mixedSites.mixedSitesThreshold"
[29] "qc.mixedSites.score"
[30] "qc.mixedSites.status"
[31] "qc.mixedSites.totalMixedSites"
[32] "qc.privateMutations.cutoff"
[33] "qc.privateMutations.excess"
[34] "qc.privateMutations.score"
[35] "qc.privateMutations.status"
[36] "qc.privateMutations.total"
[37] "qc.snpClusters.clusteredSNPs"
[38] "qc.snpClusters.score"
[39] "qc.snpClusters.status"
[40] "qc.snpClusters.totalSNPs"
[41] "errors"

see [13]

Could be worth to integrate this as well.

RaverJay commented 3 years ago

Well we are reporting amino acid changes, so these columns: "aaSubstitutions" "aaDeletions"

but there is no "aaInsertions" at this time.

Also the "insertions" column is empty in all the runs I just checked

hoelzer commented 3 years ago

Yes, I think insertions are rare. Let me check if I can find an example Nextclade output w/ insertions.

hoelzer commented 3 years ago

Okay, so it seems insertions are only reported on nucleotide level. E.g. I just checked some sequences and selected some different outputs:

totalInsertions insertions
17      21534:A,27373:CTTTCGATCTCT,27380:C,27384:CTC
16      27373:CTTTCGATCTCT,27380:C,27384:CTC
10      9627-9631,20528-20531,22492,28967-28969
28      1569:C,1574:AGAGCTAG,27373:CTTTCGATCTCT,27380:C,27384:CTC,28250:CTG
many    matches"""
248     7010-7049,9543-9561,14331-14515,29760-29767
1       28250:G
17      39,19286,29602,29605,29631,29706,29752,29772,29779,29785,29792-29794,29797,29806,29866,29868-29870

It seems the output format can vary a lot. So I am not sure if it is so meaningful to add this. Maybe just information if there are insertions or not (totalInsertions)?

replikation commented 3 years ago
hoelzer commented 3 years ago

I think it's difficult to include in the final report at least. There can be various insertions also based on seq errors, used protocols, ...

The question would be also how the newest Nextstrain output looks like in this context?

My last idea was to at least add a column for the totalInsertions so that one can easily spot weirdos for further checks

hoelzer commented 2 years ago

@replikation @RaverJay actually this is getting now important with Omicron, especially because of an larger insertion in the Spike:

S:R214REPE

Currently, this is just not shown in the report which can lead to misleading results.

Not sure what's best to solve this, I think currently we could still only extract such information from Nextclade.

Another way would be to use our tool covsonar to basically call all substitutions, insertions and deletions with one tool and include this into the report... https://gitlab.com/s.fuchs/covsonar

To do so, we would need a process that generates the covsonar database for all genomes and then we could easily extract all information.

RaverJay commented 2 years ago

Yeah this is bad

Problem is, Nextclade output is still missing a 'aaInsertions' column (there is only: substitutions deletions insertions aaSubstitutions aaDeletions)

We could of course calculate it from 'insertions' though

@replikation what do you think, add or switch to covsonar?

hoelzer commented 2 years ago

Ahh, we would then need a converter from nt insertions reported by Nextclade to aa insertions. Something like

https://codon2nucleotide.theo.io/

CovSonar might solve that but here the only weak point is that the tool is also under development still (currently people are on a CovSonar2 version) so there might be certain changes. And it's not tested so extensively like Nextclade, etc...

RaverJay commented 2 years ago

Nextclade might take too long to add this: https://github.com/nextstrain/nextclade/issues/319

Maybe we should implement it ourselves at this time - just 'borrow' the code from codon2nucleotide and translate to AAs

hoelzer commented 2 years ago

Yes, seems fair to me. Then we still rely on Nextclade that anyway runs and just translate the nt insertions for the report. I thin having the code from https://github.com/theosanderson/codon2nucleotide as a small script in bin should do the trick?

RaverJay commented 2 years ago

It's a little bit more complicated than that, see PR

corneliusroemer commented 2 years ago

In the upcoming release (today or tomorrow) Nextclade/align will output aa insertions. Maybe this is still relevant?

replikation commented 2 years ago

@corneliusroemer thanks for informing us