Closed hoelzer closed 2 years ago
Well we are reporting amino acid changes, so these columns: "aaSubstitutions" "aaDeletions"
but there is no "aaInsertions" at this time.
Also the "insertions" column is empty in all the runs I just checked
Yes, I think insertions are rare. Let me check if I can find an example Nextclade output w/ insertions.
Okay, so it seems insertions are only reported on nucleotide level. E.g. I just checked some sequences and selected some different outputs:
totalInsertions insertions
17 21534:A,27373:CTTTCGATCTCT,27380:C,27384:CTC
16 27373:CTTTCGATCTCT,27380:C,27384:CTC
10 9627-9631,20528-20531,22492,28967-28969
28 1569:C,1574:AGAGCTAG,27373:CTTTCGATCTCT,27380:C,27384:CTC,28250:CTG
many matches"""
248 7010-7049,9543-9561,14331-14515,29760-29767
1 28250:G
17 39,19286,29602,29605,29631,29706,29752,29772,29779,29785,29792-29794,29797,29806,29866,29868-29870
It seems the output format can vary a lot. So I am not sure if it is so meaningful to add this. Maybe just information if there are insertions or not (totalInsertions
)?
I think it's difficult to include in the final report at least. There can be various insertions also based on seq errors, used protocols, ...
The question would be also how the newest Nextstrain output looks like in this context?
My last idea was to at least add a column for the totalInsertions
so that one can easily spot weirdos for further checks
@replikation @RaverJay actually this is getting now important with Omicron, especially because of an larger insertion in the Spike:
S:R214REPE
Currently, this is just not shown in the report which can lead to misleading results.
Not sure what's best to solve this, I think currently we could still only extract such information from Nextclade.
Another way would be to use our tool covsonar
to basically call all substitutions, insertions and deletions with one tool and include this into the report... https://gitlab.com/s.fuchs/covsonar
To do so, we would need a process that generates the covsonar database for all genomes and then we could easily extract all information.
Yeah this is bad
Problem is, Nextclade output is still missing a 'aaInsertions' column (there is only: substitutions deletions insertions aaSubstitutions aaDeletions)
We could of course calculate it from 'insertions' though
@replikation what do you think, add or switch to covsonar?
Ahh, we would then need a converter from nt insertions reported by Nextclade to aa insertions. Something like
https://codon2nucleotide.theo.io/
CovSonar might solve that but here the only weak point is that the tool is also under development still (currently people are on a CovSonar2 version) so there might be certain changes. And it's not tested so extensively like Nextclade, etc...
Nextclade might take too long to add this: https://github.com/nextstrain/nextclade/issues/319
Maybe we should implement it ourselves at this time - just 'borrow' the code from codon2nucleotide and translate to AAs
Yes, seems fair to me. Then we still rely on Nextclade that anyway runs and just translate the nt insertions for the report. I thin having the code from https://github.com/theosanderson/codon2nucleotide as a small script in bin
should do the trick?
It's a little bit more complicated than that, see PR
In the upcoming release (today or tomorrow) Nextclade/align will output aa insertions. Maybe this is still relevant?
@corneliusroemer thanks for informing us
It seems Nextclade also provides information on insertions and not only deletions:
see [13]
Could be worth to integrate this as well.