replikation / poreCov

SARS-CoV-2 workflow for nanopore sequence data
https://case-group.github.io/
GNU General Public License v3.0

add aa insertions from Nextclade #177

Closed RaverJay closed 2 years ago

RaverJay commented 2 years ago

Just to see how painful it would be, this adds custom code to convert Nextclade's nt insertion call to aa insertions. Implements #82

image

Some things are very experimental:

So do you think we should include this? If so, I would clean this up a little, and then we test and merge.

Obviously it would be much better if Nextclade implemented this in their output
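For illustration, the nt-to-aa conversion described above could be sketched roughly as follows. This is not the code from this PR. It assumes that Nextclade reports nucleotide insertions as comma-separated `position:sequence` entries, and that the position is the 1-based reference coordinate after which the sequence is inserted. The toy reference, the gene coordinates, and the function names `parse_insertions` and `nt_insertion_to_aa` are hypothetical. Frameshifting insertions and insertions outside the gene are skipped, in the spirit of "when possible".

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(nt):
    """Translate complete codons only; unknown codons become 'X'."""
    usable = len(nt) - len(nt) % 3
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, usable, 3))


def parse_insertions(field):
    """Parse an assumed 'pos:seq,pos:seq' string into (pos, seq) tuples."""
    result = []
    for entry in filter(None, field.split(",")):
        pos, seq = entry.split(":")
        result.append((int(pos), seq.upper()))
    return result


def nt_insertion_to_aa(ref, gene_start, gene_end, ins_pos, ins_seq):
    """Return (aa_pos, ref_aa, alt_aa) or None if no clean conversion exists.

    gene_start/gene_end are 1-based inclusive; ins_seq is inserted after ins_pos.
    If the insertion falls between two codons, ref_aa is '' and alt_aa holds the
    new residues inserted after aa_pos. If it splits a codon, ref_aa is the
    affected residue at aa_pos and alt_aa is what replaces it.
    """
    if len(ins_seq) % 3 != 0:
        return None  # frameshift, not a plain aa insertion
    if not (gene_start <= ins_pos < gene_end):
        return None  # outside this gene
    cds = ref[gene_start - 1:gene_end]
    offset = ins_pos - gene_start + 1  # CDS bases before the insertion
    mutated = cds[:offset] + ins_seq + cds[offset:]
    first = offset // 3                # complete codons before the insertion
    n_new = len(ins_seq) // 3
    ref_aa, mut_aa = translate(cds), translate(mutated)
    if offset % 3 == 0:
        return (first, "", mut_aa[first:first + n_new])
    return (first + 1, ref_aa[first], mut_aa[first:first + n_new + 1])


# Toy example: 3 nt of padding, a 4-codon gene (M A K *), 3 nt of padding.
TOY_REF = "CCC" + "ATGGCTAAATGA" + "GGG"
toy_calls = [
    nt_insertion_to_aa(TOY_REF, 4, 15, pos, seq)
    for pos, seq in parse_insertions("9:GAGCCAGAA,10:GGG,10:GG")
]
```

In the toy example, the first insertion sits on a codon boundary, the second one splits a codon and therefore changes the flanking residue, and the third one is a frameshift and is skipped.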

hoelzer commented 2 years ago

Uff, yeah I see. That the Wuhan reference is hardcoded for now might be okay, bc/ it is used by everyone anyway to get comparable results.

I agree that we might have strange outlier sequences w/ many or long insertions, and then this fucks up the report a bit.

We should test this (e.g. I can also run some real current data sets here).

And then we could switch to the Nextclade aa insertion reporting later, but I also agree that it's unclear when this will be addressed and implemented.

So my vote is to have this in (especially bc/ of Omicron) and then do some testing. If the reports look fine, we merge and might later replace this with the Nextclade implementation.
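One way to keep outlier sequences with many or long insertions from blowing up a report cell would be to cap what gets displayed. The helper below is only a sketch of that idea; `shorten_insertions` and its default limits are hypothetical and not part of poreCov.

```python
def shorten_insertions(insertions, max_items=3, max_len=10):
    """Join insertion labels for display, capping their count and length."""
    shown = []
    for label in insertions[:max_items]:
        # Truncate overly long labels and mark the cut with an ellipsis.
        shown.append(label if len(label) <= max_len else label[:max_len] + "...")
    hidden = len(insertions) - len(shown)
    if hidden > 0:
        # Summarize everything that was left out.
        shown.append(f"(+{hidden} more)")
    return ", ".join(shown)
```

The full list of insertions would still have to be kept elsewhere (e.g. in a result table), since the shortened string loses information.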

RaverJay commented 2 years ago

I moved this out of the report process to a module+script, should be clean and working now. And if Nextclade reports it in the future, this can easily be changed back.

Please test =)

hoelzer commented 2 years ago

Nice, looks good!

One thing: could we also link to the "hijacked" GitHub repo for the nt2aa translation here?

Note: amino acid insertions are currently not reported directly by Nextclade, and were instead converted from nucleotide insertions with custom code when possible.

Besides, I would then also merge that and do a prerelease. I also ran this w/ ~20 Delta sequences, and no insertion was reported (which is correct).

hoelzer commented 2 years ago

and btw, we could actually also use SnpEff: there is a specific version for SARS-CoV-2 where you can pass the final VCF file, and it should then give you the amino acid translations, e.g. see here:

https://gitlab.com/RKIBioinformaticsPipelines/ncov_minipipe/-/blob/master/covpipe/rules/inspect_vars.smk

But, if this works atm we can also go with the usual system via Nextclade.

RaverJay commented 2 years ago

image added

hoelzer commented 2 years ago

great, thx @RaverJay! So I would merge this then. We can keep SnpEff in mind, but that would introduce a larger change.