Closed dnanto closed 2 years ago
Thanks for pointing out this inconsistency.
The strand direction from the genemap is currently not used, so technically Nextalign works as intended. I've relabelled this as an enhancement proposal.
You're the first to use Nextalign for DNA viruses ;)
@corneliusroemer I'd love to contribute. Got the build environment set-up and started on a branch...
@dnanto Sorry for the troubles. Indeed, we implemented Nextclade to analyze SARS-CoV-2 initially, so other viruses got a bit neglected.
The best option it seems would be to check for gene.strand
and if it's "-"
, then reverse-complement the result
string here:
https://github.com/nextstrain/nextclade/blob/54662409026f2cf69a048b39ea8f16e8d3396edd/packages/nextalign/src/translate/extractGene.cpp#L88
I hope that gene.strand
is what it says it is. But that also needs to be double-checked. This field is not used anywhere currently.
I think we should have a reverse complement function somewhere in Nextclade's PCR primers code. But not sure about that. It may also use plain strings, not sequence strings (which use different char symbols). If there is such function, then it's fine to just copy the files over to Nextalign for now. I have plans to clean things up in the repo soon and I'll deduplicate it then.
Let me know if you need help.
@ivan-aksamentov No need to apologize, I'm really grateful for these tools. That is basically what I did here. A quick test has the desired output:
root@02f0e7441320:/home# bash -x test.sh
+ printf '>id\nATGACTGAATCCACTTAANNNTTAAGTGGATTCAGTCAT\n'
+ echo . . gene 1 15 . + 0 gene_name=fwd
+ echo . . gene 25 39 . - 0 gene_name=rev
+ tr ' ' '\t'
+ .build/Release-Static/packages/nextalign_cli/nextalign-Linux-x86_64 -j 1 --in-order -i test.fasta -r test.fasta --genes=fwd,rev --genemap test.gff3 --output-dir . --min-length 33
+ head test.gene.fwd.fasta test.gene.rev.fasta
==> test.gene.fwd.fasta <==
>id
MTEST
==> test.gene.rev.fasta <==
>id
MTEST
I'll add some testing before I make the pull request.
@dnanto Thanks!
Could you please also double-check that the reference peptides emitted when using the --include-reference
flag are correct?
I am also wondering what will happen to Nextclade and, in particular, whether there are side effects for mutation calling and how this stuff will be visualized in Nextclade Web. Turns out extending to more viruses is trickier than we thought. But it seems this is only relevant when there's "-" strand in the gene map, so we can figure it out later.
@ivan-aksamentov Can do, and yes, it will get trickier, especially when splicing and segmented genomes are involved. I'll finish-up the basic tests tomorrow, etc...
closing issue, see pull request -> #673
Nextalign version: 1.8.0
Reproduction:
Output:
The current behavior obtains the correct translation, but for the incorrect strand. The expected output should correspond to the reverse strand: "MTEST".