nextstrain / nextclade

Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement
https://clades.nextstrain.org
MIT License
219 stars 61 forks source link

Nextalign CLI is not translating the correct sequence strand. #667

Closed dnanto closed 2 years ago

dnanto commented 2 years ago

Nextalign version: 1.8.0

Reproduction:

printf ">id\nATGACTGAATCCACTTAANNNTTAAGTGGATTCAGTCAT\n" > test.fasta
{
  echo . . gene 1 15 . + 0 gene_name=fwd
  echo . . gene 25 39 . - 0 gene_name=rev
} | tr ' ' '\t' > test.gff3
nextalign \
  -j 1 \
  --in-order \
  -i test.fasta \
  -r test.fasta \
  --genes=fwd,rev \
  --genemap test.gff3 \
  --output-dir . \
  --min-length 33

Output:

==> test.gene.fwd.fasta <==
>id
MTEST

==> test.gene.rev.fasta <==
>id
SGFSH

The current behavior obtains the correct translation, but for the incorrect strand. The expected output should correspond to the reverse strand: "MTEST".

corneliusroemer commented 2 years ago

Thanks for pointing out this inconsistency.

The strand direction from the genemap is currently not used, so technically Nextalign works as intended. I've relabelled this as an enhancement proposal.

You're the first to use Nextalign for DNA viruses ;)

dnanto commented 2 years ago

@corneliusroemer I'd love to contribute. Got the build environment set-up and started on a branch...

ivan-aksamentov commented 2 years ago

@dnanto Sorry for the troubles. Indeed, we implemented Nextclade to analyze SARS-CoV-2 initially, so other viruses got a bit neglected.

The best option it seems would be to check for gene.strand and if it's "-", then reverse-complement the result string here: https://github.com/nextstrain/nextclade/blob/54662409026f2cf69a048b39ea8f16e8d3396edd/packages/nextalign/src/translate/extractGene.cpp#L88

I hope that gene.strand is what it says it is. But that also needs to be double-checked. This field is not used anywhere currently.

I think we should have a reverse complement function somewhere in Nextclade's PCR primers code. But not sure about that. It may also use plain strings, not sequence strings (which use different char symbols). If there is such function, then it's fine to just copy the files over to Nextalign for now. I have plans to clean things up in the repo soon and I'll deduplicate it then.

Let me know if you need help.

dnanto commented 2 years ago

@ivan-aksamentov No need to apologize, I'm really grateful for these tools. That is basically what I did here. A quick test has the desired output:

root@02f0e7441320:/home# bash -x test.sh 
+ printf '>id\nATGACTGAATCCACTTAANNNTTAAGTGGATTCAGTCAT\n'
+ echo . . gene 1 15 . + 0 gene_name=fwd
+ echo . . gene 25 39 . - 0 gene_name=rev
+ tr ' ' '\t'
+ .build/Release-Static/packages/nextalign_cli/nextalign-Linux-x86_64 -j 1 --in-order -i test.fasta -r test.fasta --genes=fwd,rev --genemap test.gff3 --output-dir . --min-length 33
+ head test.gene.fwd.fasta test.gene.rev.fasta
==> test.gene.fwd.fasta <==
>id
MTEST

==> test.gene.rev.fasta <==
>id
MTEST

I'll add some testing before I make the pull request.

ivan-aksamentov commented 2 years ago

@dnanto Thanks!

Could you please also double-check that the reference peptides emitted when using the --include-reference flag are correct?

I am also wondering what will happen to Nextclade and, in particular, whether there are side effects for mutation calling and how this stuff will be visualized in Nextclade Web. Turns out extending to more viruses is trickier than we thought. But it seems this is only relevant when there's "-" strand in the gene map, so we can figure it out later.

dnanto commented 2 years ago

@ivan-aksamentov Can do, and yes, it will get trickier, especially when splicing and segmented genomes are involved. I'll finish-up the basic tests tomorrow, etc...

dnanto commented 2 years ago

closing issue, see pull request -> #673