usafsam / mad_river_wf

SARS-CoV-2 analysis workflow, using Nextflow and bbtools
Apache License 2.0
Add VADR analysis to Mad River #3

Closed by friesac 2 years ago

friesac commented 2 years ago

Ideally, Mad River would write the VADR results to a new output directory, specifically the output of v-annotate.pl as described at https://github.com/ncbi/vadr/wiki/Coronavirus-annotation#howto.

I ran the latest staphb container successfully from: https://hub.docker.com/r/staphb/vadr

I ran the command below, taken from the #howto, on my MacBook Pro with 2 CPUs. It got through about 500 sequences rather quickly.

v-annotate.pl --split --cpu 8 --glsearch -s -r --nomisc --mkey sarscov2 --lowsim5seq 6 --lowsim3seq 6 --alt_fail lowscore,insertnn,deletinn --mdir <sarscov2-models-dir-path> <fasta-file-to-annotate> <output-directory-to-create>
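For scripting this call inside the workflow, the command above can be assembled as an argument list. This is only a minimal Python sketch, assuming v-annotate.pl is on the PATH inside the container. The helper name `build_vannotate_cmd`, its parameters, and the example paths are hypothetical; the flags are copied from the command above.

```python
import shlex


def build_vannotate_cmd(mdir, fasta, outdir, cpu=8):
    """Return the v-annotate.pl argument list for SARS-CoV-2 annotation.

    The flags are copied from the command quoted in this thread. The helper
    itself is hypothetical and not part of mad_river_wf.
    """
    return [
        "v-annotate.pl",
        "--split", "--cpu", str(cpu),
        "--glsearch", "-s", "-r", "--nomisc",
        "--mkey", "sarscov2",
        "--lowsim5seq", "6", "--lowsim3seq", "6",
        "--alt_fail", "lowscore,insertnn,deletinn",
        "--mdir", str(mdir),
        str(fasta), str(outdir),
    ]


if __name__ == "__main__":
    # Placeholder paths, for illustration only.
    cmd = build_vannotate_cmd("/path/to/models", "consensus.fasta", "vadr_out", cpu=2)
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting problems with file names.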

One issue I ran into is that the sequence names in the consensus output fasta from mad_river are too long. So, we'll have to trim the extraneous information that comes out of iVar.
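One way to do that trimming is a small post-processing step on the consensus fasta. The sketch below is an illustration, not the workflow's actual code: the header pattern `Consensus_<sample>..._threshold_...` is an assumption about what iVar emits, and the function names are hypothetical. Headers that do not match the assumed pattern are passed through, cut at the first whitespace.

```python
import re

# ASSUMPTION: iVar-style headers look like
# "Consensus_<sample>[.consensus]_threshold_<t>_quality_<q>".
# Adjust the pattern to the headers mad_river really produces.
_IVAR_HEADER = re.compile(r"^Consensus_(?P<sample>.+?)(?:\.consensus)?_threshold_.*$")


def shorten_header(header):
    """Return a short sequence name for one FASTA header (without the '>')."""
    tokens = header.split()
    name = tokens[0] if tokens else header
    match = _IVAR_HEADER.match(name)
    return match.group("sample") if match else name


def shorten_fasta(lines):
    """Yield FASTA lines with shortened headers; sequence lines are unchanged."""
    for line in lines:
        if line.startswith(">"):
            yield ">" + shorten_header(line[1:].strip())
        else:
            yield line.rstrip("\n")
```

Applied to an open file, `"\n".join(shorten_fasta(handle))` gives the renamed fasta text.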

fanninpm commented 2 years ago

> One issue I ran into is that the sequence names in the consensus output fasta from mad_river are too long. So, we'll have to trim the extraneous information that comes out of iVar.

TIL that ivar consensus accepts the -i flag, which sets the name written on the fasta header line. (The -i flag isn't documented in the manual entry for ivar consensus; I only stumbled upon it while browsing the source code.) This will also obviate the need for the part of the performance_lineage_excel.py script that matches these unnecessarily long sample names with those from the Illumina sample sheet.
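A rough Python sketch of how the consensus step might pass a short name, assuming the -i flag behaves as described here. The helper name `build_ivar_consensus_cmd` is hypothetical, and ivar consensus still expects the pileup on stdin (typically piped from samtools mpileup), which this sketch leaves out.

```python
def build_ivar_consensus_cmd(prefix, sample_name):
    """Return an `ivar consensus` argument list that sets a short FASTA header.

    -p is the output prefix; -i is the header-name flag reported in this
    thread (found in the iVar source, not in the manual entry).
    """
    if not sample_name or any(ch.isspace() for ch in sample_name):
        raise ValueError("sample_name must be non-empty and contain no whitespace")
    return ["ivar", "consensus", "-p", str(prefix), "-i", sample_name]
```

Using the sample ID from the Illumina sample sheet as `sample_name` would keep the fasta headers and the sheet in agreement without any later matching step.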

friesac commented 2 years ago

> TIL that ivar consensus accepts the -i flag

Good find!

I forgot to mention above, but we need to make sure fasta-trim-terminal-ambigs.pl is run first in the container to create a trimmed fasta before v-annotate.pl runs. We can use the --minlen 50 --maxlen 30000 arguments described on the VADR wiki. I think this is consistent with GenBank.
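The trimming step could be wrapped along these lines. This is a sketch under two assumptions: that fasta-trim-terminal-ambigs.pl is on the PATH in the container, and that it writes the trimmed fasta to stdout, as in the wiki usage. The function names and the injectable `run` parameter are hypothetical.

```python
import subprocess


def build_trim_cmd(fasta, minlen=50, maxlen=30000):
    """Return the trimmer argument list with the length limits from the VADR wiki."""
    return [
        "fasta-trim-terminal-ambigs.pl",
        "--minlen", str(minlen),
        "--maxlen", str(maxlen),
        str(fasta),
    ]


def trim_fasta(fasta, trimmed, run=subprocess.run):
    """Run the trimmer and capture its stdout into the file `trimmed`.

    ASSUMPTION: the script prints the trimmed sequences to stdout.
    `run` is injectable so the step can be exercised without VADR installed.
    """
    with open(trimmed, "w") as out:
        run(build_trim_cmd(fasta), stdout=out, check=True)
    return trimmed
```

The path returned by `trim_fasta` is then what gets passed to v-annotate.pl as the fasta to annotate.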