hoelzer / virify

A Nextflow implementation of the EBI VIRify pipeline for the detection of viruses from metagenomic assemblies.
GNU General Public License v3.0

IMG/VR blast extension #17

Closed: hoelzer closed this issue 4 years ago

hoelzer commented 4 years ago

Background

As an additional validation step, we can check how many of the predicted viral sequences produce BLAST hits against a database of viral sequences that includes metagenomes: https://img.jgi.doe.gov/vr/

Idea

I just checked this again by running BLAST (megablast and blastn) with 8 threads against IMG/VR, which includes viruses from metagenomes and therefore makes the most sense to me.

Query: 1590 contigs (all high-confidence, low-confidence and putative prophage predictions from a virus-enriched groundwater sample)

Runtime: blastn 55 minutes, megablast 30 seconds

Restrictive filtering of the BLAST results (alignment length > 80% of the query length) left

92/1590 (blastn) and 33/1590 (megablast)

queries with hits against IMG/VR. These hits do not really tell us anything about taxonomy, but at least the matched sequences are also reported as (unclassified) viruses. The 33 are all included in the 92. For 5 of them, VIRify found some taxonomy.
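The counting behind these numbers can be sketched roughly as follows. This is only an illustration, assuming already filtered tabular BLAST output with the query ID in the first column; the contig and subject names are made up and are not taken from the real run.

```python
def queries_with_hits(lines):
    """Collect the distinct query IDs (first column) from tabular BLAST lines."""
    return {line.split("\t", 1)[0] for line in lines if line.strip()}


# Hypothetical filtered results; a query can appear several times
# if it hits more than one database sequence.
blastn_lines = [
    "contig_1\tIMGVR_a\t98.5",
    "contig_1\tIMGVR_b\t97.0",
    "contig_2\tIMGVR_c\t91.2",
    "contig_3\tIMGVR_d\t88.9",
]
megablast_lines = [
    "contig_1\tIMGVR_a\t98.5",
    "contig_3\tIMGVR_d\t88.9",
]

blastn_ids = queries_with_hits(blastn_lines)
megablast_ids = queries_with_hits(megablast_lines)

# Number of queries with at least one hit, per search mode.
print(len(blastn_ids), len(megablast_ids))

# True if every megablast query was also found by blastn.
print(megablast_ids <= blastn_ids)
```

Counting distinct query IDs rather than lines avoids counting a contig twice when it matches several IMG/VR entries.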

So we could simply add this to the pipeline and report, as additional information, whether a sequence has been seen before as 'some virus'.

BLAST Database location currently: /hps/nobackup2/metagenomics/mhoelzer/nextflow-results/virify/v1/kallies_2019/IMG_VR/IMG_VR_2018-07-01_4

Downloaded from: https://genome.jgi.doe.gov/portal/IMG_VR/IMG_VR.home.html

Commands

blastn -task blastn -num_threads 8 -query $ALL -db $DB -evalue 1e-10 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend qlen sstart send evalue bitscore slen" > $ALL.blast
awk '{if($4>0.8*$9){print $0}}' $ALL.blast

blastn -task megablast -num_threads 8 -query $ALL -db $DB -evalue 1e-10 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend qlen sstart send evalue bitscore slen" > $ALL.megablast
awk '{if($4>0.8*$9){print $0}}' $ALL.megablast
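For a Python port of the awk filter, a minimal sketch could look like the one below. It assumes the same 14-column `-outfmt 6` layout as in the commands above, where column 4 is the alignment length and column 9 is the query length. The function name `filter_hits`, the `min_fraction` parameter and the example rows are hypothetical and not taken from the existing scripts.

```python
# Column order follows the -outfmt string used in the blastn commands.
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart",
    "qend", "qlen", "sstart", "send", "evalue", "bitscore", "slen",
]


def filter_hits(rows, min_fraction=0.8):
    """Yield rows whose alignment length exceeds min_fraction of the query length."""
    for row in rows:
        hit = dict(zip(COLUMNS, row))
        # Same test as the awk one-liner: $4 > 0.8 * $9
        if int(hit["length"]) > min_fraction * int(hit["qlen"]):
            yield row


# Made-up example rows (already split on tabs).
rows = [
    # alignment covers 900 of 1000 query bases: kept
    ["contig_1", "IMGVR_a", "98.5", "900", "10", "2", "1", "900", "1000",
     "5", "904", "0.0", "1500", "5000"],
    # alignment covers only 300 of 1000 query bases: dropped
    ["contig_2", "IMGVR_b", "95.0", "300", "15", "0", "1", "300", "1000",
     "1", "300", "1e-50", "400", "4000"],
]

kept = list(filter_hits(rows))
for row in kept:
    print("\t".join(row))
```

Naming the columns instead of indexing by position keeps the filter readable if the `-outfmt` string changes, as long as `COLUMNS` is updated with it.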

@mberacochea I would implement this in Nextflow first and add the database to the FTP; I think a CWL translation is then straightforward. The visualization of the hits is something to think about later.

hoelzer commented 4 years ago

Basic BLAST and metadata-combine (Ruby script) steps implemented for first tests with the Nextflow pipeline.

mberacochea commented 4 years ago

CWL workflow in progress; the blast, filter and merge .cwl steps are ready. Ruby script migrated to Python.

Integration into the workflow is still missing.

https://github.com/EBI-Metagenomics/emg-virify-pipeline/pull/new/imgvr-workflow-step

Scripts: https://github.com/EBI-Metagenomics/emg-virify-scripts/commit/18c39f08aaf30bb8289a3d1e82f7e1caa12f919b

hoelzer commented 4 years ago

@hoelzer ToDo: implement the Python version of the merge script

mberacochea commented 4 years ago

@hoelzer there is one already in place: https://github.com/EBI-Metagenomics/emg-virify-scripts/blob/imgvr-scripts/virify_scripts/imgvr_merge.py