simroux / VirSorter

Source code of the VirSorter tool, also available as an App on CyVerse/iVirus (https://de.iplantcollaborative.org/de/)
GNU General Public License v2.0

Issue running virsorter with diamond using conda and wrapper perl script. #64

Closed elsherbini closed 4 years ago

elsherbini commented 4 years ago

I'm having trouble getting virsorter to run on ~300 genomes using conda and the wrapper script. Unfortunately, I can't use the docker image on our cluster. The first error in the error log, for every genome, is that it can't find the output from diamond. I'm using diamond instead of blastp because, no matter which blast version I installed from conda, I kept getting segmentation faults at the blastp step.

I installed these dependencies in a fresh environment:

channels:
 - conda-forge
 - bioconda
dependencies:
 - mcl=14.137
 - muscle
 - blast
 - perl-bioperl
 - perl-file-which
 - hmmer=3.1b2
 - perl-parallel-forkmanager
 - perl-list-moreutils
 - diamond=0.9.14
 - metagene_annotator
 - openssl

I'm running it through the wrapper script (launched by snakemake, so the wildcards get filled in for each genome):

wrapper_phage_contigs_sorter_iPlant.pl --diamond -d {wildcards.genome} -f {input} --data-dir {params.data_dir} --ncpu {threads} --db {params.db} --wdir temp/virsorter_out/{wildcards.genome}

The first error in the error log says that it can't find the diamond output:

No such file or directory
Error: Error opening file temp/virsorter_out/10N28655E12/r_0/db/Pool_new_unclustered.dmnd
Can't open 'temp/virsorter_out/10N28655E12/Contigs_prots_vs_Phage_Gene_unclustered.tab' for reading: 'No such file or directory' at /nfs/polzlab001/josephe/projects/find_prophage/virsorter_hail_mary/VirSorter/Scripts/Step_2_merge_contigs_annotation.pl line 79
Can't open 'temp/virsorter_out/10N28655E12/10N28655E12_affi-contigs.csv' for reading: 'No such file or directory' at /nfs/polzlab001/josephe/projects/find_prophage/virsorter/VirSorter/Scripts/Step_3_highlight_phage_signal.pl line 64
Can't open 'temp/virsorter_out/10N28655E12/10N28655E12_phage-signal.csv' for reading: 'No such file or directory' at /nfs/polzlab001/josephe/projects/find_prophage/virsorter/VirSorter/Scripts/Step_4_summarize_phage_signal.pl line 84

I've commented out the code that cleans up the db folder, here is what's in there after a failed run:

4.0M -rwxrwxr-x 1 josephe polzlab 4.0M Feb 13 19:49 Blast_unclustered.tab
   0 -rwxrwxr-x 1 josephe polzlab    0 Feb 13 19:49 formatdb.log
4.0M -rwxrwxr-x 1 josephe polzlab 4.0M Feb 13 19:49 Phage_Clusters_current.tab
4.0M -rw-rw-r-- 1 josephe polzlab 4.0M Feb 13 19:49 phage_protein_14-03_RefseqABVir-plus-viromes.faa
   0 -rwxrwxr-x 1 josephe polzlab    0 Feb 13 19:49 Pool_clusters.hmm
4.0M -rwxrwxr-x 1 josephe polzlab 4.0M Feb 13 19:49 Pool_clusters.hmm.h3f
828K -rwxrwxr-x 1 josephe polzlab 827K Feb 13 19:49 Pool_clusters.hmm.h3i
4.0M -rwxrwxr-x 1 josephe polzlab 4.0M Feb 13 19:49 Pool_clusters.hmm.h3m
4.0M -rwxrwxr-x 1 josephe polzlab 4.0M Feb 13 19:49 Pool_clusters.hmm.h3p
4.0M -rwxrwxr-x 1 josephe polzlab 4.0M Feb 13 19:49 Pool_new_unclustered.faa
3.2M -rwxrwxr-x 1 josephe polzlab 3.2M Feb 13 19:49 Pool_new_unclustered.phr
272K -rwxrwxr-x 1 josephe polzlab 269K Feb 13 19:49 Pool_new_unclustered.pin
4.0M -rwxrwxr-x 1 josephe polzlab 4.0M Feb 13 19:49 Pool_new_unclustered.psq
4.0M -rwxrwxr-x 1 josephe polzlab 4.0M Feb 13 19:49 Pool_unclustered.faa

Here is the out log:

/nfs/polzlab001/josephe/projects/find_prophage/virsorter/VirSorter/Scripts/Step_1_contigs_cleaning_and_gene_prediction.pl 10N28654E4 temp/virsorter_out/10N28654E4/fasta temp/virsorter_out/10N28654E4/fasta/input_sequences.fna 2
mga (/nfs/polzlab001/josephe/projects/find_prophage/.snakemake/conda/4a1c5e7a/bin/mga) temp/virsorter_out/10N28654E4/fasta/10N28654E4_nett.fasta -m > temp/virsorter_out/10N28654E4/fasta/10N28654E4_mga.predict
we exclude 10N28654E4_MCTB01000044_1_Vibrio_cyclitrophicus_strain_10N_286_54_E4_10N_286_54_E4_contig_7__whole_genome_shotgun_sequence-gene_60 because there is a pblm with the sequence -> too many succesive
... # about 10 genes thrown out here
/nfs/polzlab001/josephe/projects/find_prophage/.snakemake/conda/4a1c5e7a/bin/hmmsearch --tblout temp/virsorter_out/10N28654E4/Contigs_prots_vs_PFAMa.tab --cpu 10 -o temp/virsorter_out/10N28654E4/Contigs_prots_vs_PFAMa.out --noali /nobackup1/josephe/PROPHAGES/virsorter/virsorter-data/PFAM_27/Pfam-A.hmm temp/virsorter_out/10N28654E4/fasta/10N28654E4_prots.fasta
/nfs/polzlab001/josephe/projects/find_prophage/.snakemake/conda/4a1c5e7a/bin/hmmsearch --tblout temp/virsorter_out/10N28654E4/Contigs_prots_vs_PFAMb.tab --cpu 10 -o temp/virsorter_out/10N28654E4/Contigs_prots_vs_PFAMb.out --noali /nobackup1/josephe/PROPHAGES/virsorter/virsorter-data/PFAM_27/Pfam-B.hmm temp/virsorter_out/10N28654E4/fasta/10N28654E4_prots.fasta
/nfs/polzlab001/josephe/projects/find_prophage/.snakemake/conda/4a1c5e7a/bin/diamond blastp --query temp/virsorter_out/10N28654E4/fasta/10N28654E4_prots.fasta --db temp/virsorter_out/10N28654E4/r_0/db/Pool_new_unclustered --out temp/virsorter_out/10N28654E4/r_0/Contigs_prots_vs_New_unclustered.tab --threads 10 --outfmt 6 -b 1 --more-sensitive -k 500 --evalue 0.001
diamond v0.9.14.115 | by Benjamin Buchfink <buchfink@gmail.com>
Licensed under the GNU AGPL <https://www.gnu.org/licenses/agpl.txt>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 10
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 500
Temporary directory: temp/virsorter_out/10N28654E4/r_0
Opening the database...  [0.000171s]
/nfs/polzlab001/josephe/projects/find_prophage/virsorter/VirSorter/Scripts/Step_2_merge_contigs_annotation.pl temp/virsorter_out/10N28654E4/fasta/10N28654E4_mga_final.predict temp/virsorter_out/10N28654E4/Contigs_prots_vs_Phage_Gene_Catalog.tab temp/virsorter_out/10N28654E4/Contigs_prots_vs_Phage_Gene_unclustered.tab temp/virsorter_out/10N28654E4/Contigs_prots_vs_PFAMa.tab temp/virsorter_out/10N28654E4/Contigs_prots_vs_PFAMb.tab /nobackup1/josephe/PROPHAGES/virsorter/virsorter-data/Phage_gene_catalog_plus_viromes/Phage_Clusters_current.tab temp/virsorter_out/10N28654E4/10N28654E4_affi-contigs.csv
/nfs/polzlab001/josephe/projects/find_prophage/virsorter/VirSorter/Scripts/Step_3_highlight_phage_signal.pl -csv temp/virsorter_out/10N28654E4/10N28654E4_affi-contigs.csv -out temp/virsorter_out/10N28654E4/10N28654E4_phage-signal.csv -n_cpu 10 -no_c 0
## Taking information from the contig info file (temp/virsorter_out/10N28654E4/10N28654E4_affi-contigs.csv)
/nfs/polzlab001/josephe/projects/find_prophage/virsorter/VirSorter/Scripts/Step_4_summarize_phage_signal.pl temp/virsorter_out/10N28654E4/10N28654E4_affi-contigs.csv temp/virsorter_out/10N28654E4/10N28654E4_phage-signal.csv temp/virsorter_out/10N28654E4/10N28654E4_global-phage-signal.csv temp/virsorter_out/10N28654E4/10N28654E4_new_prot_list.csv
/nfs/polzlab001/josephe/projects/find_prophage/virsorter/VirSorter/Scripts/Step_5_get_phage_fasta-gb.pl 10N28654E4 temp/virsorter_out/10N28654E4
Code 10N28654E4
The sequences will be put in:
 - temp/virsorter_out/10N28654E4/Predicted_viral_sequences/10N28654E4_cat-1.fasta
 - temp/virsorter_out/10N28654E4/Predicted_viral_sequences/10N28654E4_cat-2.fasta
 - temp/virsorter_out/10N28654E4/Predicted_viral_sequences/10N28654E4_cat-3.fasta
 - temp/virsorter_out/10N28654E4/Predicted_viral_sequences/10N28654E4_prophages_cat-4.fasta
 - temp/virsorter_out/10N28654E4/Predicted_viral_sequences/10N28654E4_prophages_cat-5.fasta
 - temp/virsorter_out/10N28654E4/Predicted_viral_sequences/10N28654E4_prophages_cat-6.fasta
Checking 'temp/virsorter_out/10N28654E4/10N28654E4_phage-signal.csv'
10N28654E4  in progress

simroux commented 4 years ago

I've never seen this, but the segmentation fault at the blast step doesn't look good :-/ Does it also happen with a smaller input (e.g. a subset of the 300 genomes)? I wonder whether something odd in one of the sequences could explain it.

Two other things may be worth trying: checking the "err" log, and running the diamond line (i.e. "/nfs/polzlab001/josephe/projects/find_prophage/.snakemake/conda/4a1c5e7a/bin/diamond blastp --query temp/virsorter_out/10N28654E4/fasta/10N28654E4_prots.fasta --db temp/virsorter_out/10N28654E4/r_0/db/Pool_new_unclustered --out temp/virsorter_out/10N28654E4/r_0/Contigs_prots_vs_New_unclustered.tab --threads 10 --outfmt 6 -b 1 --more-sensitive -k 500 --evalue 0.001") separately to check its output (VirSorter sometimes doesn't capture the full list of errors / stdout).
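
One way to rerun a single step on its own, as suggested, is a small Python wrapper that keeps the exit code, stdout and stderr separate. This is only a sketch: `run_and_capture` is a hypothetical helper, not part of VirSorter, and the demo command is a stand-in that should be replaced by the diamond line copied from your own log.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd (a list of arguments) and return (exit code, stdout, stderr)."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError as exc:
        # The executable itself is missing (e.g. wrong conda environment).
        return 127, "", str(exc)
    return result.returncode, result.stdout, result.stderr


# Stand-in command so the sketch runs anywhere; swap in the diamond
# command from the log, split into a list of arguments.
code, out, err = run_and_capture([sys.executable, "-c", "print('hello')"])
print("exit code:", code)
print("stdout:", out.strip())
print("stderr:", err.strip())
```

A non-zero exit code together with the captured stderr is usually enough to see whether the step failed on a missing input, a missing database, or a crash.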

elsherbini commented 4 years ago

Thank you for your reply! The issue seems to be that I was using an oooold version of the virsorter data that didn't include the diamond file.
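
A quick pre-flight check along these lines might catch the problem before launching all the genomes. It is a sketch under assumptions: `find_diamond_dbs` and `check_data_dir` are hypothetical helpers, and they rely only on diamond databases carrying the `.dmnd` extension (as in the error message above), without assuming where in the data directory they live.

```python
from pathlib import Path


def find_diamond_dbs(data_dir):
    """Return all non-empty *.dmnd files found anywhere under data_dir."""
    root = Path(data_dir)
    return sorted(
        p for p in root.rglob("*.dmnd") if p.is_file() and p.stat().st_size > 0
    )


def check_data_dir(data_dir):
    """Return a one-line verdict on whether data_dir looks usable with --diamond."""
    dbs = find_diamond_dbs(data_dir)
    if not dbs:
        return f"no .dmnd files under {data_dir}: data may predate diamond support"
    return f"found {len(dbs)} .dmnd file(s) under {data_dir}"
```

Calling `print(check_data_dir("/path/to/virsorter-data"))` (the path is a placeholder) before the snakemake run would flag a data directory that has no diamond database at all.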

simroux commented 4 years ago

Good to know, I'll add this to the Readme. Thanks!