simroux / VirSorter

Source code of the VirSorter tool, also available as an App on CyVerse/iVirus (https://de.iplantcollaborative.org/de/)
GNU General Public License v2.0
104 stars 30 forks source link

Outputs from Perl script and Cyverse version does not match #45

Closed genomics-pixel closed 4 years ago

genomics-pixel commented 5 years ago

To the developers of VirSorter,

First of all thank you for creating such a wonderful tool. I have used the downloadable perl script version and the Cyverse version of the VirSorter/1.0.3 with same parameters to analyze NC_000964 genome fasta file obtained from NCBI, but the predicted prophage regions did not match.

Cyverse

Cyverse_Virsorter_JobParameters.txt Cyverse_Virsorter_stdout_log.txt (Stderr in Cyverse was empty)

Perl script (logs_out.txt corresponds to the "out" file in "logs" directory of output)

wrapper_phage_contigs_sorter_iPlant.pl -f XXX/Bacillus_subtilis_subsp_subtilis_str_168.fasta --db 1 --wdir YYY/Bacillus_subtilis_subsp_subtilis_str_168 --ncpu 4 logs_out.txt ("err" in "logs" directory was empty)

If you could please look into this that would be great. Many thanks in advance.

simroux commented 5 years ago

Hi,

Based on the log file, the predictions are (nearly)identical, with the only difference between the 3rd prophage predicted from 2097642-2269513 vs 2097076-2286458, right ? Which database did you use in your command line run/Perl script ?

Best, Simon

genomics-pixel commented 5 years ago

Dear Simon,

Thank you very much for your quick response.

Yes, you are right. The main difference between the results from Perl Script and Cyverse script is the prediction for the 3rd prophage.

The Perl script version of VirSorter was run with the command below (XXX and YYY are abbreviated path names):

Therefore, I am assuming that the default parameter "VIRSorter" was used for the --dataset option (though I have not explicitly wrote that in the command line which was executed) and that Refseqdb was used since "--db 1" option was written in the command line.

The "$data_dir" in wrapper_phage_contigs_sorter_iPlant.pl is pointed to the directory "/data/" with the following objects: (Files downloaded on September 2015)

Again, thank you for your help!

Sincerely, genomics-pixel

simroux commented 5 years ago

Ok, so my best guess is that the viral databases used are different between CyVerse and your local installation. Could you compare the annotation files you get for the genome (the files named "...affi-contigs.csv") ?

genomics-pixel commented 5 years ago

Dear Simon,

Thank you for your help.

I could not find "...affi-contigs.csv", but there was a file named "Metric_files/VIRSorter_affi-contigs.tab" in both of the output given by the Perl script and Cyverse version of VirSorter. Hopefully these are the files you were looking for:

Cyverse version (added ".txt" at the end for the purpose of uploading to github) VIRSorter_affi-contigs.tab.txt

Perl script: wrapper_phage_contigs_sorter_iPlant.pl (added ".txt" at the end for the purpose of uploading to github) VIRSorter_affi-contigs.tab.txt

Quick check with "diff" unix command has shown that these two files are quite different.

It seems that the problem arises from differences in viral database between Cyverse and my local installation. Could you please show me where to get the viral database which Cyverse is currently using, so that we can check this?

Many thanks in advance.

Sincerely, genomics-pixel

simroux commented 5 years ago

Actually, the database seems to be ok. The difference seems to stem from a bug in the CyVerse run: in some cases, the BLAST score has an additional space in front of the digits, which I suspect is throwing off the parser (see for example VIRSorter_NC_000964_3_Bacillus_subtilis_subsp__subtilis_str__168_complete_genome-gene_4179). Most predictions will not be impacted at all, because they are based on HMM profile (the "Phage_cluster_XXX") rather than BLAST to singleton proteins, but in your case the 3rd prophage is all based on singleton matches.

Since the bug is only seen in CyVerse and not seen in the Perl version, I guess you'd have to ask CyVerse support to investigate and check where this extra space comes from ?

Best, Simon

genomics-pixel commented 4 years ago

Dear Simon,

Sorry for the late reply.

Thank you for checking the VIRSorter_affi-contigs.tab files and identifying the source of the problem. Since I have already started using the perl script version for my project, I am glad to know that the perl script version of VirSorter did not have the bug.

Thank you for your kind help!

Sincerely, genomics-pixel