Closed genomics-pixel closed 4 years ago
Hi,
Based on the log file, the predictions are (nearly)identical, with the only difference between the 3rd prophage predicted from 2097642-2269513 vs 2097076-2286458, right ? Which database did you use in your command line run/Perl script ?
Best, Simon
Dear Simon,
Thank you very much for your quick response.
Yes, you are right. The main difference between the results from Perl Script and Cyverse script is the prediction for the 3rd prophage.
The Perl script version of VirSorter was run with the command below (XXX and YYY are abbreviated path names):
Therefore, I am assuming that the default parameter "VIRSorter" was used for the --dataset option (though I have not explicitly wrote that in the command line which was executed) and that Refseqdb was used since "--db 1" option was written in the command line.
The "$data_dir" in wrapper_phage_contigs_sorter_iPlant.pl is pointed to the directory "/data/" with the following objects: (Files downloaded on September 2015)
Again, thank you for your help!
Sincerely, genomics-pixel
Ok, so my best guess is that the viral databases used are different between CyVerse and your local installation. Could you compare the annotation files you get for the genome (the files named "...affi-contigs.csv") ?
Dear Simon,
Thank you for your help.
I could not find "...affi-contigs.csv", but there was a file named "Metric_files/VIRSorter_affi-contigs.tab" in both of the output given by the Perl script and Cyverse version of VirSorter. Hopefully these are the files you were looking for:
Cyverse version (added ".txt" at the end for the purpose of uploading to github) VIRSorter_affi-contigs.tab.txt
Perl script: wrapper_phage_contigs_sorter_iPlant.pl (added ".txt" at the end for the purpose of uploading to github) VIRSorter_affi-contigs.tab.txt
Quick check with "diff" unix command has shown that these two files are quite different.
It seems that the problem arises from differences in viral database between Cyverse and my local installation. Could you please show me where to get the viral database which Cyverse is currently using, so that we can check this?
Many thanks in advance.
Sincerely, genomics-pixel
Actually, the database seems to be ok. The difference seems to stem from a bug in the CyVerse run: in some cases, the BLAST score has an additional space in front of the digits, which I suspect is throwing off the parser (see for example VIRSorter_NC_000964_3_Bacillus_subtilis_subsp__subtilis_str__168_complete_genome-gene_4179). Most predictions will not be impacted at all, because they are based on HMM profile (the "Phage_cluster_XXX") rather than BLAST to singleton proteins, but in your case the 3rd prophage is all based on singleton matches.
Since the bug is only seen in CyVerse and not seen in the Perl version, I guess you'd have to ask CyVerse support to investigate and check where this extra space comes from ?
Best, Simon
Dear Simon,
Sorry for the late reply.
Thank you for checking the VIRSorter_affi-contigs.tab files and identifying the source of the problem. Since I have already started using the perl script version for my project, I am glad to know that the perl script version of VirSorter did not have the bug.
Thank you for your kind help!
Sincerely, genomics-pixel
To the developers of VirSorter,
First of all thank you for creating such a wonderful tool. I have used the downloadable perl script version and the Cyverse version of the VirSorter/1.0.3 with same parameters to analyze NC_000964 genome fasta file obtained from NCBI, but the predicted prophage regions did not match.
Cyverse
Perl script (logs_out.txt corresponds to the "out" file in "logs" directory of output)
If you could please look into this that would be great. Many thanks in advance.