Closed genomics-pixel closed 4 years ago
Hi,
The difference between categories 1/2/3 and 4/5/6 is that for the former, the whole contig was detected as viral, while the latter categories are predicted as "integrated provirus", i.e. the contig has a mix of host and virus regions. This is why you don't have coordinates for categories 1/2/3 (because the coordinates are just from start to end of the contig), but you do have coordinates for 4/5/6
Best, Simon
Dear Simon,
Thank you very much for your reply.
If I understand correctly, VirSorter predicts the whole contig as "viral" when more than 80% of the genes are predicted to be of viral origin.
However, there maybe cases in which there are host genome fragments flanking the "viral predicted regions" (i.e. the first predicted viral gene to the last predicted viral gene), even if more than 80% of the genes are predicted to be of viral origin.
These cases may not be often, but I would still very much like to take a look into the flanking host genome fragments of the "viral predicted regions" (even for category 1,2,3).
Therefore it would be great if there is a way to know the coordinates of the first viral gene to the last viral gene even when they are predicted as category 1, 2, and 3.
Thank you always for your support.
Sincerely, genomics-pixel
In our experience, contigs in category 1/2/3 are almost always entirely viral. However, if you want to understand how VirSorter did its prediction, you can look in the XX_global-phage-signal.csv file: columns 7 to 12 show the enrichment statistics, including the coordinates of the maximum enrichment region (if no coordinates, that means the maximum enrichment was found when considering the entire contig).
Best, Simon
Dear Simon,
Thank you for your informative help!
From your message and what I have learned from looking at the results, I made a list of steps to get the coordinates for all predicted categories (1, 2, 3, 4, 5 and 6).
If you could take a look to see if the steps below are fine that would be great.
--Beginning of steps to extract coordinates--
Go on to "step 1" only If the current entry (i.e. viral predicted region) does not match to any of the following criteria.
Examine 7th to 12th column of each line of "VIRSorter_global-phage-signal.csv" and take out gene regions.
Ex.) gene_10-gene_14:XX,gene_8-gene_12:XX,gene_12-gene_16:XX (XX = gene enrichment score)
Extract the coordinate of the above found genes from the 2nd and 3rd columns of "Metric_files/VIRSorter_affi-contigs.tab".
Ex.) gene_8|1000|1500 gene_9|1600|2100 ... gene_15|4000|4500 gene_16|4600|5100
Subtract 50bp from the smallest coodrinate number and take the result as the starting point.
Ex.) 1000-50=950 -> 950bp is the starting point
Add 50bp to the largest cooridnate number and take the result as the ending point.
Ex.) 5100+50=5150 -> 5150bp is the ending point.
--end--
Again thank you for your help.
Sincerely, genomics-pixel
P.S. The above steps were written with the assumption that "VIRSorter_global-phage-signal.csv" describes about only one virus per line. If this assumption is wrong please let me know.
Hi,
This is all correct, and "VIRSorter_global-phage-signal.csv" indeed describes one virus (region) per line. Note that there can be multiple virus regions on a single contig, and these will be describe on separate lines.
Best, Simon
Dear Simon,
Thank you for checking the steps for extracting coordinates of viral region!
Unfortunately, when I have conducted the above steps, I found some problems.
Specifically, the start and end coordinates of viral predicted regions extracted with the above method did not match with the coordinates given by the headers of fasta files from "Predicted_viral_sequences/" directory in cases where the predicted viral region is identified to be both:
Below are example results from a case when the above described problem has occurred.
The header of fasta from "Predicted_viral_sequences" directory describes viral predicted region to be from "617" bp to "52,442" bp. VIRSorter_TESTCONTIG-circular_gene_82_gene_183-52442-617-cat_5 VIRSORTER_PREDICTED_VIRAL_SEQ_FOR_TESTCONTIG.fasta.txt
List of genes within the predicted viral region were taken from "VIRSorter_global-phage-signal.csv" and their location on contig was obtained from "Metric_files/VIRSorter_affi-contigs.tab". (These steps corresponds to Steps 1 and 2 of last post) With this approach viral predicted region was 517bp (567-50=517) to 152,104bp (152,054+50=152,104) TESTCONTIG_VIRSorter_global-phage-signal.csv.txt TESTCONTIG_VIRSorter_affi-contigs.tab.txt
Furthermore, a quick letter count of the above uploaded fasta file have shown the length of predicted viral region sequence to be 252,972bp , which does not match with the above numbers.
If you could help me to figure this out, I would be very grateful.
Sincerely, genomics-pixel
P.S. VirSorter (1.0.5) with the "--db2" option was used to calculate in the above example.
Hi,
So the coordinates look right to me: prophage starts at gene 82 which starts in 52492, so the "prophage" will start in 52492-50 = 52442 prophage ends at gene 183, which ends in 567, so the "prophage" will end in 567+50 = 617 (since it spans over the contig origin, its end is "before" its start).
The fasta file you got is wrong though, you're right, it should be ~ 99kb. Could you send the original input file ? I've tried to replicate this bug using the predicted viral sequence, but I get the expected length.
Thanks
Dear Simon,
I appreciate your help!
Since the whole input fasta file itself is quite large, I am sending you the fasta entry from which the prophage regions were predicted. CONTIG_FOR_SIMON.fasta.txt
(The above single fasta entry was taken out of the original input file which contains other fasta entries. If you would like to look at the whole input fasta file, it would be great if you could give me a private link to upload or an e-mail address)
I will also upload the following files. If you look through them, you will see that VirSorter actually predicts 2 viral regions which overlaps eachother CONTIG_FOR_SIMON_PREDICTED_SEQ.fasta.txt VIRSorter_global-phage-signal.csv.txt VIRSorter_affi-contigs.tab.txt
Sincerely genomcics-pixel .
Hi,
So I ran the contig, and got a single prophage prediction with an expected fasta file of 99,263 bp, which suggests the bug occurs when processing the whole input file. You can send the file (or a link for me to download it) at sroux (at) lbl dot gov.
Best, Simon
Dear Simon,
Thank you for taking a look into the contig fasta file. I have sent you an e-mail with a link and a password to download the whole input file. If you are unable to download or there are other information you need please let me know.
Again thank you for your help.
Sincerely, genomics-pixel
P.S. Just now, I conducted a quick VirSorter (ver 1.0.5 "--db 2") analysis with "CONTIG_FOR_SIMON.fasta.txt" (i.e. the subset of original whole input file which I shared few days ago), and received results with no viral region predicted in VIRSorter_global-phage-signal.csv.
wrapper_phage_contigs_sorter_iPlant.pl -f $input --db 2 --wdir $work_dir --ncpu 10
The same result was observed with Cyverse's VirSorter (1.0.3 "--db2").
However, when I did another analysis with VirSorter (ver 1.0.5 --db 2) the entire input file (the same file I shared via email), there were predicted viral region as expected.
Since you seem to have been able to detect viral regions with only one contig, I am wondering if I am doing something strange.
Thank you again for looking into this!
Hi,
After running the file "TEST_SAMPLE.contigs.fa" through VirSorter 1.05, I was not able to replicate the error you see: I have one prophage predicted for k251_103836_flag_3_multi_65_0004_len_152649_TEST_SAMPLE_TEST_CONTIG_ALPHA_COUNT_2228-circular, from gene 84 to 183, and the correct sequence of 99,263 bp is found in the output file "VIRSorter_prophages_cat-5.fasta".
My best guess is thus that something is wrong on your side, maybe you are not using the latest version of the scripts ? (re-downloading them from github may help). Other than that, I don't know what could cause this bug since I am not able to replicate it.
For your PS, VirSorter needs to be run with a specific option when most (or all) of the input sequences are viral, which is the case when running on a single viral contig. The way to run VirSorter in these cases is to add the option "--virome", which makes VirSorter use the "virome decontamination" mode, and enabled the prediction of viral sequences from datasets that are already mostly viral.
Dear Simon,
Thank you for testing the input file and telling me about the "--virome" option.
I agree that there maybe something wrong on my side. I will re-download the scripts from github and try it again.
Could you also please tell me the version of the tools listed in "dependencies"
(such as blast, hmmer, etc.) you have used for your VirSorter analysis?
I would like to retry using the same version of "dependencies" tools.
Thank you.
SIncerely, genomics-pixel
I used exactly the conda environment described in the Readme, so the version of these tools should be the latest available in conda (note that I don't believe this is the issue here since your file "affi_contigs.tab" looks good).
Dear Simon,
Thank you for getting back to me about my question.
Since I am working on a computer system which I do not have administrative access to, I will talk to the system administrators if they can re-install the program.
It may take a while to get it re-installed and have it checked, but I will come back to you once the source of the problem is identified.
Again thank you for your generous help.
Sincerely, genomics-pixel
Dear Simon,
Thank you for your generous support earlier this month.
I have had the system administrator re-install the VirSorter tool via conda with the commands below
manual: https://github.com/simroux/VirSorter#using-a-conda-virtual-environment-tested-on-ubuntu-and-centos install commands:
$ conda create --name virsorter -c bioconda mcl=14.137 muscle blast perl-bioperl perl-file-which hmmer=3.1b2 perl-parallel-forkmanager perl-list-moreutils diamond=0.9.14 $ conda install --name virsorter -c bioconda metagene_annotator
However, the same problem still persisted; two overlapping viral region being detected with one of the predicted viral region having far longer length than input.
We noticed that the specific version number for muscle, blast, metagene_annotator was not specified in the install command given within the manual.
To check if the problem is due to difference in the version of the above tools, we would appreciate it very much if you could please share with us the result of "conda list" command in your VirSorter conda enviroment
Sincerely, genomics-pixel
Hi,
Here's the conda environment I am using for VirSorter and other tools (so there is more than VirSorter requires), hope this helps:
Name | Version | Build | Channel |
---|---|---|---|
blast | 2.7.1 | boost1.64_3 | bioconda |
boost | 1.64.0 | py36_4 | conda-forge |
boost-cpp | 1.64.0 | 1 | conda-forge |
bzip2 | 1.0.6 | 1 | conda-forge |
ca-certificates | 2018.4.16 | 0 | conda-forge |
cairo | 1.14.10 | 0 | conda-forge |
certifi | 2018.4.16 | py36_0 | conda-forge |
diamond | 0.9.14 | 1 | bioconda |
expat | 2.2.5 | 0 | conda-forge |
fontconfig | 2.12.6 | 0 | conda-forge |
freetype | 2.8.1 | 0 | conda-forge |
gettext | 0.19.8.1 | 0 | conda-forge |
giflib | 5.1.4 | 0 | conda-forge |
glib | 2.55.0 | 0 | conda-forge |
gmp | 6.1.2 | 0 | conda-forge |
gnutls | 3.5.17 | 0 | conda-forge |
graphite2 | 1.3.11 | 0 | conda-forge |
graphviz | 2.38.0 | 7 | conda-forge |
harfbuzz | 1.7.6 | 0 | conda-forge |
hmmer | 3.1b2 | 3 | bioconda |
icu | 58.2 | 0 | conda-forge |
jpeg | 9b | 2 | conda-forge |
libdb | 6.1.26 | 0 | bioconda |
libffi | 3.2.1 | 3 | conda-forge |
libgcc | 7.2.0 | h69d50b8_2 | |
libgcc-ng | 7.2.0 | hdf63c60_3 | |
libgd | 2.2.5 | 3 | conda-forge |
libgfortran-ng | 7.2.0 | hdf63c60_3 | |
libiconv | 1.15 | 0 | conda-forge |
libidn11 | 1.33 | 0 | conda-forge |
libopenblas | 0.2.20 | h9ac9557_7 | |
libpng | 1.6.34 | 0 | conda-forge |
libstdcxx-ng | 7.2.0 | hdf63c60_3 | |
libtiff | 4.0.9 | 0 | conda-forge |
libtool | 2.4.6 | 0 | conda-forge |
libwebp | 0.5.2 | 7 | conda-forge |
libxml2 | 2.9.8 | 0 | conda-forge |
libxslt | 1.1.32 | 0 | conda-forge |
mcl | 14.137 | 2 | bioconda |
muscle | 3.8.1551 | 2 | bioconda |
ncurses | 5.9 | 10 | conda-forge |
nettle | 3.3 | 0 | conda-forge |
numpy | 1.14.3 | py36h28100ab_2 | |
numpy-base | 1.14.3 | py36h0ea5e3f_1 | |
openssl | 1.0.2o | 0 | conda-forge |
pango | 1.40.14 | 0 | conda-forge |
pcre | 8.41 | 1 | conda-forge |
perl | 5.22.0.1 | 0 | conda-forge |
perl-aceperl | 1.92 | 0 | bioconda |
perl-algorithm-munkres | 0.08 | 0 | bioconda |
perl-app-cpanminus | 1.7043 | pl5.22.0_0 | bioconda |
perl-appconfig | 1.71 | 0 | bioconda |
perl-archive-tar | 2.18 | pl5.22.0_2 | bioconda |
perl-array-compare | 2.11 | 0 | bioconda |
perl-bio-asn1-entrezgene | 1.72 | 1 | bioconda |
perl-bio-featureio | 1.6.905 | 0 | bioconda |
perl-bio-phylo | 0.58 | 0 | bioconda |
perl-bio-samtools | 1.43 | 0 | bioconda |
perl-bioperl | 1.6.924 | pl5.22.0_7 | bioconda |
perl-bioperl-core | 1.6.924 | pl5.22.0_2 | bioconda |
perl-bioperl-run | 1.0069 | 2 | bioconda |
perl-carp | 1.38 | pl5.22.0_0 | bioconda |
perl-cgi | 4.22 | 3 | bioconda |
perl-class-inspector | 1.28 | 0 | bioconda |
perl-class-method-modifiers | 2.11 | 1 | bioconda |
perl-clone | 0.38 | 0 | bioconda |
perl-common-sense | 3.74 | 0 | bioconda |
perl-compress-raw-bzip2 | 2.069 | 1 | bioconda |
perl-compress-raw-zlib | 2.069 | 3 | bioconda |
perl-convert-binary-c | 0.78 | 0 | bioconda |
perl-convert-binhex | 1.125 | 0 | bioconda |
perl-crypt-rc4 | 2.02 | 0 | bioconda |
perl-data-dumper | 2.161 | pl5.22.0_0 | bioconda |
perl-data-stag | 0.14 | 0 | bioconda |
perl-db-file | 1.835 | 5 | bioconda |
perl-dbd-sqlite | 1.5 | 1 | bioconda |
perl-dbi | 1.641 | pl5.22.0_0 | bioconda |
perl-devel-globaldestruction | 0.13 | 1 | bioconda |
perl-digest-hmac | 1.03 | pl5.22.0_1 | bioconda |
perl-digest-perl-md5 | 1.9 | 0 | bioconda |
perl-email-date-format | 1.005 | 0 | bioconda |
perl-encode-locale | 1.05 | pl5.22.0_4 | bioconda |
perl-error | 0.17024 | 0 | bioconda |
perl-exporter | 5.72 | pl5.22.0_0 | bioconda |
perl-exporter-tiny | 0.042 | 1 | bioconda |
perl-extutils-makemaker | 7.24 | pl5.22.0_1 | bioconda |
perl-file-listing | 6.04 | 0 | bioconda |
perl-file-slurp-tiny | 0.004 | 0 | bioconda |
perl-file-sort | 1.01 | pl5.22.0_1 | bioconda |
perl-file-which | 1.2 | 0 | bioconda |
perl-font-afm | 1.2 | 0 | bioconda |
perl-gd | 2.56 | pl5.22.0_6 | bioconda |
perl-graph | 0.9704 | 0 | bioconda |
perl-graphviz | 2.2 | 1 | bioconda |
perl-html-element-extended | 1.18 | 0 | bioconda |
perl-html-entities-numbered | 0.04 | 0 | bioconda |
perl-html-formatter | 2.14 | 0 | bioconda |
perl-html-parser | 3.72 | pl5.22.0_1 | bioconda |
perl-html-tableextract | 2.13 | 0 | bioconda |
perl-html-tagset | 3.2 | pl5.22.0_1 | bioconda |
perl-html-tidy | 1.56 | 1 | bioconda |
perl-html-tree | 5.03 | 0 | bioconda |
perl-html-treebuilder-xpath | 0.14 | 0 | bioconda |
perl-http-cookies | 6.01 | 0 | bioconda |
perl-http-daemon | 6.01 | 0 | bioconda |
perl-http-date | 6.02 | pl5.22.0_1 | bioconda |
perl-http-message | 6.11 | 0 | bioconda |
perl-http-negotiate | 6.01 | 0 | bioconda |
perl-image-info | 1.38 | 0 | bioconda |
perl-image-size | 3.3 | 0 | bioconda |
perl-io-compress | 2.069 | pl5.22.0_2 | bioconda |
perl-io-html | 1.001 | pl5.22.0_1 | bioconda |
perl-io-sessiondata | 1.03 | 0 | bioconda |
perl-io-socket-ssl | 2.024 | 0 | bioconda |
perl-io-string | 1.08 | pl5.22.0_1 | bioconda |
perl-io-stringy | 2.111 | 0 | bioconda |
perl-io-tty | 1.12 | 0 | bioconda |
perl-io-zlib | 1.1 | 1 | bioconda |
perl-ipc-run | 0.94 | 0 | bioconda |
perl-jcode | 2.07 | 0 | bioconda |
perl-json | 2.9 | 1 | bioconda |
perl-json-xs | 2.34 | 0 | bioconda |
perl-libwww-perl | 6.15 | 0 | bioconda |
perl-libxml-perl | 0.08 | 0 | bioconda |
perl-list-moreutils | 0.428 | pl5.22.0_0 | bioconda |
perl-lwp-mediatypes | 6.02 | pl5.22.0_1 | bioconda |
perl-lwp-protocol-https | 6.06 | 2 | bioconda |
perl-lwp-simple | 6.15 | 3 | bioconda |
perl-mailtools | 2.14 | 0 | bioconda |
perl-math-cdf | 0.1 | 3 | bioconda |
perl-math-derivative | 0.04 | 0 | bioconda |
perl-math-random | 0.72 | 0 | bioconda |
perl-math-spline | 0.02 | 0 | bioconda |
perl-mime-base64 | 3.15 | pl5.22.0_0 | bioconda |
perl-mime-lite | 3.03 | 0 | bioconda |
perl-mime-tools | 5.507 | 0 | bioconda |
perl-mime-types | 2.12 | 0 | bioconda |
perl-mldbm | 2.05 | 0 | bioconda |
perl-module-runtime | 0.014 | 1 | bioconda |
perl-moo | 2.001 | 1 | bioconda |
perl-net-http | 6.09 | 0 | bioconda |
perl-net-ssleay | 1.84 | pl5.22.0_0 | bioconda |
perl-ntlm | 1.09 | 1 | bioconda |
perl-ole-storage_lite | 0.19 | 0 | bioconda |
perl-parallel-forkmanager | 1.17 | 0 | bioconda |
perl-parse-recdescent | 1.967013 | 0 | bioconda |
perl-pathtools | 3.73 | 0 | bioconda |
perl-pdf-api2 | 2.025 | 2 | bioconda |
perl-postscript | 0.06 | 0 | bioconda |
perl-role-tiny | 2.000001 | 1 | bioconda |
perl-scalar-list-utils | 1.45 | 2 | bioconda |
perl-set-scalar | 1.29 | 0 | bioconda |
perl-soap-lite | 1.19 | 0 | bioconda |
perl-sort-naturally | 1.03 | 0 | bioconda |
perl-spreadsheet-parseexcel | 0.65 | 0 | bioconda |
perl-spreadsheet-writeexcel | 2.4 | 0 | bioconda |
perl-statistics-descriptive | 3.0612 | 0 | bioconda |
perl-sub-exporter-progressive | 0.001011 | 1 | bioconda |
perl-svg | 2.64 | 0 | bioconda |
perl-svg-graph | 0.02 | 0 | bioconda |
perl-task-weaken | 1.04 | 0 | bioconda |
perl-template-toolkit | 2.26 | 0 | bioconda |
perl-test-leaktrace | 0.16 | pl5.22.0_0 | bioconda |
perl-test-more | 1.001002 | pl5.22.0_0 | bioconda |
perl-test-pod | 1.51 | 0 | bioconda |
perl-threaded | 5.22.0 | pl5.22.0_12 | bioconda |
perl-tie-ixhash | 1.23 | pl5.22.0_1 | bioconda |
perl-timedate | 2.3 | 0 | bioconda |
perl-tree-dag_node | 1.29 | 0 | bioconda |
perl-type-tiny | 1.000005 | 0 | bioconda |
perl-unicode-map | 0.112 | 0 | bioconda |
perl-uri | 1.71 | pl5.22.0_1 | bioconda |
perl-www-robotrules | 6.02 | 0 | bioconda |
perl-xml-dom | 1.45 | 0 | bioconda |
perl-xml-dom-xpath | 0.14 | 0 | bioconda |
perl-xml-filter-buffertext | 1.01 | 0 | bioconda |
perl-xml-libxml | 2.0124 | 0 | bioconda |
perl-xml-libxslt | 1.94 | 0 | bioconda |
perl-xml-namespacesupport | 1.11 | 0 | bioconda |
perl-xml-parser | 2.44 | 4 | bioconda |
perl-xml-regexp | 0.04 | 0 | bioconda |
perl-xml-sax | 0.99 | 0 | bioconda |
perl-xml-sax-base | 1.08 | 0 | bioconda |
perl-xml-sax-expat | 0.51 | 0 | bioconda |
perl-xml-sax-writer | 0.56 | 0 | bioconda |
perl-xml-simple | 2.22 | 0 | bioconda |
perl-xml-twig | 3.49 | 0 | bioconda |
perl-xml-writer | 0.625 | 0 | bioconda |
perl-xml-xpath | 1.33 | 0 | bioconda |
perl-xml-xpathengine | 0.14 | 0 | bioconda |
perl-xsloader | 0.22 | pl5.22.0_0 | bioconda |
perl-yaml | 1.24 | 0 | bioconda |
pip | 9.0.3 | py36_0 | conda-forge |
pixman | 0.34.0 | 2 | conda-forge |
python | 3.6.5 | 1 | conda-forge |
readline | 7 | 0 | conda-forge |
setuptools | 39.2.0 | py36_0 | conda-forge |
sqlite | 3.20.1 | 2 | conda-forge |
tidyp | 1.04 | 0 | bioconda |
tk | 8.6.7 | 0 | conda-forge |
wheel | 0.31.0 | py36_0 | conda-forge |
xz | 5.2.3 | 0 | conda-forge |
zlib | 1.2.11 | h470a237_3 | conda-forge |
Dear Simon,
Thank you for giving us the information about the conda environment you have used for VirSorter!
We will see if matching our conda environment to yours listed above will solve the current problem. After we get the results, we will come back to you to share the outcome.
Again many thanks.
Sincerely, genomics-pixel
Dear Simon,
Thank you earlier for sharing with us the details of your conda enviroment!
Our system administrator have tried his best to create the same conda environment as yours, but we still found two viral regions (gene_82_gene_183 and gene_84_gene183) for contig "k251_103836".
Command used to run VirSorter: wrapper_phage_contigs_sorter_iPlant.pl -f [INPUT_PATH]/TEST_SAMPLE.contigs.fa --db 2 --wdir [OUTPUT_PATH]/TEST_SAMPLE.contigs.fa --ncpu 10
(Note: [INPUT_PATH]/TEST_SAMPLE.contigs.fa is the same fasta file of contigs as the one I have sent you before via e-mail)
Predicted viral regions for contig "k251_103836"
Header:">VIRSorter_k251_103836_flag_3_multi_65_0004_len_152649_TEST_SAMPLE_TEST_CONTIG_ALPHA_COUNT_2228-circular_gene_82_gene_183-52442-617-cat_5" Gene region: gene_82_gene_183 Length: 252,971bp <- longer than input contig sequence which is 103,836bp long
Header:">VIRSorter_k251_103836_flag_3_multi_65_0004_len_152649_TEST_SAMPLE_TEST_CONTIG_ALPHA_COUNT_2228-circular_gene_84_gene_183-53752-617-cat_5" Gene region: gene_84_gene_183 Length: 99,263bp
Other outputs of VirSorter: VIRSorter_global-phage-signal.csv.txt VIRSorter_affi-contigs.tab.txt
The "conda list" of our envrioment: conda_list.txt
However, when we modified the script to use "blastall -p blastp" instead of "blastp", the program has given 1 viral region for contig "k251_103836". (The reason we tried with "blastall -p blastp" cmd is because the Cyverse version of VirSorter is using this cmd instead of "blastp")
Predicted viral regions for contig "k251_103836"
Header:">VIRSorter_k251_103836_flag_3_multi_65_0004_len_152649_TEST_SAMPLE_TEST_CONTIG_ALPHA_COUNT_2228-circular_gene_82_gene_183-52442-617-cat5" Gene region: gene_82_gene_183_ Length: 100,573bp
Comparison between "wrapper_phage_contigs_sorter_iPlant.pl" script changed to use "blastall -p blastp" and the original version via "diff" unix command. VirSorter_Script_Change.txt
As you can see, the gene region detected with "blastall -p blastp" (i.e. "gene_82_gene_183") is slightly wider than the one you have detected in your test (i.e. "gene_84_gene_183").
In our tests we observed "gene_84_gene183" only when makeblastdb and blastp did not work due to different version of glibc being accidentally placed by mistake.
Therefore, we think that changing "blastp" to "blastall -p blastp" will have VirSorter in our conda environment to work as it has been intended to and that the actual viral region for this contig is "gene_82_gene_183". (Please correct us if we are wrong.)
If you could see through the above details and tell us whether or not it is good/appropriate to use the above "blastall -p blastp" implemented VirSorter for research that would be great.
Again many thanks for your help.
Sincerely, genomics-pixel
Hi,
Using "blastall -p blastp" is totally fine (and is what VirSorter used to do), so I'd recommend you run VirSorter that way.
Best, Simon
Dear Simon,
Thank you for confirming that the above change in VirSorter script in "wrapper_phage_contigs_sorter_iPlant.pl" is totally fine to use for research.
Because we are getting the gene region gene_82_gene_183 and not the gene region gene_84_gene_183 you have got, we are still worried if it is working the way it is supposed to.
Also if we use the above "blastall" using version of VirSorter, would you recommend us to have it noted in a paper.
Sincerely, Hiroki Nishiyama
Hi Hiroki,
These minor differences in prophage edge are not uncommon, and should be no cause for concern. However, you are 100% right that, to be fully reproducible, you should mention that VirSorter was run with the blast step computed using Blast 2.6 (which I suspect is the version of blast you're using) as follows: "blastall -p blastp ...". That way, a reader would be able to exactly reproduce your steps.
Best, Simon
Dear Simon,
Thank you for your swift reply!
I am terribly sorry to ask you many questions, but if you could explain a bit more on why these difference in prophage edge are not uncommon, I would appreciate it very much. (I want to be able to explain, if I am asked, about why there should be no concern with different results between "blastall -p blastp" and "blastp" when running with metagenomic assmblies from other samples.)
Thank you for your patience with me.
Sincerely, Hiroki Nishiyama
So there are two things here:
Best, Simon
Dear Simon,
Thank you very much for your detailed reply.
I now understand why there are minor difference in our test results and that these difference in viral region prediction between "blastall -p blastp", "blastp", and "diamond" is inevitable and should not be of concern.
I can not express enough gratitude for your generous help with this matter. Thank you very much.
Sincerely, Hiroki Nishiyama
Dear Simon,
Thank you for your kind help with issue#45 earlier this week.
Currently I am trying to predict viral regions within contigs using your VirSorter and I was wondering if there is a way to extract the exact starting and ending points (i.e. coordinates) of predicted "viral" region (prediction category 1, 2, 3) within the original contigs.
In cases where "prophage" (category 4, 5, 6) regions were predicted, I noticed that the fasta files in "Predicted_viral_sequences" directory have fasta headers with their cooridinates in their originating contigs.
However, when I looked at cases where "viral" regions (category 1, 2, 3) were predicted, there were no information on coordinates in the headers of fasta files in "Predicted_viral_sequences" directory.
I have tried looking through other result files to find the coordinates of "viral region" within contigs, but to no avail.
If you could help me with this matter I would be very grateful. Thank you in advance.
Sincerely, genomics-pixel