simroux / VirSorter

Source code of the VirSorter tool, also available as an App on CyVerse/iVirus (https://de.iplantcollaborative.org/de/)
GNU General Public License v2.0
104 stars 30 forks source link

How to retrieve the coordinates of "viral" predicted regions #46

Closed genomics-pixel closed 4 years ago

genomics-pixel commented 4 years ago

Dear Simon,

Thank you for your kind help with issue#45 earlier this week.

Currently I am trying to predict viral regions within contigs using your VirSorter and I was wondering if there is a way to extract the exact starting and ending points (i.e. coordinates) of predicted "viral" region (prediction category 1, 2, 3) within the original contigs.

In cases where "prophage" (category 4, 5, 6) regions were predicted, I noticed that the fasta files in "Predicted_viral_sequences" directory have fasta headers with their cooridinates in their originating contigs.

However, when I looked at cases where "viral" regions (category 1, 2, 3) were predicted, there were no information on coordinates in the headers of fasta files in "Predicted_viral_sequences" directory.

I have tried looking through other result files to find the coordinates of "viral region" within contigs, but to no avail.

If you could help me with this matter I would be very grateful. Thank you in advance.

Sincerely, genomics-pixel

simroux commented 4 years ago

Hi,

The difference between categories 1/2/3 and 4/5/6 is that for the former, the whole contig was detected as viral, while the latter categories are predicted as "integrated provirus", i.e. the contig has a mix of host and virus regions. This is why you don't have coordinates for categories 1/2/3 (because the coordinates are just from start to end of the contig), but you do have coordinates for 4/5/6

Best, Simon

genomics-pixel commented 4 years ago

Dear Simon,

Thank you very much for your reply.

If I understand correctly, VirSorter predicts the whole contig as "viral" when more than 80% of the genes are predicted to be of viral origin.

However, there maybe cases in which there are host genome fragments flanking the "viral predicted regions" (i.e. the first predicted viral gene to the last predicted viral gene), even if more than 80% of the genes are predicted to be of viral origin.

These cases may not be often, but I would still very much like to take a look into the flanking host genome fragments of the "viral predicted regions" (even for category 1,2,3).

Therefore it would be great if there is a way to know the coordinates of the first viral gene to the last viral gene even when they are predicted as category 1, 2, and 3.

Thank you always for your support.

Sincerely, genomics-pixel

simroux commented 4 years ago

In our experience, contigs in category 1/2/3 are almost always entirely viral. However, if you want to understand how VirSorter did its prediction, you can look in the XX_global-phage-signal.csv file: columns 7 to 12 show the enrichment statistics, including the coordinates of the maximum enrichment region (if no coordinates, that means the maximum enrichment was found when considering the entire contig).

Best, Simon

genomics-pixel commented 4 years ago

Dear Simon,

Thank you for your informative help!

From your message and what I have learned from looking at the results, I made a list of steps to get the coordinates for all predicted categories (1, 2, 3, 4, 5 and 6).

If you could take a look to see if the steps below are fine that would be great.

--Beginning of steps to extract coordinates--

  1. Go on to "step 1" only If the current entry (i.e. viral predicted region) does not match to any of the following criteria.

    • The sequence is appointed as "circular"
      • EDIT: I've noticed that some contigs are annotated as having circular prophage region even though they are predicted as category 4, 5, and 6. Therefor, if I understand correctly, "circular" does not necessary mean that the whole contig is of viral origin
    • 7th to 12th column of each line of "VIRSorter_global-phage-signal.csv" has no "gene" word
  2. Examine 7th to 12th column of each line of "VIRSorter_global-phage-signal.csv" and take out gene regions.

    Ex.) gene_10-gene_14:XX,gene_8-gene_12:XX,gene_12-gene_16:XX (XX = gene enrichment score)

  3. Extract the coordinate of the above found genes from the 2nd and 3rd columns of "Metric_files/VIRSorter_affi-contigs.tab".

    Ex.) gene_8|1000|1500 gene_9|1600|2100 ... gene_15|4000|4500 gene_16|4600|5100

  4. Subtract 50bp from the smallest coodrinate number and take the result as the starting point.

    Ex.) 1000-50=950 -> 950bp is the starting point

  5. Add 50bp to the largest cooridnate number and take the result as the ending point.

    Ex.) 5100+50=5150 -> 5150bp is the ending point.

--end--

Again thank you for your help.

Sincerely, genomics-pixel

P.S. The above steps were written with the assumption that "VIRSorter_global-phage-signal.csv" describes about only one virus per line. If this assumption is wrong please let me know.

simroux commented 4 years ago

Hi,

This is all correct, and "VIRSorter_global-phage-signal.csv" indeed describes one virus (region) per line. Note that there can be multiple virus regions on a single contig, and these will be describe on separate lines.

Best, Simon

genomics-pixel commented 4 years ago

Dear Simon,

Thank you for checking the steps for extracting coordinates of viral region!

Unfortunately, when I have conducted the above steps, I found some problems.

Specifically, the start and end coordinates of viral predicted regions extracted with the above method did not match with the coordinates given by the headers of fasta files from "Predicted_viral_sequences/" directory in cases where the predicted viral region is identified to be both:

  1. circular and
  2. having a gene region which extends from one end to the other end (These problems were observed in prediction categories 4,5,6. However, because the fasta headers for category 1, 2 and 3 do not include coordinates, the same problem may have also occurred in these prediction categories.)

Below are example results from a case when the above described problem has occurred.

The header of fasta from "Predicted_viral_sequences" directory describes viral predicted region to be from "617" bp to "52,442" bp. VIRSorter_TESTCONTIG-circular_gene_82_gene_183-52442-617-cat_5 VIRSORTER_PREDICTED_VIRAL_SEQ_FOR_TESTCONTIG.fasta.txt

List of genes within the predicted viral region were taken from "VIRSorter_global-phage-signal.csv" and their location on contig was obtained from "Metric_files/VIRSorter_affi-contigs.tab". (These steps corresponds to Steps 1 and 2 of last post) With this approach viral predicted region was 517bp (567-50=517) to 152,104bp (152,054+50=152,104) TESTCONTIG_VIRSorter_global-phage-signal.csv.txt TESTCONTIG_VIRSorter_affi-contigs.tab.txt

Furthermore, a quick letter count of the above uploaded fasta file have shown the length of predicted viral region sequence to be 252,972bp , which does not match with the above numbers.

If you could help me to figure this out, I would be very grateful.

Sincerely, genomics-pixel

P.S. VirSorter (1.0.5) with the "--db2" option was used to calculate in the above example.

simroux commented 4 years ago

Hi,

So the coordinates look right to me: prophage starts at gene 82 which starts in 52492, so the "prophage" will start in 52492-50 = 52442 prophage ends at gene 183, which ends in 567, so the "prophage" will end in 567+50 = 617 (since it spans over the contig origin, its end is "before" its start).

The fasta file you got is wrong though, you're right, it should be ~ 99kb. Could you send the original input file ? I've tried to replicate this bug using the predicted viral sequence, but I get the expected length.

Thanks

genomics-pixel commented 4 years ago

Dear Simon,

I appreciate your help!

Since the whole input fasta file itself is quite large, I am sending you the fasta entry from which the prophage regions were predicted. CONTIG_FOR_SIMON.fasta.txt

(The above single fasta entry was taken out of the original input file which contains other fasta entries. If you would like to look at the whole input fasta file, it would be great if you could give me a private link to upload or an e-mail address)

I will also upload the following files. If you look through them, you will see that VirSorter actually predicts 2 viral regions which overlaps eachother CONTIG_FOR_SIMON_PREDICTED_SEQ.fasta.txt VIRSorter_global-phage-signal.csv.txt VIRSorter_affi-contigs.tab.txt

Sincerely genomcics-pixel .

simroux commented 4 years ago

Hi,

So I ran the contig, and got a single prophage prediction with an expected fasta file of 99,263 bp, which suggests the bug occurs when processing the whole input file. You can send the file (or a link for me to download it) at sroux (at) lbl dot gov.

Best, Simon

genomics-pixel commented 4 years ago

Dear Simon,

Thank you for taking a look into the contig fasta file. I have sent you an e-mail with a link and a password to download the whole input file. If you are unable to download or there are other information you need please let me know.

Again thank you for your help.

Sincerely, genomics-pixel

P.S. Just now, I conducted a quick VirSorter (ver 1.0.5 "--db 2") analysis with "CONTIG_FOR_SIMON.fasta.txt" (i.e. the subset of original whole input file which I shared few days ago), and received results with no viral region predicted in VIRSorter_global-phage-signal.csv.

wrapper_phage_contigs_sorter_iPlant.pl -f $input --db 2 --wdir $work_dir --ncpu 10

The same result was observed with Cyverse's VirSorter (1.0.3 "--db2").

However, when I did another analysis with VirSorter (ver 1.0.5 --db 2) the entire input file (the same file I shared via email), there were predicted viral region as expected.

Since you seem to have been able to detect viral regions with only one contig, I am wondering if I am doing something strange.

Thank you again for looking into this!

simroux commented 4 years ago

Hi,

After running the file "TEST_SAMPLE.contigs.fa" through VirSorter 1.05, I was not able to replicate the error you see: I have one prophage predicted for k251_103836_flag_3_multi_65_0004_len_152649_TEST_SAMPLE_TEST_CONTIG_ALPHA_COUNT_2228-circular, from gene 84 to 183, and the correct sequence of 99,263 bp is found in the output file "VIRSorter_prophages_cat-5.fasta".

My best guess is thus that something is wrong on your side, maybe you are not using the latest version of the scripts ? (re-downloading them from github may help). Other than that, I don't know what could cause this bug since I am not able to replicate it.

For your PS, VirSorter needs to be run with a specific option when most (or all) of the input sequences are viral, which is the case when running on a single viral contig. The way to run VirSorter in these cases is to add the option "--virome", which makes VirSorter use the "virome decontamination" mode, and enabled the prediction of viral sequences from datasets that are already mostly viral.

genomics-pixel commented 4 years ago

Dear Simon,

Thank you for testing the input file and telling me about the "--virome" option.

I agree that there maybe something wrong on my side. I will re-download the scripts from github and try it again.

Could you also please tell me the version of the tools listed in "dependencies" (such as blast, hmmer, etc.) you have used for your VirSorter analysis?
I would like to retry using the same version of "dependencies" tools.

Thank you.

SIncerely, genomics-pixel

simroux commented 4 years ago

I used exactly the conda environment described in the Readme, so the version of these tools should be the latest available in conda (note that I don't believe this is the issue here since your file "affi_contigs.tab" looks good).

genomics-pixel commented 4 years ago

Dear Simon,

Thank you for getting back to me about my question.

Since I am working on a computer system which I do not have administrative access to, I will talk to the system administrators if they can re-install the program.

It may take a while to get it re-installed and have it checked, but I will come back to you once the source of the problem is identified.

Again thank you for your generous help.

Sincerely, genomics-pixel

genomics-pixel commented 4 years ago

Dear Simon,

Thank you for your generous support earlier this month.

I have had the system administrator re-install the VirSorter tool via conda with the commands below

manual: https://github.com/simroux/VirSorter#using-a-conda-virtual-environment-tested-on-ubuntu-and-centos install commands:

$ conda create --name virsorter -c bioconda mcl=14.137 muscle blast perl-bioperl perl-file-which hmmer=3.1b2 perl-parallel-forkmanager perl-list-moreutils diamond=0.9.14 $ conda install --name virsorter -c bioconda metagene_annotator

However, the same problem still persisted; two overlapping viral region being detected with one of the predicted viral region having far longer length than input.

We noticed that the specific version number for muscle, blast, metagene_annotator was not specified in the install command given within the manual.

To check if the problem is due to difference in the version of the above tools, we would appreciate it very much if you could please share with us the result of "conda list" command in your VirSorter conda enviroment

Sincerely, genomics-pixel

simroux commented 4 years ago

Hi,

Here's the conda environment I am using for VirSorter and other tools (so there is more than VirSorter requires), hope this helps:

Name Version Build Channel
blast 2.7.1 boost1.64_3 bioconda
boost 1.64.0 py36_4 conda-forge
boost-cpp 1.64.0 1 conda-forge
bzip2 1.0.6 1 conda-forge
ca-certificates 2018.4.16 0 conda-forge
cairo 1.14.10 0 conda-forge
certifi 2018.4.16 py36_0 conda-forge
diamond 0.9.14 1 bioconda
expat 2.2.5 0 conda-forge
fontconfig 2.12.6 0 conda-forge
freetype 2.8.1 0 conda-forge
gettext 0.19.8.1 0 conda-forge
giflib 5.1.4 0 conda-forge
glib 2.55.0 0 conda-forge
gmp 6.1.2 0 conda-forge
gnutls 3.5.17 0 conda-forge
graphite2 1.3.11 0 conda-forge
graphviz 2.38.0 7 conda-forge
harfbuzz 1.7.6 0 conda-forge
hmmer 3.1b2 3 bioconda
icu 58.2 0 conda-forge
jpeg 9b 2 conda-forge
libdb 6.1.26 0 bioconda
libffi 3.2.1 3 conda-forge
libgcc 7.2.0 h69d50b8_2
libgcc-ng 7.2.0 hdf63c60_3
libgd 2.2.5 3 conda-forge
libgfortran-ng 7.2.0 hdf63c60_3
libiconv 1.15 0 conda-forge
libidn11 1.33 0 conda-forge
libopenblas 0.2.20 h9ac9557_7
libpng 1.6.34 0 conda-forge
libstdcxx-ng 7.2.0 hdf63c60_3
libtiff 4.0.9 0 conda-forge
libtool 2.4.6 0 conda-forge
libwebp 0.5.2 7 conda-forge
libxml2 2.9.8 0 conda-forge
libxslt 1.1.32 0 conda-forge
mcl 14.137 2 bioconda
muscle 3.8.1551 2 bioconda
ncurses 5.9 10 conda-forge
nettle 3.3 0 conda-forge
numpy 1.14.3 py36h28100ab_2
numpy-base 1.14.3 py36h0ea5e3f_1
openssl 1.0.2o 0 conda-forge
pango 1.40.14 0 conda-forge
pcre 8.41 1 conda-forge
perl 5.22.0.1 0 conda-forge
perl-aceperl 1.92 0 bioconda
perl-algorithm-munkres 0.08 0 bioconda
perl-app-cpanminus 1.7043 pl5.22.0_0 bioconda
perl-appconfig 1.71 0 bioconda
perl-archive-tar 2.18 pl5.22.0_2 bioconda
perl-array-compare 2.11 0 bioconda
perl-bio-asn1-entrezgene 1.72 1 bioconda
perl-bio-featureio 1.6.905 0 bioconda
perl-bio-phylo 0.58 0 bioconda
perl-bio-samtools 1.43 0 bioconda
perl-bioperl 1.6.924 pl5.22.0_7 bioconda
perl-bioperl-core 1.6.924 pl5.22.0_2 bioconda
perl-bioperl-run 1.0069 2 bioconda
perl-carp 1.38 pl5.22.0_0 bioconda
perl-cgi 4.22 3 bioconda
perl-class-inspector 1.28 0 bioconda
perl-class-method-modifiers 2.11 1 bioconda
perl-clone 0.38 0 bioconda
perl-common-sense 3.74 0 bioconda
perl-compress-raw-bzip2 2.069 1 bioconda
perl-compress-raw-zlib 2.069 3 bioconda
perl-convert-binary-c 0.78 0 bioconda
perl-convert-binhex 1.125 0 bioconda
perl-crypt-rc4 2.02 0 bioconda
perl-data-dumper 2.161 pl5.22.0_0 bioconda
perl-data-stag 0.14 0 bioconda
perl-db-file 1.835 5 bioconda
perl-dbd-sqlite 1.5 1 bioconda
perl-dbi 1.641 pl5.22.0_0 bioconda
perl-devel-globaldestruction 0.13 1 bioconda
perl-digest-hmac 1.03 pl5.22.0_1 bioconda
perl-digest-perl-md5 1.9 0 bioconda
perl-email-date-format 1.005 0 bioconda
perl-encode-locale 1.05 pl5.22.0_4 bioconda
perl-error 0.17024 0 bioconda
perl-exporter 5.72 pl5.22.0_0 bioconda
perl-exporter-tiny 0.042 1 bioconda
perl-extutils-makemaker 7.24 pl5.22.0_1 bioconda
perl-file-listing 6.04 0 bioconda
perl-file-slurp-tiny 0.004 0 bioconda
perl-file-sort 1.01 pl5.22.0_1 bioconda
perl-file-which 1.2 0 bioconda
perl-font-afm 1.2 0 bioconda
perl-gd 2.56 pl5.22.0_6 bioconda
perl-graph 0.9704 0 bioconda
perl-graphviz 2.2 1 bioconda
perl-html-element-extended 1.18 0 bioconda
perl-html-entities-numbered 0.04 0 bioconda
perl-html-formatter 2.14 0 bioconda
perl-html-parser 3.72 pl5.22.0_1 bioconda
perl-html-tableextract 2.13 0 bioconda
perl-html-tagset 3.2 pl5.22.0_1 bioconda
perl-html-tidy 1.56 1 bioconda
perl-html-tree 5.03 0 bioconda
perl-html-treebuilder-xpath 0.14 0 bioconda
perl-http-cookies 6.01 0 bioconda
perl-http-daemon 6.01 0 bioconda
perl-http-date 6.02 pl5.22.0_1 bioconda
perl-http-message 6.11 0 bioconda
perl-http-negotiate 6.01 0 bioconda
perl-image-info 1.38 0 bioconda
perl-image-size 3.3 0 bioconda
perl-io-compress 2.069 pl5.22.0_2 bioconda
perl-io-html 1.001 pl5.22.0_1 bioconda
perl-io-sessiondata 1.03 0 bioconda
perl-io-socket-ssl 2.024 0 bioconda
perl-io-string 1.08 pl5.22.0_1 bioconda
perl-io-stringy 2.111 0 bioconda
perl-io-tty 1.12 0 bioconda
perl-io-zlib 1.1 1 bioconda
perl-ipc-run 0.94 0 bioconda
perl-jcode 2.07 0 bioconda
perl-json 2.9 1 bioconda
perl-json-xs 2.34 0 bioconda
perl-libwww-perl 6.15 0 bioconda
perl-libxml-perl 0.08 0 bioconda
perl-list-moreutils 0.428 pl5.22.0_0 bioconda
perl-lwp-mediatypes 6.02 pl5.22.0_1 bioconda
perl-lwp-protocol-https 6.06 2 bioconda
perl-lwp-simple 6.15 3 bioconda
perl-mailtools 2.14 0 bioconda
perl-math-cdf 0.1 3 bioconda
perl-math-derivative 0.04 0 bioconda
perl-math-random 0.72 0 bioconda
perl-math-spline 0.02 0 bioconda
perl-mime-base64 3.15 pl5.22.0_0 bioconda
perl-mime-lite 3.03 0 bioconda
perl-mime-tools 5.507 0 bioconda
perl-mime-types 2.12 0 bioconda
perl-mldbm 2.05 0 bioconda
perl-module-runtime 0.014 1 bioconda
perl-moo 2.001 1 bioconda
perl-net-http 6.09 0 bioconda
perl-net-ssleay 1.84 pl5.22.0_0 bioconda
perl-ntlm 1.09 1 bioconda
perl-ole-storage_lite 0.19 0 bioconda
perl-parallel-forkmanager 1.17 0 bioconda
perl-parse-recdescent 1.967013 0 bioconda
perl-pathtools 3.73 0 bioconda
perl-pdf-api2 2.025 2 bioconda
perl-postscript 0.06 0 bioconda
perl-role-tiny 2.000001 1 bioconda
perl-scalar-list-utils 1.45 2 bioconda
perl-set-scalar 1.29 0 bioconda
perl-soap-lite 1.19 0 bioconda
perl-sort-naturally 1.03 0 bioconda
perl-spreadsheet-parseexcel 0.65 0 bioconda
perl-spreadsheet-writeexcel 2.4 0 bioconda
perl-statistics-descriptive 3.0612 0 bioconda
perl-sub-exporter-progressive 0.001011 1 bioconda
perl-svg 2.64 0 bioconda
perl-svg-graph 0.02 0 bioconda
perl-task-weaken 1.04 0 bioconda
perl-template-toolkit 2.26 0 bioconda
perl-test-leaktrace 0.16 pl5.22.0_0 bioconda
perl-test-more 1.001002 pl5.22.0_0 bioconda
perl-test-pod 1.51 0 bioconda
perl-threaded 5.22.0 pl5.22.0_12 bioconda
perl-tie-ixhash 1.23 pl5.22.0_1 bioconda
perl-timedate 2.3 0 bioconda
perl-tree-dag_node 1.29 0 bioconda
perl-type-tiny 1.000005 0 bioconda
perl-unicode-map 0.112 0 bioconda
perl-uri 1.71 pl5.22.0_1 bioconda
perl-www-robotrules 6.02 0 bioconda
perl-xml-dom 1.45 0 bioconda
perl-xml-dom-xpath 0.14 0 bioconda
perl-xml-filter-buffertext 1.01 0 bioconda
perl-xml-libxml 2.0124 0 bioconda
perl-xml-libxslt 1.94 0 bioconda
perl-xml-namespacesupport 1.11 0 bioconda
perl-xml-parser 2.44 4 bioconda
perl-xml-regexp 0.04 0 bioconda
perl-xml-sax 0.99 0 bioconda
perl-xml-sax-base 1.08 0 bioconda
perl-xml-sax-expat 0.51 0 bioconda
perl-xml-sax-writer 0.56 0 bioconda
perl-xml-simple 2.22 0 bioconda
perl-xml-twig 3.49 0 bioconda
perl-xml-writer 0.625 0 bioconda
perl-xml-xpath 1.33 0 bioconda
perl-xml-xpathengine 0.14 0 bioconda
perl-xsloader 0.22 pl5.22.0_0 bioconda
perl-yaml 1.24 0 bioconda
pip 9.0.3 py36_0 conda-forge
pixman 0.34.0 2 conda-forge
python 3.6.5 1 conda-forge
readline 7 0 conda-forge
setuptools 39.2.0 py36_0 conda-forge
sqlite 3.20.1 2 conda-forge
tidyp 1.04 0 bioconda
tk 8.6.7 0 conda-forge
wheel 0.31.0 py36_0 conda-forge
xz 5.2.3 0 conda-forge
zlib 1.2.11 h470a237_3 conda-forge
genomics-pixel commented 4 years ago

Dear Simon,

Thank you for giving us the information about the conda environment you have used for VirSorter!

We will see if matching our conda environment to yours listed above will solve the current problem. After we get the results, we will come back to you to share the outcome.

Again many thanks.

Sincerely, genomics-pixel

genomics-pixel commented 4 years ago

Dear Simon,

Thank you earlier for sharing with us the details of your conda enviroment!

Our system administrator have tried his best to create the same conda environment as yours, but we still found two viral regions (gene_82_gene_183 and gene_84_gene183) for contig "k251_103836".

Command used to run VirSorter: wrapper_phage_contigs_sorter_iPlant.pl -f [INPUT_PATH]/TEST_SAMPLE.contigs.fa --db 2 --wdir [OUTPUT_PATH]/TEST_SAMPLE.contigs.fa --ncpu 10

(Note: [INPUT_PATH]/TEST_SAMPLE.contigs.fa is the same fasta file of contigs as the one I have sent you before via e-mail)

Predicted viral regions for contig "k251_103836"

Header:">VIRSorter_k251_103836_flag_3_multi_65_0004_len_152649_TEST_SAMPLE_TEST_CONTIG_ALPHA_COUNT_2228-circular_gene_82_gene_183-52442-617-cat_5" Gene region: gene_82_gene_183 Length: 252,971bp <- longer than input contig sequence which is 103,836bp long

Header:">VIRSorter_k251_103836_flag_3_multi_65_0004_len_152649_TEST_SAMPLE_TEST_CONTIG_ALPHA_COUNT_2228-circular_gene_84_gene_183-53752-617-cat_5" Gene region: gene_84_gene_183 Length: 99,263bp

Other outputs of VirSorter: VIRSorter_global-phage-signal.csv.txt VIRSorter_affi-contigs.tab.txt

The "conda list" of our envrioment: conda_list.txt

However, when we modified the script to use "blastall -p blastp" instead of "blastp", the program has given 1 viral region for contig "k251_103836". (The reason we tried with "blastall -p blastp" cmd is because the Cyverse version of VirSorter is using this cmd instead of "blastp")

Predicted viral regions for contig "k251_103836"

Header:">VIRSorter_k251_103836_flag_3_multi_65_0004_len_152649_TEST_SAMPLE_TEST_CONTIG_ALPHA_COUNT_2228-circular_gene_82_gene_183-52442-617-cat5" Gene region: gene_82_gene_183_ Length: 100,573bp

Comparison between "wrapper_phage_contigs_sorter_iPlant.pl" script changed to use "blastall -p blastp" and the original version via "diff" unix command. VirSorter_Script_Change.txt

As you can see, the gene region detected with "blastall -p blastp" (i.e. "gene_82_gene_183") is slightly wider than the one you have detected in your test (i.e. "gene_84_gene_183").

In our tests we observed "gene_84_gene183" only when makeblastdb and blastp did not work due to different version of glibc being accidentally placed by mistake.

Therefore, we think that changing "blastp" to "blastall -p blastp" will have VirSorter in our conda environment to work as it has been intended to and that the actual viral region for this contig is "gene_82_gene_183". (Please correct us if we are wrong.)

If you could see through the above details and tell us whether or not it is good/appropriate to use the above "blastall -p blastp" implemented VirSorter for research that would be great.

Again many thanks for your help.

Sincerely, genomics-pixel

simroux commented 4 years ago

Hi,

Using "blastall -p blastp" is totally fine (and is what VirSorter used to do), so I'd recommend you run VirSorter that way.

Best, Simon

genomics-pixel commented 4 years ago

Dear Simon,

Thank you for confirming that the above change in VirSorter script in "wrapper_phage_contigs_sorter_iPlant.pl" is totally fine to use for research.

Because we are getting the gene region gene_82_gene_183 and not the gene region gene_84_gene_183 you have got, we are still worried if it is working the way it is supposed to.

Also if we use the above "blastall" using version of VirSorter, would you recommend us to have it noted in a paper.

Sincerely, Hiroki Nishiyama

simroux commented 4 years ago

Hi Hiroki,

These minor differences in prophage edge are not uncommon, and should be no cause for concern. However, you are 100% right that, to be fully reproducible, you should mention that VirSorter was run with the blast step computed using Blast 2.6 (which I suspect is the version of blast you're using) as follows: "blastall -p blastp ...". That way, a reader would be able to exactly reproduce your steps.

Best, Simon

genomics-pixel commented 4 years ago

Dear Simon,

Thank you for your swift reply!

I am terribly sorry to ask you many questions, but if you could explain a bit more on why these difference in prophage edge are not uncommon, I would appreciate it very much. (I want to be able to explain, if I am asked, about why there should be no concern with different results between "blastall -p blastp" and "blastp" when running with metagenomic assmblies from other samples.)

Thank you for your patience with me.

Sincerely, Hiroki Nishiyama

simroux commented 4 years ago

So there are two things here:

Best, Simon

genomics-pixel commented 4 years ago

Dear Simon,

Thank you very much for your detailed reply.

I now understand why there are minor difference in our test results and that these difference in viral region prediction between "blastall -p blastp", "blastp", and "diamond" is inevitable and should not be of concern.

I can not express enough gratitude for your generous help with this matter. Thank you very much.

Sincerely, Hiroki Nishiyama