simroux / VirSorter

Source code of the VirSorter tool, also available as an App on CyVerse/iVirus (https://de.iplantcollaborative.org/de/)
GNU General Public License v2.0

Multiple predicted viral regions in similar area of contig. #48

Closed genomics-pixel closed 4 years ago

genomics-pixel commented 4 years ago

Dear Simon,

Thank you for your help with the earlier issue.

Though it seemed that the problems were solved, one issue unfortunately persisted. (I made a new issue as this may be a different problem and because the last one was quite long.)

When I ran VirSorter/1.0.5 on some of my other assemblies, I noticed that multiple viral regions were predicted in nearly identical areas of the same input contig.

Ex.)
>VIRSorter_k251_224725_flag_0_multi_37_0000_len_22303_ID_SAMPLE2_COUNT3641gene_9_gene_20-8464-21493-cat_4
>VIRSorter_k251_224725_flag_0_multi_37_0000_len_22303_ID_SAMPLE2_COUNT3641gene_9_gene_18-8464-18107-cat_4

I have tried using blastall and also the following commands with diamond and the "--no_c" option, but the same thing appeared.

In other cases, I have seen a nearly identical pair of predicted viral regions assigned to different categories (e.g., 4 and 5).

Is this the expected behavior of VirSorter or am I getting something wrong?

Sincerely, genomics-pixel

Edit: I went through your paper on VirSorter published in PeerJ and came across the "Sequence metrics summary" paragraph, which states that VirSorter merges overlapping predictions. I am not sure, but perhaps the current problem is related to this part of the prediction step?

genomics-pixel commented 4 years ago

Dear Simon,

My apologies for sending another message in succession. I took a further look into the analysis results and summarized what I have found.

Before getting into the details, the following command was executed. (The input file is the same as the one used in the above post, but with shortened FASTA headers for simplicity.)

As mentioned in the above post, I noticed that there were multiple predicted viral regions in a similar area of the same contig. In the case of the "k251z78343" contig, there were two very similar regions, "gene 93 to gene 229" and "gene 93 to gene 248":
>VIRSorter_k251z78343_gene_93_gene_248-71397-207802-cat_5
>VIRSorter_k251z78343_gene_93_gene_229-71397-188491-cat_5
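Duplicates like the pair above can be flagged automatically. The following is a minimal Python sketch, not part of VirSorter. It assumes the header layout seen in this thread, `<contig>_gene_<first>_gene_<last>-<start>-<end>-cat_<category>`, and assumes that the two numbers after the gene range are start and end coordinates on the contig. The function names `parse_header` and `find_overlaps` are hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations

# Assumed header layout, inferred from the examples in this thread:
# <contig>_gene_<first>_gene_<last>-<start>-<end>-cat_<category>
HEADER_RE = re.compile(
    r"^>?(?P<contig>.+?)_?gene_(?P<first>\d+)_gene_(?P<last>\d+)"
    r"-(?P<start>\d+)-(?P<end>\d+)-cat_(?P<cat>\d+)$"
)


def parse_header(header):
    """Return (contig, start, end, category), or None if the header does not match."""
    m = HEADER_RE.match(header.strip())
    if m is None:
        return None
    return m["contig"], int(m["start"]), int(m["end"]), int(m["cat"])


def find_overlaps(headers):
    """Report every pair of predicted regions on the same contig that share bases."""
    by_contig = defaultdict(list)
    for header in headers:
        parsed = parse_header(header)
        if parsed is not None:
            contig, start, end, _cat = parsed
            by_contig[contig].append((header.strip(), start, end))

    overlaps = []
    for contig, regions in by_contig.items():
        for (h1, s1, e1), (h2, s2, e2) in combinations(regions, 2):
            # Shared length in bases, treating both coordinates as inclusive (assumption).
            shared = min(e1, e2) - max(s1, s2) + 1
            if shared > 0:
                overlaps.append((contig, h1, h2, shared))
    return overlaps
```

Passing the header lines of a VirSorter output FASTA to `find_overlaps` returns one tuple per overlapping pair, so an empty list suggests that no contig carries duplicated predictions.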

You can also see that the two predictions treated the annotation-sparse region differently by looking at the "VIRSorter_global-phage-signal.csv" file (i.e., the "gene_149-gene_248" region is picked up only in the second, longer prediction).

The following is a subset of "VIRSorter_global-phage-signal.csv" (Perhaps better to copy&paste to excel for better visibility.)

## Contig_id,Nb genes contigs,Fragment,Nb genes,Category,Nb phage hallmark genes,Phage gene enrichment sig,Non-Caudovirales phage gene enrichment sig,Pfam depletion sig,Uncharacterized enrichment sig,Strand switch depletion sig,Short genes enrichment sig
VIRSorter_k251z78343,260,VIRSorter_k251z78343-gene_93-gene_229,137,2,,gene_130-gene_229:4.87224408777739,gene_93-gene_192:50.00000351237890,gene_93-gene_192:50.00000351237890,gene_93-gene_192:50.00000351237890,gene_93-gene_192:50.00000351237890,gene_93-gene_192:50.00000351237890
VIRSorter_k251z78343,260,VIRSorter_k251z78343-gene_93-gene_248,156,2,,gene_130-gene_229:6.86349848266885;gene_149-gene_248:2.97349330748608,gene_93-gene_192:50.00000351237890,gene_93-gene_192:50.00000351237890,gene_93-gene_192:50.00000351237890,gene_93-gene_192:50.00000351237890,gene_93-gene_192:50.00000351237890

In the "VIRSorter_k251z78343" rows/entries of "Metrics_files/VIRSorter_affi-contigs.tab", I noticed that there was a region with very few annotated genes (only 2 annotated genes) in between gene_229 and gene_241 which could be the source of difference in the prediction.

(A paragraph was deleted after the last edit to avoid confusion)

I am terribly sorry to have to ask you for your help again. VirSorter is a fantastic software and I would really appreciate it if you could assist me in solving this matter.

Thank you in advance.

Sincerely, genomics-pixel

P.S. If you need the entire output directory, please tell me.

simroux commented 4 years ago

Hi,

I agree that this looks like a bug: could you send a fasta file with these two contigs so I can try to reproduce it? I don't see anything obvious that would explain this right now, unfortunately.

Best, Simon

genomics-pixel commented 4 years ago

Dear Simon,

Thank you for taking the time to look into this problem.

I have sent you an e-mail with the link and password to download the entire output directory of VirSorter (1.0.5).

The command run to obtain this output was the following:
wrapper_phage_contigs_sorter_iPlant.pl -f [input contig fasta file] --db 2 --diamond -wdir [output directory] --ncpu 4

If you have not received my e-mail or if you have any questions, please let me know. Again, I deeply appreciate your help.

Sincerely, genomics-pixel

simroux commented 4 years ago

Hi ! There was a bug in 1.0.5 with some prophage predictions that your test case uncovered. I just pushed a new version of the script that should have solved it, so in theory if you pull this latest version and run it on the same file you should see:

Let me know if this new version didn't fix things on your side, or if you find anything else !

Best, Simon

genomics-pixel commented 4 years ago

Dear Simon,

Great to know that the bug is fixed!

I will have the system administrator install the updated version of VirSorter on our computer cluster and test it on my side as well.

Once the test is finished I will get back to you as soon as possible.

Many thanks for your kind help.

Sincerely, genomics-pixel

genomics-pixel commented 4 years ago

Dear Simon,

Sorry for the late reply. I had a high fever and could not work for a while.

I used your new version of VirSorter to test my data and the results looked good! (i.e. only one predicted viral region for k251z78343 and only an r_0 and r_1 folder in VirSorter output directory).

I will now try to use it on the rest of my assemblies, but I have one question: would it make sense to use VirSorter to analyze the assemblies from each sample individually, instead of analyzing a single fasta file which contains all assemblies?

I have more than one thousand assemblies from different samples, along with more than a hundred co-assemblies, and it's quite difficult to process them all at once.

Sincerely, genomics-pixel

simroux commented 4 years ago

Hi, It shouldn't matter much whether you run VirSorter on individual assemblies or on a pool of all contigs, so for simplicity (and speed), I would suggest processing them separately as you can run all of them in parallel.
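As a rough illustration of processing assemblies separately and in parallel, here is a minimal Python sketch. It is not part of VirSorter. The command-line flags are copied from the command quoted earlier in this thread; the one-working-directory-per-assembly layout and the names `build_command` and `run_all` are assumptions made for the example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(fasta, out_root, ncpu=4):
    """Build the VirSorter call for one assembly, mirroring the command quoted in this thread."""
    fasta = Path(fasta)
    # Hypothetical layout: one working directory per assembly, named after the input file.
    wdir = Path(out_root) / fasta.stem
    return [
        "wrapper_phage_contigs_sorter_iPlant.pl",
        "-f", str(fasta),
        "--db", "2",
        "--diamond",
        "-wdir", str(wdir),
        "--ncpu", str(ncpu),
    ]


def run_all(fastas, out_root, jobs=2, ncpu=4):
    """Run one VirSorter process per assembly, at most `jobs` at a time.

    Returns a dict mapping each input path to the exit code of its run.
    """
    def run_one(fasta):
        return subprocess.run(build_command(fasta, out_root, ncpu)).returncode

    # Threads only wait on the external processes, so a thread pool is sufficient here.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        codes = list(pool.map(run_one, fastas))
    return dict(zip(map(str, fastas), codes))
```

For example, `run_all(sorted(Path("assemblies").glob("*.fasta")), "virsorter_out", jobs=4)` would launch up to four runs at once, each with its own output directory. On a shared cluster, submitting one scheduler job per assembly is likely the better fit, and `jobs` times `ncpu` should not exceed the cores available.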

Best, Simon

genomics-pixel commented 4 years ago

Dear Simon,

Thank you for your helpful advice! I am glad that there is no need to pool the assemblies together.

Again many thanks!

Sincerely, genomics-pixel