jiarong / VirSorter2

customizable pipeline to identify viral sequences from (meta)genomic data
GNU General Public License v2.0
204 stars 28 forks source link

Inconsistency in prophage detection #143

Closed davidtong28 closed 1 year ago

davidtong28 commented 1 year ago

Hi Jiarong,

Thank you for creating this amazing tool. I have been using it on my dataset of Illumina-sequenced bacteria (56 very similar Campylobacter jejuni, isolated from the same location and time) to detect prophages. The commands I used were as follow: virsorter run -i $file -w ../prophage/$file --min-length 1500 -j 8 all Then I used mash and hierarcal clustering to merge similar prophages into groups. I found that while most of the groups consistently appeared in all my isolates, there were some of the groups does not appear to show in certain isolates, but rather scatters sporatically. Since I assumed that these bacteria should be almost identical, I was interested in what caused the inconsistency. I looked into this, and here is what I found. I did BLASTn using one of the predicted prophage sequences as reference, and all the assembled contigs as query. Every assembly seems to have a 99-100% hit to the prophage sequence. It turns out that when the prophage hits a long contig (usually the 6-7th contig), the tool does not predict that sequence as a virus. When the prophage hits a short contig (usually the 18-23rd contig), the tool recognizes that sequence as a virus. This is quite strange because in other groups, Virsorter seems to be able to consistently predict prophages that are integrated into the bacterial genome (usually located in contig 1-2), and label them as "partial". I was curious if there can be an explaination (I'm guessing maybe has to do with the circular sequence detection algorithm?) for this observation. Also I was wondering if there is anything I could do when setting up the command line criteria to prevent this. Wish you all the best!

Sincerely,

David

jiarong commented 1 year ago

Hi David, thanks for the feedback. Quite interesting case. 1) I do not quite get the meaning of 6-7th contigs. Is that the rank of contig lengths among all contigs or only among those aligned to query prophage sequence? 3) can you send me one example (fasta sequence) of longer prophage contig missed by VirSorter2 and one example of shorter prophage contig found by VirSorter2?

davidtong28 commented 1 year ago

Hi Jiarong,

Thanks for the fast reply. Sorry I forgot to mention. The contigs were rearranged by size, 1st contig being the biggest. The 7th contig is around 90000bp and the 21st contig is around 13000bp, basically the same length as the phage. In the attached files, the phage was detected by Virsorter2 from the 21st contig of Isolate A, while also having a blast hit to the 7th contig of Isolate B which was not detected by Virsorter2. IsolateA_contig00021.txt IsolateB_contig00007.txt phage_from_IsolateA.txt I have not yet figured out why the prophage seemingly have integrated into the chromosome in isolate B but showed as a single (possibly circular) contig in isolate A (it did not have another hit on the chromosome). I might have to look back to the raw reads to see if the two isolates are different. For assembly I used shovill https://github.com/tseemann/shovill.

Sincerely,

David

jiarong commented 1 year ago

Hi David, Sorry for the delay. I just took a look at the sequences. The phage hit is quite weak w/ a score of 0.68 and w/o a phage hallmark gene. This is likely a fragmented/degraded prophage. The current prophage setting in VirSorter2 is conservative, requiring at least a hallmark gene and a minimal of ~10kb, to avoid getting virus-like genomic islands from microbes, with assumption that most prophage should be completely flanked by host sequences. The shorter one gets in as phage coz it's not treated as prophage (ie. free virus, not sure it's due to captured at lytic state or just assembly issue), and thus not subject to the conservative prophage requirement. To include the long one, first the most updated development version is needed, and then you need to tune some setting as below:

virsorter config --set GROUP_INFO.dsDNAphage.MIN_GENOME_SIZE=3000       # the sliding window for prophage detection
virsorter config --set PROVIRUS_MIN_PEAK_PROBA=0.6        # the max probability from the prophage search, should be larger or equals to the score cutoff (--min-score)
virsorter config --set PROVIRUS_MIN_HALLMARK_CNT=0      # not requiring any hallmark genes.
davidtong28 commented 1 year ago

Hi Jiarong, thank you for your response. That was very helpful. I was wondering what arguments should I put if only interested in prophages?

jiarong commented 1 year ago

It depends on how fragmented your genome is. If your genome is quite fragmented, the default is recommended. If you genome is nearly assembled in one sequence, you can only screen for those w/ suffix "partial" in the output. The tuning I suggested the previous post is only tuned for that weak hit, not applicable to general cases.