refresh-bio / PHIST

Phage-Host Interaction Search Tool
GNU General Public License v3.0

a question of k-mer based prediction #7

Closed Changhai996 closed 1 year ago

Changhai996 commented 2 years ago

Dear agudys: Thanks for your great work on this tool. While doing host prediction with PHIST, I ran into a question: I used VirSorter2 to find putative viral sequences in MAGs. Should I cut the viral regions out of the MAGs before running the prediction? As far as I know, some binning tools are also based on k-mer frequencies. Is that the same principle PHIST uses? If so, it sounds like there is no need to predict hosts for viral contigs that are already in MAGs, although there could still be contamination in the MAGs. I hope you can offer some suggestions. Thanks a lot!

aziele commented 2 years ago

Hi @Changhai996,

Thanks for your interest in PHIST. The tool predicts hosts based on the number of k-mers shared between phage and bacterial genomes. It ignores both k-mer frequencies and the k-mers that differ between phage and host. Therefore, if you work on viral metagenome-assembled genomes (vMAGs), in theory you don't need to cut the viral contig out of the vMAG. If the contig contains k-mers that are also present in the host, PHIST will find them whether you supply the contig alone or the whole vMAG sequence. On the other hand, if your vMAGs are contaminated with bacterial sequences, PHIST will most likely predict the bacteria from which the contaminants originate as the hosts.
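To make the point above concrete, here is a minimal sketch of presence-only shared k-mer counting. It is not PHIST's implementation: the function names, the toy sequences, and the tiny `K = 5` are all assumptions chosen for illustration, and the sketch ignores reverse complements.

```python
def kmer_set(contigs, k):
    """Collect the distinct k-mers of all contigs (presence only, no counts).

    k-mers are taken per contig, so none spans a contig boundary.
    """
    found = set()
    for seq in contigs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            found.add(seq[i:i + k])
    return found


def shared_kmers(query_contigs, host_contigs, k):
    """Number of distinct k-mers present in both the query and the host."""
    return len(kmer_set(query_contigs, k) & kmer_set(host_contigs, k))


# Toy data, made up for illustration only.
K = 5
viral_contig = "ATGCGTACGTTAGC"
unrelated_contig = "TTTTTTTTTTTT"
host_genome = ["CCCCCATGCGTACCCCC"]      # shares a short stretch with the virus
other_bacterium = ["GGATCCGGATTACAGG"]   # shares nothing with the virus
contaminant = "ATCCGGATTACA"             # fragment of other_bacterium

# The viral contig alone and the whole bin give the same match to the host.
contig_only = shared_kmers([viral_contig], host_genome, K)
whole_bin = shared_kmers([viral_contig, unrelated_contig], host_genome, K)

# A contaminated bin, however, gains matches to the contaminant's source.
clean_vs_other = shared_kmers([viral_contig], other_bacterium, K)
contaminated_vs_other = shared_kmers([viral_contig, contaminant], other_bacterium, K)

print(contig_only, whole_bin, clean_vs_other, contaminated_vs_other)
```

Extra contigs in a bin cannot remove matches that the viral contig already has, but a contaminant contig can add matches to an unrelated genome and pull the prediction toward it.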

Changhai996 commented 2 years ago

Hi aziele: Thanks for your kind reply. I'm sorry I didn't make it clear that I used VirSorter2 to find putative viral sequences in archaeal MAGs, not vMAGs. I have tried predicting hosts with PHIST both with and without cutting the viral contigs out of the archaeal MAGs. Without cutting, some results showed k-mers matching only the viral contigs within the MAGs. With the viral contigs cut out, the predicted host was not the MAG harboring the viral contig but another MAG, from the same or a different phylum or order. That puzzles me. Should I conclude that if a viral contig's k-mers match other contigs in the same MAG, then that MAG is its host, and otherwise the viral contig is a contaminant in the MAG?

aziele commented 2 years ago

Hi @Changhai996,

Could you provide a few more details on the sequences you pass to PHIST in the <virus_dir> and <host_dir> directories? In PHIST, as in other host prediction tools, it is crucial to keep viral sequences separate from the sequences of potential hosts. Perhaps it would be a good idea to extract only the viral sequences from your MAGs, save these virus-only sequences in <virus_dir>, and then use all archaeal genomes available in RefSeq/GenBank as <host_dir>?
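The extraction step suggested above could look roughly like the sketch below. It assumes that each viral sequence goes into its own FASTA file in the virus directory and that viral regions are given as `(contig_id, start, end)` tuples with 1-based inclusive coordinates. That tuple format, the file naming, and the helper names are hypothetical and do not correspond to any tool's actual output.

```python
from pathlib import Path
import tempfile


def read_fasta(path):
    """Parse a FASTA file into {contig_id: sequence}."""
    records = {}
    name = None
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # ID is the first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def write_viral_regions(mag_fasta, regions, virus_dir):
    """Write each viral region to its own FASTA file and return the paths.

    regions: iterable of (contig_id, start, end), 1-based inclusive
    (a hypothetical format used only for this sketch).
    """
    contigs = read_fasta(mag_fasta)
    out_dir = Path(virus_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for contig_id, start, end in regions:
        seq = contigs[contig_id][start - 1:end]
        name = f"{contig_id}_{start}_{end}"
        out_path = out_dir / f"{name}.fasta"
        out_path.write_text(f">{name}\n{seq}\n")
        written.append(out_path)
    return written


# Toy usage in a temporary directory.
workdir = Path(tempfile.mkdtemp())
mag = workdir / "mag_001.fasta"
mag.write_text(">contig_1 toy\nAAAACCCCGGGGTTTT\n>contig_2\nACGTACGT\n")
files = write_viral_regions(mag, [("contig_1", 5, 12)], workdir / "virus_dir")
print(files[0].read_text())
```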

Changhai996 commented 2 years ago

Hi aziele: Sure, I have uploaded my viral sequences and the MAGs that the viral contigs come from. The virus-host prediction was run against 3685 genomes, including 367 MAGs (in the host folder), 2,143 RefSeq archaeal genomes, and 1,077 representative bacterial genomes (NCBI IDs are provided in the PHIST_other_host_info file). Here is the link: https://drive.google.com/drive/folders/1_dP8whU2dewfzJ9jnR_N3xETnmmK7YEv?usp=sharing

Changhai996 commented 2 years ago

Yes, I agree that the viral contigs should be separated from the sequences of potential hosts. In my case, though, the viral sequences come from the archaeal MAGs, which means they may share the same oligonucleotide composition (contigs were binned into the same MAG according to oligonucleotide frequency). So in some situations the binning algorithm itself would already indicate that the MAGs are their hosts. And if I run the virus-host prediction without removing the viral sequences from the MAGs in which they were detected, the result will show those archaeal MAGs as the hosts, which is what the PHIST result does show. That is why I would like to know whether I should remove the viral contigs from the archaeal MAGs 😂, or whether I don't need this method to predict the host at all because the binning step has already done it.
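For comparison, the composition signal mentioned here can be sketched as a tetranucleotide frequency profile plus a similarity score. This is only a toy illustration of the idea, not the algorithm of any particular binning tool (real binners typically combine composition with other signals such as coverage), and the sequences and function names are made up.

```python
from collections import Counter
from itertools import product
from math import sqrt

# All 256 possible tetranucleotides in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_freq(seq):
    """Relative frequency of each tetranucleotide in seq (forward strand only)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]


def cosine_similarity(a, b):
    """Cosine similarity of two frequency vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy contigs: two AT-rich sequences and one GC-rich sequence.
at_rich_1 = "ATTATAATTTAAATATTAAT"
at_rich_2 = "TAATATTTAATTAAATATTA"
gc_rich = "GCCGGCGGCCCGCGGGCCGC"

same_style = cosine_similarity(tetra_freq(at_rich_1), tetra_freq(at_rich_2))
diff_style = cosine_similarity(tetra_freq(at_rich_1), tetra_freq(gc_rich))
print(same_style > diff_style)
```

Unlike exact k-mer matching, this score depends on how often each short word occurs, so two contigs can look alike in composition without sharing any long exact match.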

aziele commented 2 years ago

Hi @Changhai996 Thank you for clarifying this; I understand now. I think the binning step itself indicates that these archaeal MAGs (containing viral contigs) represent the actual hosts of those viruses, assuming of course that the binning is correct (by the way, metaSPAdes is great). In this case, host prediction wouldn't tell you anything new, and I would assume that the MAGs harboring viral contigs represent the hosts. If the binning is correct, these viral contigs might actually be proviruses integrated into archaeal genomes, which is extremely interesting because, as far as I know, only a few such cases are known in Archaea. It would be interesting to see whether these viral contigs carry genes indicative of a temperate lifestyle (e.g., integrase, recombinase, transposase genes). You could also run PHIST for these viral contigs using only Archaea/Bacteria genomes from RefSeq and GenBank (without the MAGs) to check whether PHIST's host predictions support the hosts you inferred from the MAGs.
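As a rough first pass at the temperate-lifestyle check mentioned above, one could scan gene product annotations for the three keywords. The sketch below assumes the annotations are already available as a mapping from contig ID to product descriptions; that input format, the contig names, and the example products are hypothetical. A keyword hit is only a hint that needs closer inspection.

```python
TEMPERATE_KEYWORDS = ("integrase", "recombinase", "transposase")


def temperate_markers(annotations):
    """Return {contig_id: sorted list of matched keywords} for contigs with hits.

    annotations: dict mapping contig ID -> list of gene product descriptions
    (a hypothetical input format used only for this sketch).
    """
    hits = {}
    for contig_id, products in annotations.items():
        found = sorted({
            keyword
            for product in products
            for keyword in TEMPERATE_KEYWORDS
            if keyword in product.lower()
        })
        if found:
            hits[contig_id] = found
    return hits


# Toy annotations, made up for illustration.
toy_annotations = {
    "viral_contig_1": ["hypothetical protein", "Site-specific integrase", "major capsid protein"],
    "viral_contig_2": ["terminase large subunit", "portal protein"],
}
marker_hits = temperate_markers(toy_annotations)
print(marker_hits)
```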

Changhai996 commented 2 years ago

Hi aziele: Yes, thanks again for your kind reply. I will try running it without the archaeal MAGs to verify whether these viral sequences are contamination!