jessieren / VirFinder

VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data

Training with new viral genomes #8

Closed Starfishgames closed 5 years ago

Starfishgames commented 6 years ago

Dear all, I'm very interested in this tool. I'm actually trying to understand whether it's possible to expand the number of viral genomes to produce another training dataset by adding metagenomically-identified contigs representing putatively complete phage genomes from environmental datasets (e.g. Pacific Ocean Virome, Tara Oceans). Is the host gene sequence mandatory for creating the new model? Best regards

jessieren commented 6 years ago

Hi Starfishgames,

Thank you very much for your interest in VirFinder.

Yes, to train the classifier, both viruses (positive examples) and bacterial hosts (negative examples) are needed. VirFinder has a function for training models on a user-supplied database.

Hope that helps :)

Best wishes, Jie

Starfishgames commented 6 years ago

Dear Jie, Thanks for your reply. So, if I understand correctly, the bacterial sequences used for training do not need to be the "real" hosts of the viruses I'm using for training. I might use, for instance, metagenomically-identified viral genomes as positive examples and prokaryotic genomes from the proGenomes database as negative examples. Am I right? Thanks in advance

jessieren commented 6 years ago

You are welcome!

Yes, you are right. The positive examples are viruses and the negative examples are non-viruses, such as prokaryotic genomes. In my paper, I used phages and prokaryotic genomes as my positive and negative examples. The hosts of the phages are prokaryotes, but the phages and hosts do not need to correspond to each other in the training dataset. All we need is a set of positive examples and a set of negative examples.

Best wishes, Jie

-- Jie Jessie Ren Postdoc in Computational Biology and Bioinformatics University of Southern California 1050 Childs Way, RRI 201

ghost commented 6 years ago

Can I train with a eukaryotic host genome, such as human?

srosales712 commented 6 years ago

I'd also like to know if you can train the model with a eukaryotic host genome.

jessieren commented 6 years ago

Thank you for the interest in using VirFinder! (Sorry for my late reply)

It is possible to train VirFinder using viruses together with eukaryotic host genomes, but it is not known how well the resulting model will predict. Because the human genome is huge compared with the total amount of viral sequence, I suggest downsampling the human genome first. The program uses an equal amount of DNA (in bp) for viral sequences and host sequences.
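
For readers who want to try the downsampling step suggested above, here is a minimal Python sketch of one way to do it before training. It is not part of VirFinder and not necessarily how the author would do it: the function names (`read_fasta`, `total_bp`, `downsample_host`), the non-overlapping window scheme, and the fragment length are all assumptions made for illustration. It simply picks random host fragments until their total length roughly matches the total viral bp.

```python
import random


def read_fasta(lines):
    """Parse FASTA-formatted lines into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def total_bp(records):
    """Total number of bases across all records."""
    return sum(len(seq) for _, seq in records)


def downsample_host(host_records, target_bp, frag_len=1000, seed=0):
    """Pick random non-overlapping host fragments until about target_bp is reached.

    Hypothetical helper: frag_len and the windowing scheme are arbitrary
    choices for this sketch. Headers are assumed to be unique.
    """
    # Enumerate every full-length, non-overlapping window in every host sequence.
    windows = [
        (name, start)
        for name, seq in host_records
        for start in range(0, len(seq) - frag_len + 1, frag_len)
    ]
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    rng.shuffle(windows)

    seqs = dict(host_records)
    picked, picked_bp = [], 0
    for name, start in windows:
        if picked_bp >= target_bp:
            break
        end = start + frag_len
        picked.append((f"{name}_{start}_{end}", seqs[name][start:end]))
        picked_bp += frag_len
    return picked


# Toy data standing in for real viral and host FASTA files.
viral = read_fasta([">phage1", "ACGT" * 50, ">phage2", "GGCC" * 25])  # 300 bp
host = [("chr_toy", "ATGC" * 2500)]  # 10,000 bp

subsample = downsample_host(host, target_bp=total_bp(viral), frag_len=100)
print(total_bp(viral), total_bp(subsample))  # 300 300
```

In practice the viral and host records would be read from FASTA files (for example with `read_fasta(open(path))`), and the downsampled fragments written back out as FASTA to serve as the negative training set. For a genome as large as the human one, enumerating every window costs memory, so sampling window start positions directly may be preferable.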