viralInformatics / VIGA

Eukaryotic viruses from contigs #7

Open SergeyBaikal opened 3 weeks ago

SergeyBaikal commented 3 weeks ago

Dear authors! Thank you for the program. Is it possible to identify eukaryotic viruses directly from contigs, bypassing the assembly step?

viralInformatics commented 2 weeks ago

Thank you for your question. If you already have contigs, such as the file Yourcontig.fasta, you can directly run the following commands to obtain eukaryotic viruses:

diamond blastx against virus protein database:

diamond blastx -q Yourcontig.fasta --db Diamond_VirusProtein_db -o Diamond_out.vp.txt -e 0.00001 --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore slen stitle salltitles qcovhsp nident staxids
cat Diamond_out.vp.txt | awk '{print $1}' | sort | uniq > Diamond_out.vp1nd
cat Yourcontig.fasta | seqkit grep -f Diamond_out.vp1nd > Diamond_out.fa
rm Diamond_out.vp1nd

diamond blastx against nr database:

diamond blastx -q Diamond_out.fa --db Diamond_nr_db -o Diamond_out.nr.txt -e 0.00001 --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore slen stitle salltitles qcovhsp nident staxids
cat Diamond_out.nr.txt | sort -k1,1 -k12,12gr -k11,11g -k3,3gr | sort -u -k1,1 --merge > Diamond_out.besthit.txt
python softdir/filter.py Diamond_out.besthit.txt Diamond_out.virussure.txt Classify_out name softdir/.. outdir

Diamond_out.virussure.txt contains the viral hits, and Classify_out holds the classification results at the species, genus, and family levels.
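For readers following along, the best-hit step above keeps one row per contig: highest bit score first, then lowest e-value, then highest identity. The sketch below reproduces that ranking on a made-up four-row table. The file names toy.nr.txt and toy.besthit.txt are hypothetical, and the final `awk '!seen[$1]++'` is swapped in here as an alternative to `sort -u -k1,1 --merge` for keeping the first row of each contig; it is not the command used in the thread.

```shell
# Toy 12-column table in DIAMOND tabular layout (tab-separated, made-up values).
# Column 1 = contig, column 3 = % identity, column 11 = e-value, column 12 = bit score.
printf 'c1\tprotA\t80.0\t100\t20\t0\t1\t300\t1\t100\t1e-20\t150\n'  > toy.nr.txt
printf 'c1\tprotB\t95.0\t100\t5\t0\t1\t300\t1\t100\t1e-40\t210\n'  >> toy.nr.txt
printf 'c2\tprotC\t70.0\t90\t27\t0\t1\t270\t1\t90\t1e-10\t90\n'    >> toy.nr.txt
printf 'c2\tprotD\t72.0\t90\t25\t0\t1\t270\t1\t90\t1e-12\t90\n'    >> toy.nr.txt

# Rank rows within each contig (bit score desc, e-value asc, identity desc),
# then keep only the first (best-ranked) row seen for each contig.
sort -t "$(printf '\t')" -k1,1 -k12,12gr -k11,11g -k3,3gr toy.nr.txt \
  | awk -F'\t' '!seen[$1]++' > toy.besthit.txt

# Show contig and chosen subject.
cut -f1,2 toy.besthit.txt
```

On this toy table the step keeps protB for c1 (higher bit score) and protD for c2 (bit scores tie, so the lower e-value wins).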

SergeyBaikal commented 2 weeks ago

Thank you, this is very useful! I just tried it, but I mostly see phages, not eukaryotic viruses. I attach a few lines from an example Diamond_out.virussure.txt file for review.

I understood why I see phages in my output. The file final_out.fa contains viruses that not only infect eukaryotes, but also occur in them. Is that right? So, I have to create the eukaryote virus files myself, without phages (final_out.fa, genus_len, taxid_ref_3, virustaxid).

I also see cyanophages in the results, for example Synechococcus phage S-CBS4, and I don't understand how they could have ended up there. Maybe I'm doing something wrong; please advise.

Sorry, but I would recommend adding a flag that extracts only eukaryotic viruses.

viralInformatics commented 2 weeks ago

We sincerely apologize for the inconvenience. The Diamond_out.virussure.txt file holds the alignment results for all viruses, both phages and eukaryotic viruses, and the file you provided does not contain any eukaryotic viruses. The first column of the taxid_ref_3 file in the "db" folder contains the taxids of the eukaryotic viruses we have curated. This dataset does not include all viral genomes, because it has been filtered to remove some redundant data. The final selection of eukaryotic virus species (AAI ≥ 90%; coverage ≥ 80%) and genera (AAI ≥ 70%; coverage ≥ 60%) is stored in the Classify_out folder. If you have a more refined and specific file, you can replace the taxid_ref_3 file and the final_out.fa file with it, using the same format.
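As an illustration only, the species and genus cutoffs quoted above can be applied to a table of per-contig values like this. The three-column layout (contig, AAI, coverage) and the file names toy.aai.txt and toy.levels.txt are assumptions made for this sketch; the thread does not say how filter.py computes or stores these values, and the family-level rule is not stated, so contigs below the genus cutoffs are simply labeled "unassigned".

```shell
# Toy table (made-up values): contig, AAI in %, coverage in %.
printf 'c1\t95.2\t88.0\nc2\t75.0\t65.0\nc3\t92.0\t70.0\nc4\t50.0\t90.0\n' > toy.aai.txt

# Apply the thresholds quoted in the thread:
#   species: AAI >= 90 and coverage >= 80
#   genus:   AAI >= 70 and coverage >= 60
awk -F'\t' -v OFS='\t' '{
  level = "unassigned"
  if ($2 >= 90 && $3 >= 80)      level = "species"
  else if ($2 >= 70 && $3 >= 60) level = "genus"
  print $1, level
}' toy.aai.txt > toy.levels.txt

cat toy.levels.txt
```

Note that c3 has a high AAI but falls to genus level because its coverage is under 80%.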

We encountered a similar situation when building the eukaryotic virus database. We included virus families that infect protists (considered the most primitive known eukaryotes), such as the Lavidaviridae family (https://viralzone.expasy.org/7839). However, these hosts can also be infected by certain phages, which might help clarify your question. I also checked and did not find Kyanoviridae, the family that includes Synechococcus phage S-CBS4, in taxid_ref_3. I’m not sure where I might have misunderstood.
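One possible way to narrow the best-hit table to the curated eukaryotic viruses is to keep only rows whose staxids value matches a taxid in the first column of taxid_ref_3. This is a rough sketch, not part of VIGA. It assumes that taxid_ref_3 is tab-separated, that staxids is column 18 (as in the --outfmt list above), and that multiple taxids in that column are separated by ';'. The toy file names and the taxids 11111, 22222, and 99999 are made up.

```shell
# Toy stand-in for db/taxid_ref_3: only the first column (taxid) is used here.
printf '11111\tToyviridae\n22222\tExampleviridae\n' > toy_taxid_ref.txt

# Toy stand-in for Diamond_out.besthit.txt with the 18 columns requested in the thread.
row() { printf '%s\tprot\t90\t100\t10\t0\t1\t300\t1\t100\t1e-30\t200\t100\ttitle\ttitles\t85\t90\t%s\n' "$1" "$2"; }
{ row c1 11111; row c2 99999; row c3 '99999;22222'; } > toy.besthit18.txt

# First file: remember the curated taxids.
# Second file: print a row as soon as any of its staxids is in the curated set.
awk -F'\t' '
  NR == FNR { keep[$1] = 1; next }
  { n = split($18, ids, ";"); for (i = 1; i <= n; i++) if (ids[i] in keep) { print; next } }
' toy_taxid_ref.txt toy.besthit18.txt > toy.euk_only.txt

cut -f1 toy.euk_only.txt
```

With real data, the two toy inputs would be replaced by db/taxid_ref_3 and Diamond_out.besthit.txt, after checking that the column positions and delimiters match these assumptions.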

SergeyBaikal commented 2 weeks ago

I see. You had written that the file Diamond_out.virussure.txt contains eukaryotic viruses, and that is what confused me. Thanks for the answers!

SergeyBaikal commented 2 weeks ago

Dear authors! Working with the program, I found several taxids in the file taxid_ref_3 that belong to phages: 1589751, 1273752, 1262528, 2878009.

There are also taxids belonging to virophages: 543939, 1411887. Although, as you wrote, you kept the virophages in the file.
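If someone wants a copy of taxid_ref_3 without these entries, a minimal sketch could look like the following. It assumes the file is tab-separated with the taxid in the first column; the toy file contents and names (toy_ref.txt, exclude_taxids.txt, toy_ref.filtered.txt) are hypothetical. Whether to drop the virophage taxids is a judgment call, since the maintainers kept them deliberately, and other files in the db folder (such as final_out.fa) may need matching changes that are not covered here.

```shell
# Toy stand-in for db/taxid_ref_3 (first column = taxid); 11111 and 22222 are made-up entries.
printf '11111\tToyviridae\n1589751\tphage_entry\n22222\tExampleviridae\n543939\tvirophage_entry\n' > toy_ref.txt

# Taxids reported in this thread (four phages, then two virophages), one per line.
printf '%s\n' 1589751 1273752 1262528 2878009 543939 1411887 > exclude_taxids.txt

# First file: taxids to drop. Second file: print only rows whose taxid is not in that set.
awk -F'\t' 'NR == FNR { drop[$1] = 1; next } !($1 in drop)' \
  exclude_taxids.txt toy_ref.txt > toy_ref.filtered.txt

cut -f1 toy_ref.filtered.txt
```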