mtisza1 / Cenote-Taker2

Cenote-Taker2: Discover and Annotate Divergent Viral Contigs (Please use Cenote-Taker 3 instead)
MIT License
56 stars 7 forks source link

Taxonomic rank information of blastx.out #19

Closed neptuneyt closed 1 year ago

neptuneyt commented 2 years ago

Dear CT2 team, I get the contig taxonomic classification from *blastx.out, for example:

contig1 Viruses; Duplodnaviria; Heunggongvirae; Uroviricota; Caudoviricetes; Caudovirales; Myoviridae; Brunovirus; Salmonella virus SEN34 (9 levels) contig2 Viruses; Duplodnaviria; Heunggongvirae; Uroviricota; Caudoviricetes; Caudovirales; Lilyvirus; Paenibacillus virus Lily (8 levels) contig3 Viruses; Riboviria; Orthornavirae; Negarnaviricota; Polyploviricotina; Insthoviricetes; Articulavirales; Orthomyxoviridae; Alphainfluenzavirus; Influenza A virus; H5N1 subtype (11 levels)

I want split it to different taxonomic ranks (eg, realm, kingdom, plylum, class, ...) by ;, but I find each classification reult seems not identity with the same rank level. How colud I fix the problem. Looking forward your reply. Thank a lot.

mtisza1 commented 2 years ago

Hi,

I'm about to reply to your other taxonomy issue as well #18 . You may want to use the API key that I discuss in that comment to perform the queries below.

Basically, NCBI doesn't output the same taxonomic ranks for all sequences. (Taxonomy for viruses is hard, and assigning these ranks is not straight forward). It has been too difficult for me to work with the different taxonomic ranks for Cenote-Taker 2, so that's why the BLASTX-based taxonomy only predicts a taxonomic rank at the "family" level. I guess you have additional purposes for this information, so I've worked out a solution for you. Keep in mind that these ranks are NOT necessarily for your query sequence. Instead they are for the RefSeq sequence with the best hit from the BLASTX query. To run this, change to the output directory of your run of interest (the directory with the "*CONTIG_SUMMARY.tsv" file). Optionally, activate your API key to allow more queries. Then:

$ conda activate cenote-taker2_env
$ find -type f -name "*blastx.out" | while read BLASTX ; do ktClassifyBLAST -o ${BLASTX%.out}.tab ${BLASTX} ; taxid=$( tail -n1 ${BLASTX%.out}.tab | cut -f2 ) ; (cat ${BLASTX} ; echo "@@@@@@@@@@@" ; efetch -db taxonomy -id $taxid -format xml | xtract -pattern Taxon -block "*/Taxon" -tab "\n" -element TaxId,ScientificName,Rank) > ${BLASTX%.out}.tax_ranks.txt ; done

To list the output files: $ find -type f -name "*tax_ranks.txt"

This will create a "tax_ranks.txt" file that has the info from the blastx.out file on the first 2 lines, a separator line with @@@@@@@@@@@, then a tab-separated table where each line is a rank from the lineage and has the TaxID, Scientific Name, and Rank in the respective columns.

$ cat ./no_end_contigs_with_viral_domain/test_xtr_ct2.tax_guide.blastx.tax_ranks.txt
test_xtr_ct2_3  gi|13449259|ref|NP_085473.1| replicase [Acinetobacter phage AP205]  33.333  2.73e-56    459
Viruses; Riboviria; Orthornavirae; Lenarviricota; Leviviricetes; Norzivirales; Duinviridae; Apeevirus; Apeevirus quebecense
@@@@@@@@@@@
10239   Viruses superkingdom
2559587 Riboviria   clade
2732396 Orthornavirae   kingdom
2732407 Lenarviricota   phylum
2842243 Leviviricetes   class
2842247 Norzivirales    order
2842318 Duinviridae family
2842379 Apeevirus   genus
2843696 Apeevirus quebecense    species

You can change this one-liner to your liking to get a more suitable output file, but I think this will at least get you started with the information you wanted.

I hope this gets you the information you want. For me to make a more sophisticated taxonomy module, I'd need to commit a lot of time and energy. It's something I would do for "Cenote-Taker 3" but I don't have any timeline for doing it.

Best,

Mike

neptuneyt commented 2 years ago

Thank you for your lightning reply. I'll try with your method.