mtisza1 / Cenote-Taker2

Cenote-Taker2: Discover and Annotate Divergent Viral Contigs (Please use Cenote-Taker 3 instead)
MIT License
56 stars 7 forks source link

Find taxonomy via best ORF #10

Closed Roli-Wilhelm closed 2 years ago

Roli-Wilhelm commented 3 years ago

Hi,

I am struggling to find where the taxonomic classification is output. The CONTIG_SUMMARY has an ORGANISM_NAME, but no taxonomy. Is one meant to trace the ORGANISM_NAME to the database to find the taxonomy? Sorry if I have missed something obvious. I am essentially looking for the output from the step where BLASTp is used to query the best ORF against RefSeq.

Command: run_cenote-taker2.py -c contig.file.fasta -r contig.names -p False -t 70 -m 200 -am True -db rna_virus -hh hhsearch

Thanks in advance!

mtisza1 commented 3 years ago

Hi Roli,

Thanks for opening the issue. I think I understand your question.

Let's say you have a putative virus contig where CENOTE_NAME = mycontig1 and END_FEATURE = None. The BLASTP and taxonomy info will be in the file: no_end_contigs_with_viral_domain/mycontig1.tax_guide.blastx.out. With a few special exceptions this file will have a line with the BLASTP information of the best hit in RefSeq, and the second line will be the hierarchical taxonomy of that hit. (If the END_FEATURE = DTR, the file will be DTR_contigs_with_viral_domain/mycontig1.tax_guide.blastx.out).

I see how this can be a little confusing as this file name has "blastx" in it, but I basically just have the script overwrite the "provisional taxonomy" blastx search file that was created earlier. Some putative virus contigs don't encode genes that are "useful" for taxonomy, e.g. terminase, major capsid protein, RdRp. These use taxonomical information from the "provisional taxonomy" blastx search.

More broadly, virus taxonomy is a really big challenge, and while Cenote-Taker 2 does a pretty good job, it's not perfect or particularly sophisticated. By default, only family-level taxonomy is "guessed". However, if you use the BLASTN settings with GenBank nt database (e.g. --known_strains blast_knowns --blastn_db /path/to/nt), you can get species level taxonomy for contigs that are closely related to viruses deposited in Genbank.

Let me know if this was helpful.

Mike

Roli-Wilhelm commented 3 years ago

Hi Mike,

Thanks for the quick and helpful response. I was able to find some taxonomic information in the files you referenced. I will also give the BLASTN method a try. Out of curiosity, where does the program output the family level classifications? Should I be seeing that information in the 'CONTIG_SUMMARY' file, or must one consult the 'tax_guide' files?

I tested three contigs from known virus and the results were encouraging, but hard to parse. Here's a quick summary of what I found.

I found that:

(1) All had "END_FEATURE = None. (2) Two of the three had zero HALLMARK genes detected. Both of these had accurate BLASTx-based annotations in the "tax_guide.blastx.out" to the species level. However, I had to take the NCBI accession in the tax_guide file and manually query NCBI to get the taxonomy. (3) The third virus had two hallmark genes, but the "*tax_guide.blastx.out" reported simply 'unclassified virus' even though the HALLMARK_NAMES information correctly identified it as a "Picornavirus-capsid-protein-domain_like".

Am I going about this correctly? Is there a more expedient way to parse the putative classifications?

Thanks in advance for your assistance, Roli


From: Mike Tisza @.> Sent: Tuesday, June 8, 2021 10:12 AM To: mtisza1/Cenote-Taker2 @.> Cc: Roland Conrad Wilhelm @.>; Author @.> Subject: Re: [mtisza1/Cenote-Taker2] Find taxonomy via best ORF (#10)

Hi Roli,

Thanks for opening the issue. I think I understand your question.

Let's say you have a putative virus contig where CENOTE_NAME = mycontig1 and END_FEATURE = None. The BLASTP and taxonomy info will be in the file: no_end_contigs_with_viral_domain/mycontig1.tax_guide.blastx.out. With a few special exceptions this file will have a line with the BLASTP information of the best hit in RefSeq, and the second line will be the hierarchical taxonomy of that hit. (If the END_FEATURE = DTR, the file will be DTR_contigs_with_viral_domain/mycontig1.tax_guide.blastx.out).

I see how this can be a little confusing as this file name has "blastx" in it, but I basically just have the script overwrite the "provisional taxonomy" blastx search file that was created earlier. Some putative virus contigs don't encode genes that are "useful" for taxonomy, e.g. terminase, major capsid protein, RdRp. These use taxonomical information from the "provisional taxonomy" blastx search.

More broadly, virus taxonomy is a really big challenge, and while Cenote-Taker 2 does a pretty good job, it's not perfect or particularly sophisticated. By default, only family-level taxonomy is "guessed". However, if you use the BLASTN settings with GenBank nt database (e.g. --known_strains blast_knowns --blastn_db /path/to/nt), you can get species level taxonomy for contigs that are closely related to viruses deposited in Genbank.

Let me know if this was helpful.

Mike

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/mtisza1/Cenote-Taker2/issues/10#issuecomment-856803867, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABKKGSVKLV5NK75K63D7TPLTRYQL7ANCNFSM46ILQNMA.

mtisza1 commented 3 years ago

Roli,

Thanks for the follow up. This seems concerning. I could answer your questions better if you could compress and send me your output directory (and a log file of the run if you made one). Is this possible?

Sending to my email michael.tisza@gmail.com would be easy, or you could include a link to a file upload. I can take a look and make sure I can answer your questions.

Mike

mtisza1 commented 3 years ago

Hi Roli,

I got your email in which you sent me your output files. There are 2 things going on here.

1) Your efetch tool is not working. Since you didn't have a log file for me, I'm not totally sure the nature of the error. Efetch pings the NCBI server to pull the taxonomy information of a blast hit, and this info was not present in your "tax_guide.blastx.out" files. Maybe you know the reason for this? Internet issues?

2) Indeed, my RNA virus hallmark models failed on 2 of the 3 RdRps! I already have an update to these models on my to-do list. In short, my initial screening of putative hallmark gene HMMs was culling a lot of polyproteins (which means most RdRps) because the HMM of the whole polypeptide hits a lot of non-virus genes. I tried to go back and extract just the RdRp region for many of these, but it looks like I missed some. I am aiming to have an updated version of the HMMs available next week. Sorry about that.

I'll let you know when the new HMMs are live.

Mike

Roli-Wilhelm commented 3 years ago

Thanks Mike!

(1) efetch runs in the cenote-taker2 conda environment, could you provide me an example of the command that is failing?

(2) Great!

Roli


From: Mike Tisza @.> Sent: Friday, June 11, 2021 2:56 PM To: mtisza1/Cenote-Taker2 @.> Cc: Roland Conrad Wilhelm @.>; Author @.> Subject: Re: [mtisza1/Cenote-Taker2] Find taxonomy via best ORF (#10)

Hi Roli,

I got your email in which you sent me your output files. There are 2 things going on here.

  1. Your efetch tool is not working. Since you didn't have a log file for me, I'm not totally sure the nature of the error. Efetch pings the NCBI server to pull the taxonomy information of a blast hit, and this info was not present in your "tax_guide.blastx.out" files. Maybe you know the reason for this? Internet issues?

  2. Indeed, my RNA virus hallmark models failed on 2 of the 3 RdRps! I already have an update to these models on my to-do list. In short, my initial screening of putative hallmark gene HMMs was culling a lot of polyproteins (which means most RdRps) because the HMM of the whole polypeptide hits a lot of non-virus genes. I tried to go back and extract just the RdRp region for many of these, but it looks like I missed some. I am aiming to have an updated version of the HMMs available next week. Sorry about that.

I'll let you know when the new HMMs are live.

Mike

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/mtisza1/Cenote-Taker2/issues/10#issuecomment-859777797, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABKKGSR7ZVJVP6FZBDQVBKDTSJL4DANCNFSM46ILQNMA.

mtisza1 commented 3 years ago

OK Roli,

Thanks for checking efetch. My next hypothesis is that krona tools are not working, or your krona taxonomy database did not set up correctly.

Can you try to run this command in the directory testers/no_end_contigs_with_viral_domain/:

conda activate cenote-taker2_env
ktClassifyBLAST -o testers2.tax_guide.blastx.tab testers2.tax_guide.blastx.out
cat testers2.tax_guide.blastx.tab

should produce this:

#queryID    taxID   Avg. log e-value
testers2    1811230 -450

If this is throwing some error about the taxonomy database, try this series of commands with a job with 4 or more CPUs (it will take about 20-40 minutes, if I recall):

KRONA_DIR=$( which python | sed 's/bin\/python/opt\/krona/g' )
cd ${KRONA_DIR}
sh updateTaxonomy.sh
cd ${KRONA_DIR}
sh updateAccessions.sh
Roli-Wilhelm commented 3 years ago

Hi Mike,

That fixed it! I did receive an error related to the taxonomy database. After updating, I my initial command produced output that correctly annotated the contigs that did not contain hallmark. The third test sequence contained an RdRp and so I expect this may work after you've update the HMMs.

Thank you very much for your assistance! R


From: Mike Tisza @.> Sent: Monday, June 14, 2021 10:08 AM To: mtisza1/Cenote-Taker2 @.> Cc: Roland Conrad Wilhelm @.>; Author @.> Subject: Re: [mtisza1/Cenote-Taker2] Find taxonomy via best ORF (#10)

OK Roli,

Thanks for checking efetch. My next hypothesis is that krona tools are not working, or your krona taxonomy database did not set up correctly.

Can you try to run this command in the directory testers/no_end_contigs_with_viral_domain/:

conda activate cenote-taker2_env ktClassifyBLAST -o testers2.tax_guide.blastx.tab testers2.tax_guide.blastx.out cat testers2.tax_guide.blastx.tab

should produce this:

queryID taxID Avg. log e-value

testers2 1811230 -450

If this is throwing some error about the taxonomy database, try this series of commands with a job with 4 or more CPUs (it will take about 20-40 minutes, if I recall):

KRONA_DIR=$( which python | sed 's/bin\/python/opt\/krona/g' ) cd ${KRONA_DIR} sh updateTaxonomy.sh cd ${KRONA_DIR} sh updateAccessions.sh

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/mtisza1/Cenote-Taker2/issues/10#issuecomment-860714790, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABKKGSW5GIGCGREKHJEU7J3TSYEPFANCNFSM46ILQNMA.

mtisza1 commented 3 years ago

Hi Roli,

I've updated the Hallmark database to include several additional RdRp models. I was able to find your sequences with this update. Here's what to do:

conda activate cenote-taker2_env
cd Cenote-Taker2
git pull
python update_ct2_databases.py --hmm True

Let me know if you have any more questions. Thanks again for your interest and follow up.

All the best,

Mike