mtisza1 / Cenote-Taker2

Cenote-Taker2: Discover and Annotate Divergent Viral Contigs (Please use Cenote-Taker 3 instead)
MIT License

BlastN "no high coverage hits" #22

Closed M-K1 closed 1 year ago

M-K1 commented 2 years ago

Hello Mike,

I'm using Cenote-Taker2 to identify viral contigs and detect certain virus species. I've been working with your test data to see whether I could get the tool running on our server, but I have one problem with the results. With the default parameters, as you advise on the wiki, the organism name is not something I can really work with. You provide the option to run blastn for a more specific result, but the BLAST result is always "no high coverage hits", both with my own data and with your provided test data. I looked into this and came across issue https://github.com/mtisza1/Cenote-Taker2/issues/15, but I still couldn't get BLASTN_INFO to display anything other than that result. I've read in your paper that the pipeline:

marks contigs with at least 90 per cent average nucleotide identity to existing database entries.

The blastn results in the intermediate files show only percent identities above 90%, so I am wondering whether I am doing something wrong. Could you elaborate on how Cenote-Taker2 uses blastn?
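For reference, the identities in an intermediate file could be double-checked with a short Python sketch like the one below. It assumes the file is tab-separated BLAST output with the percent identity in the third column (the default tabular layout). The function name is hypothetical, and the 90% cutoff is taken from the quoted sentence; this is not Cenote-Taker2's own logic.

```python
import sys


def best_identity_per_query(lines, min_identity=90.0):
    """Return {query_id: (best_percent_identity, passes_cutoff)}.

    Assumes tab-separated BLAST output: column 1 is the query ID and
    column 3 is the percent identity. Blank lines and lines starting
    with '#' are skipped.
    """
    best = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a tabular hit line
        query_id, pident = fields[0], float(fields[2])
        if query_id not in best or pident > best[query_id]:
            best[query_id] = pident
    return {q: (p, p >= min_identity) for q, p in best.items()}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path to a BLAST tabular file as the first argument.
    with open(sys.argv[1]) as handle:
        for query, (pident, passes) in best_identity_per_query(handle).items():
            print(query, pident, "pass" if passes else "below cutoff", sep="\t")
```

A file in which every query passes this check but BLASTN_INFO still reads "no high coverage hits" would suggest the problem lies after the alignment step.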

My command:

python run_cenote-taker2.py -c testcontigs_DNA_ct2.fasta -r test_DNA_ct_3 -p True -m 16 -t 16 --known_strains blast_knowns --blastn_db /lustre/BIF/nobackup/kon001/thesis/Databases/NCBI_NT/nt | tee test_DNA_ct_3_output.log

Log file: test_DNA_ct_3_output.log

Thx in advance,

Matthijs

mtisza1 commented 2 years ago

Hi Matthijs,

Thank you for opening this issue.

First, let me apologize for the delay in replying. I've been extremely busy lately and have had to pause replying to Cenote-Taker 2 issues temporarily. I will be "back" to quick responses and updates(!) at the end of February.

I looked at your log and I can't see anything funny going on. Based on what you said, BLASTN ran and produced the appropriate alignments. I think the "no high coverage hits" could be occurring if your installation of the krona databases didn't work or if efetch is not properly connecting to the NCBI server.

To check whether the krona database is installed, activate the cenote-taker2_env and find any file ending in .blastn_intraspecific.out (e.g. in the DTR_contigs_with_viral_domain directory of your output). Then run a command like this:

ktClassifyBLAST -o test1.tab test_blastn_1ct2.blastn_intraspecific.out

If this doesn't work, you'll have to install/update the krona databases. Change to the main Cenote-Taker2 directory and use these commands (This requires at least 4 CPUs for some reason and will take 20-40 minutes, so please have those resources available):

KRONA_DIR=$( which python | sed 's/bin\/python/opt\/krona/g' )
cd ${KRONA_DIR}
sh updateTaxonomy.sh
cd ${KRONA_DIR}
sh updateAccessions.sh
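As a rough cross-check after running the commands above, the Python sketch below mirrors the sed substitution to locate the krona directory and lists which taxonomy files are absent. The file names `taxonomy.tab` and `all.accession2taxid.sorted`, the `taxonomy` subdirectory, and both function names are assumptions rather than something confirmed in this thread.

```python
import os
import shutil

# Assumed names of the files written by updateTaxonomy.sh and updateAccessions.sh.
EXPECTED_FILES = ("taxonomy.tab", "all.accession2taxid.sorted")


def krona_dir_from_python(python_path):
    """Mirror the sed call above: swap 'bin/python' for 'opt/krona'."""
    return python_path.replace("bin/python", "opt/krona")


def missing_krona_files(krona_dir, expected=EXPECTED_FILES):
    """Return the expected files that are absent from <krona_dir>/taxonomy."""
    taxonomy_dir = os.path.join(krona_dir, "taxonomy")
    return [name for name in expected
            if not os.path.isfile(os.path.join(taxonomy_dir, name))]


if __name__ == "__main__":
    # shutil.which("python") plays the role of `which python`
    # inside the activated environment.
    python_path = shutil.which("python")
    if python_path is None:
        print("python not found on PATH; is the environment activated?")
    else:
        krona_dir = krona_dir_from_python(python_path)
        missing = missing_krona_files(krona_dir)
        print(krona_dir, "missing:", ", ".join(missing) if missing else "nothing")
```

If either file is reported missing, rerunning the corresponding update script would be the first thing to try.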

To check efetch, activate the cenote-taker2_env and input this command:

efetch -db taxonomy -id 133704 -format xml | xtract -pattern Taxon -block "*/Taxon" -tab "\n" -element TaxId,ScientificName,Rank
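If the command above fails with a "command not found" style error, the tools may simply not be on the PATH of the activated environment. A minimal Python sketch for checking that follows; the function name is hypothetical, and the tool list is just the commands mentioned in this thread.

```python
import shutil

# Commands used by the checks in this thread.
TOOLS = ("ktClassifyBLAST", "efetch", "xtract")


def find_missing_tools(tools=TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    print("missing tools:", ", ".join(missing) if missing else "none")
```

This only confirms that the executables can be found. It does not test whether efetch can actually reach the NCBI server.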

Other explanations are possible, however, and you can email me a compressed copy of the test run's output directory so I can inspect it.

best,

Mike

M-K1 commented 2 years ago

Hey Mike,

Thanks for your advice. Running updateTaxonomy.sh and updateAccessions.sh in the cenote-taker2 conda environment allowed me to get correct BLASTN outputs. Another thing I'm wondering is what the ORGANISM_NAME is based on, as I have searched the usual databases and couldn't find a match. Can you tell me what these names are based on?

Thx,

Matthijs