marbl / Krona

Interactively explore metagenomes and more from a web browser.
https://github.com/marbl/Krona/wiki

Accessions not found in local database #143

Open jw51 opened 4 years ago

jw51 commented 4 years ago

I recently updated my nr and KronaTools databases, and when I used ktClassifyBLAST to classify my BLAST results I got a warning that hundreds of accession numbers were not found in my local database. I checked a few, and they all seem to correspond to TPA (third-party annotation) proteins from the same paper (A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Parks et al 2018) and have not been recently added to NCBI ... However, they are not found in the earlier version of my databases. I don't think it is related to KronaTools itself, as the accession numbers are indeed not found in the all.accession2taxid.sorted file... Did NCBI only recently add these proteins to the nr database, but not to the prot.accession2taxid.gz file? Is anybody else experiencing the same issue?

DanielePietrucci89 commented 4 years ago

Hi jw51! I'm experiencing the same issue. I've downloaded the latest version of KronaTools and updated the Taxonomy and the Accessions. The local database is missing a lot of accessions, which I suspect correspond to TPA proteins. Proteins from other experiments are also missing. For example, the accession WP_121469408.1 (https://www.ncbi.nlm.nih.gov/protein/WP_121469408.1), a hypothetical protein associated with taxID 1560005 (https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1560005), is not present in the all.accession2taxid.sorted file.

I tried with:

grep "WP_121469408.1" all.accession2taxid.sorted

but the accession is not in the file. This is a problem, because I can't classify a lot of my sequences.

jw51 commented 4 years ago

I haven't found a solution yet...

DanielePietrucci89 commented 4 years ago

It's strange, because if I download the file "prot.accession2taxid.gz" directly from the NCBI FTP site (ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/)

and run:

zgrep "WP_121469408.1" prot.accession2taxid.gz

I can find the accession in the file.

So the accession is present in the NCBI data. I have no idea why it is missing from all.accession2taxid.sorted after running updateAccessions.sh.

If you can give me some TPA protein accessions, I can check whether they are present in the "prot.accession2taxid.gz" file downloaded directly from NCBI.
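A batch version of this check could look like the sketch below. It relies on the published layout of NCBI's prot.accession2taxid.gz (tab-separated columns accession, accession.version, taxid, gi, with one header line). The function name `missing_accessions` and its argument order are made up for illustration and are not part of KronaTools.

```shell
# List accessions from a file (one per line) that do not appear in an
# NCBI-style accession2taxid dump.
# Layout relied on: tab-separated accession, accession.version, taxid, gi.
# Hypothetical helper; the name and argument order are illustrative only.
# The query list must not be empty (the NR == FNR test relies on that).
missing_accessions() {
    list="$1"
    dump="$2"
    gzip -dc "$dump" | awk -F '\t' '
        NR == FNR { want[$1] = 1; next }   # first input: the query list
        ($1 in want) || ($2 in want) {     # match with or without version
            seen[$1] = 1
            seen[$2] = 1
        }
        END { for (a in want) if (!(a in seen)) print a }
    ' "$list" - | sort
}
```

Calling `missing_accessions queries.txt prot.accession2taxid.gz` (with `queries.txt` as a placeholder file name) prints only the accessions absent from the dump. Only the query list is held in memory, so the large dump is streamed once.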

jw51 commented 4 years ago

Hi, sorry for the delay. Some TPA examples include: HCN16260, HCF14441, HAZ50255, HAM16036.

jw51 commented 4 years ago

Hi, I'm still having the same issue.

I downloaded a new nr database and Krona database, reinstalled KronaTools, and also installed KronaTools through conda, but nothing solved the issue.

Could anybody confirm that they are NOT having this issue with a recent Krona database? Or are we really the only ones...

ondovb commented 4 years ago

@MetaDan89 what are some of the other missing accessions? WP_121469408.1 should be there, but accessions are stored without the version (the .1 suffix), which matters if you're trying to grep. Try:

>ktGetTaxIDFromAcc WP_121469408.1
1560005
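For a manual check that accounts for the missing version suffix, something like the sketch below may help. It assumes the local table is tab-separated with the unversioned accession in column 1 and the taxID in column 2; that layout is an assumption, not something confirmed in this thread. `lookup_taxid` is a hypothetical helper, not a KronaTools command.

```shell
# Print the taxID for an accession from a local accession/taxID table.
# Assumptions (not confirmed by the thread): the table is tab-separated,
# column 1 is the accession WITHOUT its version, column 2 is the taxID.
# Hypothetical helper; not part of KronaTools.
lookup_taxid() {
    acc="${1%.*}"    # drop everything from the last dot, e.g. ".1"
    table="$2"
    awk -F '\t' -v acc="$acc" '
        $1 == acc { print $2; found = 1; exit }
        END { exit !found }                # non-zero status when absent
    ' "$table"
}
```

Usage would be `lookup_taxid WP_121469408.1 all.accession2taxid.sorted`. Matching the whole first column avoids the partial matches a plain grep could return.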

If the missing ones are all TPAs that are not in the NCBI tax lookup, then there's not a lot that can be done about them unfortunately.

jw51 commented 4 years ago

Here are some other ones: NAH76232 EFG1045630 MSM27211 EEZ5301706 EEW2049631 EFJ9541260 EES1712301 MTV91234. None of them are TPAs and their origin is known (species is in the name), so I don't really see why there can't be a taxid for them?

Is there any way for ktClassifyBLAST to drop these hits without a taxid? All of my sequences become "root" if they have hits like this, even though they also have good hits with taxonomic identifiers...

ondovb commented 4 years ago

These hits are now dropped by default in the latest code (see #150). Official release coming soon.
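For installations that predate this change, pre-filtering the BLAST table might serve as a stopgap. The sketch below assumes tabular BLAST output with a bare accession (optionally versioned) in column 2, and a local table with unversioned accessions in column 1. Both layouts are assumptions, and `filter_known_hits` is a hypothetical helper, not the logic ktClassifyBLAST uses.

```shell
# Keep only BLAST hits whose subject accession exists in a local table.
# Assumptions: tab-separated BLAST output with the subject accession in
# column 2; table with the unversioned accession in column 1.
# Hypothetical helper; not how ktClassifyBLAST itself drops hits.
filter_known_hits() {
    blast="$1"
    table="$2"
    awk -F '\t' '
        pass == 1 {                        # collect accessions that were hit
            acc = $2; sub(/\.[0-9]+$/, "", acc)
            wanted[acc] = 1; next
        }
        pass == 2 {                        # stream the table once
            if ($1 in wanted) known[$1] = 1
            next
        }
        pass == 3 {                        # re-read hits, print known ones
            acc = $2; sub(/\.[0-9]+$/, "", acc)
            if (acc in known) print
        }
    ' pass=1 "$blast" pass=2 "$table" pass=3 "$blast"
}
```

The BLAST file is read twice so that only the accessions actually hit are kept in memory, rather than the whole lookup table.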

ondovb commented 3 years ago

This is now in a release (v2.8). Also see this post in #150 for a temporary database fix that will allow all the above accessions to be found. NCBI will likely move these into the main database soon.

vappiah commented 1 year ago

Dear Developers,

I am experiencing a similar problem. I am using Krona 2.8.1, and below is the message I got:

[ WARNING ] The following taxonomy IDs were not found in the local database and were set to root (if they were recently added to NCBI, use updateTaxonomy.sh to update the local database): 155 116 383 12 5 18029 18232 3 17051 36 58 15 16726 17020 17754 17989 4 17292 66 70
Writing krona.html..

Please advise. Thanks