taxoniq / taxoniq

Taxon Information Query - fast, offline querying of NCBI Taxonomy and related data
https://taxoniq.github.io/
MIT License

Accession id not found #17

Open marieBvr opened 2 years ago

marieBvr commented 2 years ago

Hi, I have been using Taxoniq successfully for some time, but now many of the accessions I look up are not being found. I use the following call: t = taxoniq.Taxon(accession_id="PSS23614.1")

And here is the list of accessions that are not found:

2022-07-12 10:30:32,131 ERROR Taxid not found with taxoniq for PSS23614.1
2022-07-12 10:30:32,131 ERROR Taxid not found with taxoniq for RVX10864.1
2022-07-12 10:30:32,131 ERROR Taxid not found with taxoniq for CBI30819.3
2022-07-12 10:30:32,131 ERROR Taxid not found with taxoniq for XP_034697749.1
2022-07-12 10:30:32,131 ERROR Taxid not found with taxoniq for XP_034697748.1
2022-07-12 10:30:32,132 ERROR Taxid not found with taxoniq for XP_002266388.1
2022-07-12 10:30:32,132 ERROR Taxid not found with taxoniq for RVX04110.1
2022-07-12 10:30:32,132 ERROR Taxid not found with taxoniq for XP_034708330.1
2022-07-12 10:30:32,133 ERROR Taxid not found with taxoniq for AQK92860.1
2022-07-12 10:30:32,133 ERROR Taxid not found with taxoniq for QKY74088.1
2022-07-12 10:30:32,133 ERROR Taxid not found with taxoniq for PWZ38118.1
2022-07-12 10:30:32,133 ERROR Taxid not found with taxoniq for AGH55661.1
2022-07-12 10:30:32,133 ERROR Taxid not found with taxoniq for PPR94295.1
2022-07-12 10:30:32,133 ERROR Taxid not found with taxoniq for RVX22827.1
2022-07-12 10:30:32,133 ERROR Taxid not found with taxoniq for RVW56456.1

I have updated the NT database following the documentation, but it didn't change anything. Any idea?

Thanks, Marie

aretchless commented 6 months ago

I have a similar issue. In my case, it looks like Taxoniq treats the version as an essential part of the accession ID and does not include older versions. For instance, I have results from a search against a one-year-old database, which include a hit to the sequence 'NC_023861.1'. Taxoniq does not recognize it (just installed using the nr/nt databases from version 0.6.1). I searched for NC_023861.1 on NCBI Nucleotide and saw that it has been replaced by a newer sequence. Taxoniq recognizes the newer version (NC_023861.2), but does not recognize the root accession number (NC_023861).

marieBvr commented 6 months ago

Hi @aretchless, If it ever helps: I stopped using taxoniq even though it was very effective.

I use NCBITaxa from ete3, which is much more up to date. I also use the Entrez esummary command from Biopython.

kislyuk commented 1 week ago

Hi @marieBvr - it looks like you are trying to use Taxoniq to query for protein sequence accession IDs. Taxoniq was never designed to work with protein sequences and their IDs, only nucleotide ones. I spot-checked your list, and all of the accession IDs seem to produce results in nr, not in nt. Let me know if that sounds incorrect.
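For anyone who wants to screen a list before querying, here is a rough sketch of telling protein accessions from nucleotide ones by their shape. It is not part of Taxoniq: the function name `classify_accession` is hypothetical, and the patterns are assumptions based on the common NCBI accession formats (three letters plus five or seven digits for GenBank proteins, RefSeq prefixes such as `XP_` for proteins and `NC_` for nucleotides). It is not exhaustive; WGS and other formats come back as "unknown".

```python
import re

# Assumed RefSeq prefixes, taken from the commonly documented NCBI conventions.
REFSEQ_PROTEIN = {"AP", "NP", "XP", "YP", "WP", "ZP"}
REFSEQ_NUCLEOTIDE = {"AC", "NC", "NG", "NM", "NR", "NT", "NW", "NZ", "XM", "XR"}

# (number of letters, number of digits) for plain GenBank-style accessions.
PROTEIN_SHAPES = {(3, 5), (3, 7)}
NUCLEOTIDE_SHAPES = {(1, 5), (2, 6), (2, 8)}


def classify_accession(accession: str) -> str:
    """Return 'protein', 'nucleotide', or 'unknown' for an accession ID."""
    root = accession.split(".", 1)[0]  # drop the version suffix, if any
    if "_" in root:
        prefix = root.split("_", 1)[0]
        if prefix in REFSEQ_PROTEIN:
            return "protein"
        if prefix in REFSEQ_NUCLEOTIDE:
            return "nucleotide"
        return "unknown"
    match = re.fullmatch(r"([A-Z]+)(\d+)", root)
    if match is None:
        return "unknown"
    shape = (len(match.group(1)), len(match.group(2)))
    if shape in PROTEIN_SHAPES:
        return "protein"
    if shape in NUCLEOTIDE_SHAPES:
        return "nucleotide"
    return "unknown"


if __name__ == "__main__":
    for acc in ["PSS23614.1", "XP_034697749.1", "NC_023861.1"]:
        print(acc, classify_accession(acc))
```

Accessions classified as "protein" would be expected to miss in a nucleotide-only index, so they can be routed to another tool instead of being logged as errors.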

I will note that one reason Taxoniq has not been updated as frequently is that unlike ete3 (ETE Toolkit) and Entrez, Taxoniq receives no funding and is updated in my spare time. Perhaps I should apply for a grant from my ex-employer, CZI, where I originally developed Taxoniq!

@aretchless your issue is different. This is an unfortunate side effect of an optimization that I made, where an unversioned accession ID is assumed to be version 1 by default. I could add a heuristic to scan for subsequent versions if there is no hit for an unversioned ID query, but the heuristic would be inherently limited because it's unclear at what version number to give up, and a lengthy sweep will introduce a performance hit in some applications.
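To make the trade-off concrete, a bounded version scan could look roughly like the sketch below. This is not Taxoniq code: `resolve_unversioned`, the `lookup` callable, and the `max_version` cutoff are all hypothetical, and the toy index holds a placeholder value rather than a real taxon ID.

```python
from typing import Callable, Optional


def resolve_unversioned(
    root: str,
    lookup: Callable[[str], Optional[int]],
    max_version: int = 5,
) -> Optional[str]:
    """Try root.1, root.2, ... and return the first versioned ID that lookup knows.

    Returns None if nothing is found up to max_version. The cutoff is
    arbitrary, which is exactly the weakness of this heuristic.
    """
    for version in range(1, max_version + 1):
        candidate = f"{root}.{version}"
        if lookup(candidate) is not None:
            return candidate
    return None


if __name__ == "__main__":
    # Toy index: only version 2 is present; the value 1 is a placeholder.
    toy_index = {"NC_023861.2": 1}
    print(resolve_unversioned("NC_023861", toy_index.get))
```

Every miss costs up to `max_version` lookups, and an accession whose only indexed version is above the cutoff is still reported as not found.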

Instead, the better solution seems to be to reindex all accessions at index build time, so that an unversioned accession ID is not assumed to be version 1, but instead always points to the latest version. This should be achievable at zero cost to the index size. I'll work on that next.
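As an illustration of the idea only, the sketch below records the highest version seen for each root accession at build time and uses it to resolve unversioned queries. The names `build_latest_version_index` and `resolve` are hypothetical, a plain dict stands in for the real index, and the sketch makes no attempt to show the zero-size-cost property.

```python
from typing import Dict, Iterable, Optional


def build_latest_version_index(accessions: Iterable[str]) -> Dict[str, int]:
    """Map each root accession to the highest version number seen."""
    latest: Dict[str, int] = {}
    for accession in accessions:
        root, _, version = accession.partition(".")
        number = int(version) if version else 1  # no suffix: treat as version 1
        if number > latest.get(root, 0):
            latest[root] = number
    return latest


def resolve(accession: str, latest: Dict[str, int]) -> Optional[str]:
    """Return a versioned ID; an unversioned query maps to the latest version."""
    if "." in accession:
        return accession  # already versioned, leave as-is
    number = latest.get(accession)
    return f"{accession}.{number}" if number is not None else None


if __name__ == "__main__":
    latest = build_latest_version_index(["NC_023861.1", "NC_023861.2"])
    print(resolve("NC_023861", latest))
```

With this mapping, a query for the root accession costs a single extra lookup regardless of how many versions exist.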