ropensci / taxadb

:package: Taxonomic Database
https://docs.ropensci.org/taxadb

NCBI taxonomy - missing records #76

Closed wykhuh closed 3 years ago

wykhuh commented 4 years ago

Hi,

I'm playing around with taxadb and noticed that only 300K of the 2 million NCBI taxa are in the taxadb NCBI database. Any idea why only 15% of the NCBI taxa are included in the taxadb NCBI database?

cboettig commented 4 years ago

NCBI data is created by https://github.com/ropensci/taxadb/blob/master/data-raw/ncbi.R; you're welcome to take a look, as that will give you the most precise answer.

For example, NCBI recognizes over 3 million names but only a bit over 2 million unique ids. NCBI assigns a taxonID at every rank level and recognizes over 40 ranks, plus many names that have no rank associated. taxadb follows Darwin Core and recognizes only 7 ranks. taxadb also includes only taxonomic names that are either considered 'accepted' or are synonyms that can be mapped to an accepted name.

We're due to run that again to produce 2020 snapshots soon anyway (current snapshots are from 2019), so I'll try to report in more detail after updating.

wykhuh commented 3 years ago

Hi @cboettig. I work with an eDNA project that uses NCBI as its taxonomy, so I'm aware of the quirky nature of NCBI taxonomy. :-) I wrote recursive scripts in both Ruby and Python to process the NCBI dumps and assign the accepted "scientific name" at each of the 7 taxonomic ranks for every taxon id.
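The recursive approach described above could look roughly like the following. This is a minimal sketch, not the scripts mentioned in this comment: the `NODES` and `NAMES` tables are toy stand-ins for the NCBI `nodes.dmp` and `names.dmp` dumps with made-up ids and an abbreviated lineage, the function names are hypothetical, and the seven rank names are an assumption about which ranks are meant.

```python
# Assumed set of seven ranks; the thread does not list them explicitly.
SEVEN_RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")

# Toy parent table: tax_id -> (parent_id, rank). Ids are illustrative only.
NODES = {
    1: (1, "no rank"),    # the root is its own parent
    10: (1, "kingdom"),
    20: (10, "phylum"),
    25: (20, "clade"),    # rank outside the seven, skipped in the output
    30: (25, "class"),
    40: (30, "order"),
    50: (40, "family"),
    60: (50, "genus"),
    70: (60, "species"),
}

# Toy table of "scientific name" entries: tax_id -> name.
NAMES = {
    1: "root", 10: "Metazoa", 20: "Chordata", 25: "Craniata", 30: "Mammalia",
    40: "Primates", 50: "Hominidae", 60: "Homo", 70: "Homo sapiens",
}


def lineage(tax_id, nodes=NODES, names=NAMES, cache=None):
    """Return {rank: scientific name} for tax_id by climbing parent links."""
    if cache is None:
        cache = {}
    if tax_id in cache:
        return cache[tax_id]
    parent_id, rank = nodes[tax_id]
    # Stop at the root; otherwise start from a copy of the parent's lineage.
    if parent_id == tax_id:
        result = {}
    else:
        result = dict(lineage(parent_id, nodes, names, cache))
    if rank in SEVEN_RANKS:
        result[rank] = names[tax_id]
    cache[tax_id] = result
    return result


def classify_all(nodes=NODES, names=NAMES):
    """Build the lineage for every taxon id, sharing one cache across calls."""
    cache = {}
    return {tax_id: lineage(tax_id, nodes, names, cache) for tax_id in nodes}


if __name__ == "__main__":
    for tax_id, ranks in classify_all().items():
        print(tax_id, ranks)
```

Every id in the parent table gets a row, including leaves and unranked nodes, so the output has as many records as there are taxon ids. The shared cache means each node's lineage is computed once even though many taxa share ancestors.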

I looked at ncbi.R and was a little confused about what is going on. From what I can tell, multiple intermediary data frames are created in order to avoid a recursive loop. One of these intermediary data frames isn't doing what is expected, which results in only 300K records.

recursive_ncbi_ids seems ok. It has 2.2 million records. ncbi_taxonid only has 197K records, which seems wrong. ncbi_taxonid is based on ncbi_long, which has 112 million records after joining ncbi, long_hierarchy and expand, so the problem might be in that code block.

cboettig commented 3 years ago

@wykhuh yup, thanks for the update, found the bug! Looks like we dropped things that didn't have a child taxon, here: https://github.com/ropensci/taxadb/blob/master/data-raw/ncbi.R#L148. Should have a fix up soon.
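To make that class of bug concrete, here is a small illustration of how keeping only ids that appear as someone's parent silently drops every leaf. It is a sketch of the general failure mode, not the code at the linked line: the `EDGES` data and the function names are hypothetical.

```python
# Toy (tax_id, parent_id) pairs; ids are illustrative only.
EDGES = [(10, 1), (20, 10), (30, 20), (31, 20), (32, 20)]


def ids_with_children(edges):
    """Ids that some row names as its parent."""
    return {parent for _, parent in edges}


def buggy_ids(edges):
    """Keep a taxon only if it has a child, which discards all leaves."""
    parents = ids_with_children(edges)
    return sorted(tax_id for tax_id, _ in edges if tax_id in parents)


def fixed_ids(edges):
    """Keep every taxon, whether or not anything descends from it."""
    return sorted(tax_id for tax_id, _ in edges)


if __name__ == "__main__":
    kept = buggy_ids(EDGES)
    everything = fixed_ids(EDGES)
    print("kept:", kept)
    print("dropped:", [t for t in everything if t not in kept])
```

In a tree-shaped taxonomy most nodes tend to be leaves, so a filter like this would plausibly remove the majority of records, which fits the gap reported in this issue.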

cboettig commented 3 years ago

okay, the 2019 ncbi data has been patched. Try:

 taxadb::td_create("ncbi", overwrite = TRUE)

Check entries:

library(dplyr)
taxadb::taxa_tbl("ncbi") %>% summarise(n())

# # Source:   lazy query [?? x 1]
# Database: duckdb_connection
#    `n()`
#    <dbl>
# 3085711

thanks again for the bug report.