ropensci / taxadb

:package: Taxonomic Database
https://docs.ropensci.org/taxadb

Unable to create "slb" and "wd" database #102

Open brunobrr opened 2 years ago

brunobrr commented 2 years ago

Hi,

I tried to download the "slb" and "wd" databases using different versions (2022, 2021, 2020, 2019), but all of them returned an error. For example:

td_create(provider = "slb", version = 2019, overwrite = TRUE)

"could not find 2019_dwc_slb, 2019_common_slb 
  checking for older versions.
2019_dwc_slb not available2019_common_slb not available"

By inspecting the number of records in each database, I noticed that the latest versions of "ncbi" and "col" have fewer records than older versions. I expected the latest version to have more records than the older ones.

library(dplyr)  # provides %>%, summarise() and n()

taxadb::taxa_tbl("ncbi", version = 2022) %>% summarise(n())    # 2950147
taxadb::taxa_tbl("ncbi", version = 2021) %>% summarise(n())    # 3461657

taxadb::taxa_tbl("col", version = 2022) %>% summarise(n())    # 807599
taxadb::taxa_tbl("col", version = 2021) %>% summarise(n())    # 3615220

Finally, I noticed that, probably because of problems with my internet connection, databases are sometimes created with fewer records than expected. For example, "ncbi" (v. 2022) had 32831 records instead of 2950147. I recognize that this is not a real issue, but it might be useful to check whether the database has the expected number of records before performing queries. Just an idea.
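To make the idea concrete, here is a minimal sketch of such a check. The helper `check_record_count()` is hypothetical and not part of taxadb; it assumes the caller already knows the expected count (for example, the 2950147 I saw for "ncbi" 2022), since as far as I know taxadb does not publish expected record counts.

```r
library(dplyr)

# Hypothetical helper, NOT a taxadb function: warn when a table has fewer
# rows than expected, which may indicate an interrupted download.
# `tolerance` is the fraction of missing rows that is still accepted.
check_record_count <- function(tbl, expected_n, tolerance = 0) {
  # summarise() + pull() works for data frames and for lazy database tables
  observed_n <- tbl %>% summarise(n = n()) %>% pull(n)
  observed_n <- as.numeric(observed_n)
  ok <- observed_n >= expected_n * (1 - tolerance)
  if (!ok) {
    warning(sprintf(
      "Table has %s records but at least %s were expected; the download may be incomplete.",
      observed_n, expected_n
    ))
  }
  invisible(ok)
}

# Demonstration on a small in-memory data frame
demo <- data.frame(scientificName = c("Homo sapiens", "Pan troglodytes"))
check_record_count(demo, expected_n = 2)

# With taxadb it could be called like this (requires the local database):
# check_record_count(taxadb::taxa_tbl("ncbi", version = 2022), expected_n = 2950147)
```

The function only warns instead of stopping, so it would not block queries when the expected count is simply out of date.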

cboettig commented 2 years ago

Thanks for the report, very helpful.

More precisely, it looks like the 2022 versions of NCBI have only the species names tables: names that resolve only to a higher taxon rank are not listed in the scientificName column (though they are still available from the dedicated rank columns):

> taxadb::taxa_tbl("ncbi") %>% count(taxonRank)
# Source:   lazy query [?? x 2]
# Database: duckdb_connection
  taxonRank       n
  <chr>       <dbl>
1 species   2950147

So I think we need to fix the 2022 tables for NCBI and COL.