ropensci / taxadb

:package: Taxonomic Database
https://docs.ropensci.org/taxadb

Single-name requests on NCBI are very slow #112

Open tpoisot opened 1 year ago

tpoisot commented 1 year ago

I am benchmarking our NCBITaxonomy package for Julia, and I am finding the time it takes for taxadb to return a result to be very slow. Here is the code I used:

library(taxadb)

nm <- "Pan"

td_create("ncbi")  # one-time local import of the NCBI tables

start_time <- Sys.time()
filter_name(nm, provider = "ncbi")
end_time <- Sys.time()

end_time - start_time

This gives on average 10-12 seconds for a name. On the same request (using local flat files), our Julia package returns the results in 20ms (down to ~30μs with the most aggressive optimisations), and taxopy gets it done in about 100ms.

I had a brief chat with @karinorman, who suggested the R code is right. Any idea what could be causing such a long time to get the names back?

tpoisot commented 1 year ago

Using get_ids(nm, provider="ncbi") got it down to 4 seconds, but returns an ITIS identifier.

cboettig commented 1 year ago

Thanks, agreed that is very slow, but the short answer is that you will get the best performance with taxa_tbl(), e.g.


bench::bench_time({
  ncbi <- taxadb::taxa_tbl("ncbi")
  ncbi |> dplyr::filter(genus == "Pan") |> dplyr::collect()
})
#> process    real 
#>   355ms   198ms

bench::bench_time({
  ncbi <- taxadb::taxa_tbl("ncbi")
  ncbi |> dplyr::filter(grepl("Pan", scientificName)) |> dplyr::collect()
})
#>  process     real 
#>    2.29s 307.78ms

tpoisot commented 1 year ago

Thanks Carl - I am super uncomfortable with benchmarks for all of these reasons. In order to present a fair benchmark, here's the use case our package solves:

"We have a string of characters, which is the NCBI taxonomy node that matches them" (we do a lot more than that, but this is the most basic use-case).

What do you think would be the canonical way to do this, using taxadb, with the default options?

cboettig commented 1 year ago

Totally agree with you about benchmarks being challenging to pin down, as performance will vary with the scope of queries one supports, hardware, etc. My own philosophy is that less is more, and that benchmarks should be defined relative to the extensive existing art.

taxadb is interested in providing the scope of operations defined by SQL, and specifically the duckdb flavor of SQL. In particular, this means support for operations like joins and regular-expression matching. The benchmark of a single string match is not particularly helpful to me, because its scope is significantly narrower -- as you know, retrieving data associated with a key is the task of a key-value store, and can and should exploit a different architecture (e.g. something like LMDB).
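
For that narrower key-value task, here is a rough sketch of the kind of in-memory approach one could use instead (this is not a taxadb API; it just materializes the NCBI table once and indexes it by name, and it assumes a Darwin Core taxonID column alongside the scientificName column used above):

# One-time cost: pull the NCBI names and identifiers into memory.
ncbi_names <- taxadb::taxa_tbl("ncbi") |>
  dplyr::select(scientificName, taxonID) |>
  dplyr::collect()

# Index by name; repeated exact-name lookups then never touch the database.
ids <- setNames(ncbi_names$taxonID, ncbi_names$scientificName)
ids[["Pan troglodytes"]]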

Here is a taxadb example seeking to resolve all the names in BioTIME against its NCBI cache, though as this is just a filtering join, one should definitely be able to outperform this search of some 44K names:

library(dplyr)

# Download and extract the BioTIME query table, then open the CSV lazily with arrow.
download.file("https://biotime.st-andrews.ac.uk/downloads/BioTIMEQuery_24_06_2021.zip", "biotime.zip")
archive::archive_extract("biotime.zip")
biotime <- arrow::open_dataset("BioTIMEQuery_24_06_2021.csv", format="csv")

# The ~44K distinct scientific names in BioTIME.
sp <- biotime |> select(scientificName = GENUS_SPECIES) |> distinct() |> collect()

bench::bench_time({
  ncbi <- taxadb::taxa_tbl("ncbi")
  # Filtering join: keep only the NCBI rows whose scientificName appears in sp.
  ncbi |> inner_join(sp, copy=TRUE) |> collect()
})
#> Joining, by = "scientificName"
#> process    real 
#>    3.3s   1.33s

Our goal with taxadb is essentially to do nothing and be nothing -- the thinnest possible wrapper around the 'best' open source solutions that are already well optimized, and continue to be optimized, by well-funded professional teams like duckdb or arrow (it would be trivial to swap duckdb for arrow here; duckdb itself already has bindings for that, though arrow lacks the full SQL expressiveness of duckdb).
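
As a rough sketch of what that swap could look like (the on-disk path here is illustrative, not taxadb's actual storage layout):

# Export the NCBI table once to an arrow (parquet) dataset.
taxadb::taxa_tbl("ncbi") |> dplyr::collect() |> arrow::write_dataset("ncbi_arrow")

# The same dplyr verbs then run lazily against arrow instead of duckdb.
arrow::open_dataset("ncbi_arrow") |>
  dplyr::filter(genus == "Pan") |>
  dplyr::collect()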

(The very slow benchmarks above are clear examples where we've fallen short of this -- historical spandrels from before better options were available. With less code, they should become faster again. Better yet, maybe we should remove those functions entirely.)

I think this is as important for user experience as for performance. It is much better when users do not have to learn any custom functions (like get_names()) to work with a database, so this is another area where taxadb needs to strive to be closer to nothing. Most researchers / data scientists are, or should be, familiar with the basic operations that can and cannot be done with SQL (i.e. with dplyr for R users, which is so much a SQL translation tool that PRQL apes it).
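
As a small illustration of that point, the lazy table returned by taxa_tbl() can simply be asked for the SQL it generates (assuming, as the examples above suggest, that it is an ordinary dbplyr/duckdb table):

taxadb::taxa_tbl("ncbi") |>
  dplyr::filter(genus == "Pan") |>
  dplyr::show_query()
# prints the duckdb SQL dbplyr generates for this filter; no taxadb-specific verbs involved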

So what's a fair comparison? I think the way duckdb benchmarks itself against the TPC-H and TPC-DS benchmarks is probably a fair comparison of scope. There are similar benchmarks for arrow. Obviously there's a lot of thought and engineering going into this, and the great thing is that the benefits are far-reaching rather than specific to one data structure or even one language. Maybe in a while some other tool will displace both arrow and duckdb, in performance and in support across all the popular languages that those two libraries have, and then we can again migrate taxadb to that new backend.

okay end of sermon, apologies for getting carried away!

tpoisot commented 1 year ago

I see - I get the point of taxadb now. We are focused on situations where the name might not be fully known, or might be corrupted, uncertain, or reported with a lot of variance, which seems to be a different use case.

(and I do love a good sermon about research software!)

cboettig commented 1 year ago

Right, perhaps that is the case. As you know, some forms of partial matching are well-defined operations in the standard vocabulary of data analysis / SQL (e.g. regex), so taxadb seeks to leverage the fact that users might already be familiar with the space of such data cleaning that can be done with existing string-matching tools. What taxadb doesn't do is any clever taxonomy-specific logic.
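
For instance, a sketch of that kind of partial matching, staying entirely within dplyr/regex (the pattern here is purely illustrative):

# Match any binomial in the genus Pan, tolerating an unknown or partial epithet.
taxadb::taxa_tbl("ncbi") |>
  dplyr::filter(grepl("^Pan [a-z]+", scientificName)) |>
  dplyr::collect()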