ropensci / taxadb

:package: Taxonomic Database
https://docs.ropensci.org/taxadb

Single-name requests on NCBI are very slow #112

Open tpoisot opened 1 year ago

tpoisot commented 1 year ago

I am benchmarking our NCBITaxonomy package for Julia, and I am finding the time it takes for taxadb to return a result to be very slow. Here is the code I used:

library(taxadb)

nm <- "Pan"

td_create("ncbi")  # one-time local import of the NCBI tables

start_time <- Sys.time()
filter_name(nm, provider = "ncbi")
end_time <- Sys.time()

end_time - start_time

This gives on average 10-12 seconds for a name. On the same request (using local flat files), our Julia package returns the results in 20ms (down to ~30μs with the most aggressive optimisations), and taxopy gets it done in about 100ms.

I had a brief chat with @karinorman, who suggested the R code is right. Any idea what could be causing such a long time to get the names back?

tpoisot commented 1 year ago

Using get_ids(nm, provider="ncbi") got it down to 4 seconds, but returns an ITIS identifier.

cboettig commented 1 year ago

Thanks, agreed that is very slow, but the short answer is that you will get the best performance with taxa_tbl(), e.g.


bench::bench_time({
  ncbi <- taxadb::taxa_tbl("ncbi")
  ncbi |> dplyr::filter(genus == "Pan") |> dplyr::collect()
})
#> process    real 
#>   355ms   198ms

bench::bench_time({
  ncbi <- taxadb::taxa_tbl("ncbi")
  ncbi |> dplyr::filter(grepl("Pan", scientificName)) |> dplyr::collect()
})
#>  process     real 
#>    2.29s 307.78ms

tpoisot commented 1 year ago

Thanks Carl - I am super uncomfortable with benchmarks for all of these reasons. In order to present a fair benchmark, here's the use case our package solves:

"We have a string of characters, which is the NCBI taxonomy node that matches them" (we do a lot more than that, but this is the most basic use-case).

What do you think would be the canonical way to do this, using taxadb, with the default options?

cboettig commented 1 year ago

Totally agree with you about benchmarks being challenging to pin down, as performance will vary with the scope of queries one supports, hardware, etc. My own philosophy is that less is more, and that benchmarks should be defined relative to the extensive existing art.

taxadb is interested in providing the scope of operations defined by SQL, and specifically the duckdb flavor of SQL. In particular, this means support for operations like joins and regular-expression matching. The benchmark of a single string match is not particularly helpful to me, because its scope is significantly narrower -- as you know, retrieving data associated with a key is the task of a key-value store, and can and should exploit a different architecture (e.g. something like LMDB).
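
For that narrower key-value task, here is a rough sketch of the kind of in-memory approach one could use instead (this is not a taxadb API; it just materializes the NCBI table once and indexes it by name, and it assumes a Darwin Core taxonID column alongside the scientificName column used above):

# One-time cost: pull the NCBI names and identifiers into memory.
ncbi_names <- taxadb::taxa_tbl("ncbi") |>
  dplyr::select(scientificName, taxonID) |>
  dplyr::collect()

# Index by name; repeated exact-name lookups then never touch the database.
ids <- setNames(ncbi_names$taxonID, ncbi_names$scientificName)
ids[["Pan troglodytes"]]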

Here is a taxadb example seeking to resolve all the names in BioTIME against its NCBI cache, though as this is just a filtering join, one should definitely be able to outperform this search of some 44K names:

library(dplyr)

# Download and extract the BioTIME query table, then open the CSV lazily with arrow.
download.file("https://biotime.st-andrews.ac.uk/downloads/BioTIMEQuery_24_06_2021.zip", "biotime.zip")
archive::archive_extract("biotime.zip")
biotime <- arrow::open_dataset("BioTIMEQuery_24_06_2021.csv", format="csv")

# The ~44K distinct scientific names in BioTIME.
sp <- biotime |> select(scientificName = GENUS_SPECIES) |> distinct() |> collect()

bench::bench_time({
  ncbi <- taxadb::taxa_tbl("ncbi")
  # Filtering join: keep only the NCBI rows whose scientificName appears in sp.
  ncbi |> inner_join(sp, copy=TRUE) |> collect()
})
#> Joining, by = "scientificName"
#> process    real 
#>    3.3s   1.33s

Our goal with taxadb is essentially to do nothing and be nothing -- the thinnest possible wrapper around the 'best' open source solutions that are already well optimized, and continue to be optimized, by well-funded professional teams like duckdb or arrow (it would be trivial to swap duckdb for arrow here; duckdb itself already has bindings for that, though arrow lacks the full SQL expressiveness of duckdb).
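
As a rough sketch of what that swap could look like (the on-disk path here is illustrative, not taxadb's actual storage layout):

# Export the NCBI table once to an arrow (parquet) dataset.
taxadb::taxa_tbl("ncbi") |> dplyr::collect() |> arrow::write_dataset("ncbi_arrow")

# The same dplyr verbs then run lazily against arrow instead of duckdb.
arrow::open_dataset("ncbi_arrow") |>
  dplyr::filter(genus == "Pan") |>
  dplyr::collect()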

(The very slow benchmarks above are clear examples where we've fallen short of this -- historical spandrels from before better options were available. With less code, they should become faster again. Better yet, maybe we should remove those functions entirely.)

I think this is as important for user experience as for performance. It is much better when users do not have to learn any custom functions (like get_names()) to work with a database, so this is another area where taxadb needs to strive to be closer to nothing. Most researchers / data scientists are, or should be, familiar with the basic operations that can and cannot be done with SQL (i.e. with dplyr for R users, which is so much a SQL translation tool that PRQL apes it).
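
As a small illustration of that point, the lazy table returned by taxa_tbl() can simply be asked for the SQL it generates (assuming, as the examples above suggest, that it is an ordinary dbplyr/duckdb table):

taxadb::taxa_tbl("ncbi") |>
  dplyr::filter(genus == "Pan") |>
  dplyr::show_query()
# prints the duckdb SQL dbplyr generates for this filter; no taxadb-specific verbs involved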

So what's a fair comparison? I think the way duckdb benchmarks itself against the TPC-H and TPC-DS benchmarks is probably a fair comparison of scope. There are similar benchmarks for arrow. Obviously there's a lot of thought and engineering going into this, and the great thing is that the benefits are far-reaching rather than specific to one data structure or even one language. Maybe in a while some other tool will displace both arrow and duckdb, in performance and in support across all the popular languages that those two libraries have, and then we can again migrate taxadb to that new backend.

okay end of sermon, apologies for getting carried away!

tpoisot commented 1 year ago

I see - I get the point of taxadb now. We are focused on situations where the name might not be fully known, or might be corrupted, uncertain, or reported with a lot of variance, which seems to be a different use case.

(and I do love a good sermon about research software!)

cboettig commented 1 year ago

Right, perhaps that is the case. As you know, some forms of partial matching are well-defined operations in the standard vocabulary of data analysis / SQL (e.g. regex), so taxadb seeks to leverage the fact that users might already be familiar with the space of such data cleaning that can be done with existing string-matching tools. What taxadb doesn't do is any clever taxonomy-specific logic.
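
For instance, a sketch of that kind of partial matching, staying entirely within dplyr/regex (the pattern here is purely illustrative):

# Match any binomial in the genus Pan, tolerating an unknown or partial epithet.
taxadb::taxa_tbl("ncbi") |>
  dplyr::filter(grepl("^Pan [a-z]+", scientificName)) |>
  dplyr::collect()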