Best practices for querying multiple taxa

alexkrohn commented 5 months ago

Hi there.

I have a data frame with thousands of species-level detections from various US states. For each taxon in each state, I'd like to query NatureServe to extract the state-level status.

Querying each taxon individually with ns_search_spp is very slow. What is the best practice to query multiple taxa at once?

Simple example:


library(natserv)
library(dplyr)

species.df <- data.frame(species = c("Turdus migratorius",
                                     "Terrapene carolina",
                                     "Thamnophis sirtalis"),
                         state = c("FL", "GA", "NJ"))

get_status <- function(spp, state){
  # Do the search
  nat.serv.result <- ns_search_spp(text = spp)

  # Extract the state status from the first nation of the first hit
  state.statuses <- nat.serv.result$results$nations[[1]]$subnations[[1]]
  state.statuses[which(state.statuses$subnationCode == state),]$roundedSRank
}

tictoc::tic()
mapply(get_status, species.df$species, species.df$state)
tictoc::toc()

# 3.112 seconds

That is ~1 second per species. Is there a faster/better way to do this if I have thousands of species-state combinations?

This example ignores the fact that Thamnophis sirtalis has multiple nations, including multiple entries for the US, so there is probably a better way to find just the entries where subnationCode == "NJ". I haven't figured out the best way to expand those nested data frames from the list...
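
One possible way to expand the nested data frames is sketched below in base R. It runs against a hand-built mock that imitates the shape described above (a list of per-hit nation data frames, each with a `subnations` list-column), so it needs no network access. The mock ranks, the extra `hit` column, and the helper name `flatten_subnations` are made up for illustration, and the real result may carry more columns than shown here.

```r
# Mock of `result$results$nations`: one element per search hit, each a data
# frame with one row per nation and a list-column `subnations`.
# Assumption: this mirrors the real structure; all rank values are invented.
us1 <- data.frame(subnationCode = c("NJ", "GA"), roundedSRank = c("S5", "S4"),
                  stringsAsFactors = FALSE)
ca1 <- data.frame(subnationCode = "ON", roundedSRank = "S5",
                  stringsAsFactors = FALSE)
us2 <- data.frame(subnationCode = "NJ", roundedSRank = "S3",
                  stringsAsFactors = FALSE)

hit1 <- data.frame(nationCode = c("US", "CA"), stringsAsFactors = FALSE)
hit1$subnations <- list(us1, ca1)
hit2 <- data.frame(nationCode = "US", stringsAsFactors = FALSE)
hit2$subnations <- list(us2)
nations <- list(hit1, hit2)

# Stack every subnation table into one flat data frame, remembering which
# hit and which nation each row came from. Returns NULL if nothing is found.
flatten_subnations <- function(nations) {
  rows <- list()
  for (hit in seq_along(nations)) {
    nat <- nations[[hit]]
    for (i in seq_len(nrow(nat))) {
      sub <- nat$subnations[[i]]
      if (is.null(sub) || nrow(sub) == 0) next
      sub$nationCode <- nat$nationCode[i]
      sub$hit <- hit
      rows[[length(rows) + 1]] <- sub
    }
  }
  do.call(rbind, rows)
}

flat <- flatten_subnations(nations)
nj <- flat[flat$nationCode == "US" & flat$subnationCode == "NJ", ]
print(nj)
```

Keeping the `hit` index makes it visible when a state appears under more than one matching record, which is the situation where picking only `[[1]]` would silently drop information.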

Sys.info()
       sysname  "Darwin"
       release  "22.4.0"
       version  "Darwin Kernel Version 22.4.0: Mon Mar  6 20:59:58 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T6020"
      nodename  "Alexs-MacBook-Pro.local"
       machine  "arm64"
         login  "root"
          user  "alexkrohn"
effective_user  "alexkrohn"

ChristopherTracey commented 5 months ago

Interesting. We tend to use this package internally in a similar fashion to your example. I tested your example and consistently got returns of around 0.3 secs/species. It's interesting that it's performing significantly slower for you.

This package/function was built for single/few species queries. I'll take a look for an alternative solution.

alexkrohn commented 5 months ago

Interesting! I wonder why the lag time is so high for me.

I've been getting it to work by:

1) Parallelizing the calls to ns_search_spp to do many at once.

2) Grouping the df by species and state so that ns_search_spp is called only for unique combinations.

I'm very curious if there is a better way.
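
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code actually used in this thread: `fake_status` is a hypothetical stand-in for a real lookup such as `get_status()` so that the example runs offline, and `lookup_unique` is an invented helper name. `parallel::mclapply` forks on Unix-alikes only; on Windows `mc.cores` has to stay at 1.

```r
library(parallel)

# Hypothetical stand-in for a real lookup (e.g. get_status()); returns a
# fake rank so the sketch runs without network access.
fake_status <- function(spp, state) paste0("rank-for-", spp, "-", state)

# Call `fetch` once per unique species/state pair, then join the answers
# back onto every detection row. Row order of the result is not guaranteed.
lookup_unique <- function(df, fetch, cores = 1L) {
  combos <- unique(df[, c("species", "state")])
  res <- mclapply(
    seq_len(nrow(combos)),
    function(i) fetch(combos$species[i], combos$state[i]),
    mc.cores = cores
  )
  # Keep the first value if a lookup returns several, NA if it returns none.
  combos$status <- vapply(
    res,
    function(x) if (length(x) > 0) as.character(x[1]) else NA_character_,
    character(1)
  )
  merge(df, combos, by = c("species", "state"), all.x = TRUE, sort = FALSE)
}

detections <- data.frame(
  species = c("Turdus migratorius", "Turdus migratorius", "Terrapene carolina"),
  state   = c("FL", "FL", "GA"),
  stringsAsFactors = FALSE
)

out <- lookup_unique(detections, fake_status)
print(out)
```

With thousands of detections but far fewer distinct species/state pairs, the deduplication alone may cut the number of requests substantially; raising `cores` then spreads the remaining calls across processes.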

Secondarily, I've noticed that the results from ns_search_spp sometimes differ from what is displayed on the NatureServe website.

For example:

nat.serv.result <- ns_search_spp(text = "Ameiurus natalis")

lapply(ns.result$results$nations, function(x){
        x %>%
          select(subnations) %>%
          do.call(bind_rows,.)
      }) %>%
        do.call(bind_rows, .) %>%

        # Finally, keep only the relevant state and pull the status
        filter(subnationCode == "FL")

 # subnationCode roundedSRank exotic native
# 1            FL           S4  FALSE   TRUE
# 2            FL           S3  FALSE   TRUE

However, looking on NatureServe, the map shows the Yellow Bullhead as not ranked. Any idea why that might be, and why there are two rankings? (Sorry for asking two questions in one issue!)

ChristopherTracey commented 5 months ago

I was going to suggest parallelization as a short-term solution -- I'm glad it's working out. We'll look at ways to increase the query speed, but honestly it will be sometime in July before we can get to that given current capacity.

On the Yellow Bullhead: I noticed there may be a typo in your code on line 3. Should lapply(ns.result$results$nations, function(x){ read as lapply(nat.serv.result$results$nations, function(x){ instead? Making that correction and running your code reports back the correct rank of SNR for FL.

alexkrohn commented 5 months ago

D'oh! That's correct and helpful. Thanks! I look forward to hearing if you figure out ways to speed things along.

ChristopherTracey commented 5 months ago

@alexkrohn one alternative way to query this faster is to use the ns_id("ELEMENT_GLOBAL.2.154701") function, as this links directly to the record instead of searching for matches to your species query. The main issue is that you need the EGT_UID (e.g. "ELEMENT_GLOBAL.2.154701"), which isn't readily published. However, you could go to NatureServe Explorer, do an advanced query for a particular taxonomic group such as vertebrates, export the xls file, and then do some text processing on the url field within the xls to get the EGT_UIDs. Then you just need to match those up to your species list. It would be a little upfront work, but the response time on ns_id() is really fast.
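
The text-processing and matching steps might look something like the sketch below. It rests on several assumptions: that the exported url field contains the UID as a substring of the form ELEMENT_GLOBAL.<number>.<number>, and that the export has columns like `scientific_name` and `url` (both names are hypothetical). The rows and UIDs in the mock `export` are placeholders, not real records, and `extract_uid` is an invented helper.

```r
# Placeholder rows standing in for the exported spreadsheet. Column names,
# URL shape, and UID values are assumptions made for illustration only.
export <- data.frame(
  scientific_name = c("Turdus migratorius", "Terrapene carolina"),
  url = c(
    "https://explorer.natureserve.org/Taxon/ELEMENT_GLOBAL.2.000001/Turdus_migratorius",
    "https://explorer.natureserve.org/Taxon/ELEMENT_GLOBAL.2.000002/Terrapene_carolina"
  ),
  stringsAsFactors = FALSE
)

# Pull the first "ELEMENT_GLOBAL.<n>.<n>" substring out of each URL;
# URLs without one (or missing URLs) yield NA.
extract_uid <- function(url) {
  m <- regexpr("ELEMENT_GLOBAL\\.[0-9]+\\.[0-9]+", url)
  found <- !is.na(m) & m > 0
  uid <- rep(NA_character_, length(url))
  uid[found] <- regmatches(url, m)
  uid
}

export$uid <- extract_uid(export$url)

# Match the UIDs to a species list; species absent from the export get NA.
my_species <- c("Terrapene carolina", "Thamnophis sirtalis")
uids <- export$uid[match(my_species, export$scientific_name)]
names(uids) <- my_species
print(uids)

# With natserv loaded and network access, each non-missing UID could then
# be passed to ns_id(), e.g. ns_id(uids[["Terrapene carolina"]]).
```

Matching on exact scientific names will miss synonyms and spelling variants, so any NA left in `uids` is worth checking by hand before falling back to a text search.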

ropensci / natserv

Best practices for querying multiple taxa #31