ropensci / taxizedb

Tools for Working with Taxonomic SQL Databases

Is there a `parents` function? #51

Closed Midnighter closed 3 years ago

Midnighter commented 4 years ago

I need to access the name of a parent at a specific rank, and I was wondering if there is a built-in or otherwise better way to achieve this. What I'm currently doing is the following:

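library(taxizedb)  # provides classification(); assumes the NCBI database is already downloaded
library(magrittr)  # provides the %>% pipe

tax_ids <- c(186803, 541000, 216572, 186804, 31979, 186806)
taxonomy <- classification(tax_ids)

# Return the name at `rank_name` from one classification data.frame, or NA if absent.
get_rank_name <- function(hierarchy, rank_name) {
  # Unresolved IDs may come back as an atomic value (e.g. NA) rather than a data.frame
  if (is.atomic(hierarchy)) {
    return(NA_character_)
  }
  result <- hierarchy %>%
    dplyr::filter(rank == rank_name) %>%
    dplyr::pull(name) %>%
    unique()
  if (length(result) == 0) {
    return(NA_character_)
  }
  return(result)
}

rank_names <- purrr::map_chr(taxonomy, get_rank_name, rank_name = "order")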

         186803          541000          216572          186804           31979          186806 
"Clostridiales" "Clostridiales" "Clostridiales" "Clostridiales" "Clostridiales" "Clostridiales"

The above kinda works, but it'd be neat to have this built into taxize/taxizedb. Also, if the requested rank does not exist exactly, it'd be nice to get an in-between rank, for example superorder or suborder when requesting order.

sckott commented 4 years ago

thanks for the issue.

in taxize there's tax_name(), which isn't specifically for parents, but can be used for that, e.g.,

tax_name(sci = "Helianthus annuus", get = "family", db = "ncbi")
#>     db             query     family
#> 1 ncbi Helianthus annuus Asteraceae

but there's nothing like taxize::tax_name() in taxizedb.

Would an equivalent of taxize::tax_name() for taxizedb work?

Midnighter commented 4 years ago

That looks pretty good. Two questions about it:

  1. Could it also accept taxonomic identifiers instead of the name?
  2. How could a missing rank/level be handled?

sckott commented 4 years ago

In the taxize version of tax_name there's no handling of missing data at the user-supplied rank. We could try to add that. I guess we'd need a parameter to toggle whether the user wants the next rank lower or higher than the target rank.
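
The lower-or-higher toggle could look roughly like the following. This is a minimal base R sketch, not taxizedb's implementation: `name_at_rank`, its `fallback` argument, and the `rank_order` vector are hypothetical names, the rank list is an illustrative subset, and the hierarchy is made-up data.

```r
# Rank order from highest to lowest (illustrative subset, an assumption).
rank_order <- c("class", "subclass", "superorder", "order", "suborder", "family")

# Hypothetical helper: return the name at `target`, or at the nearest rank
# below ("lower") or above ("higher") it when `target` itself is absent.
name_at_rank <- function(hierarchy, target, fallback = c("lower", "higher")) {
  fallback <- match.arg(fallback)
  pos <- match(target, rank_order)
  if (is.na(pos)) stop("unknown rank: ", target)
  # Candidate ranks: the exact target first, then moving away from it.
  candidates <- if (fallback == "lower") {
    rank_order[pos:length(rank_order)]
  } else {
    rank_order[pos:1]
  }
  for (r in candidates) {
    hit <- hierarchy$name[hierarchy$rank == r]
    if (length(hit) > 0) return(hit[[1]])
  }
  NA_character_
}

# Made-up hierarchy with no "order" row.
h <- data.frame(
  name = c("ClassA", "SuperorderB", "SuborderC", "FamilyD"),
  rank = c("class", "superorder", "suborder", "family"),
  stringsAsFactors = FALSE
)
name_at_rank(h, "order", fallback = "lower")   # falls back to the suborder row
name_at_rank(h, "order", fallback = "higher")  # falls back to the superorder row
```

An exact match always wins because the target rank is the first candidate in either direction; the toggle only matters when that rank is missing.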

It should be no problem to use an id instead of a name, though the taxize version currently only handles names.

sckott commented 4 years ago

added a function, taxa_at. Install it with remotes::install_github("ropensci/taxizedb@parent") and let me know what you think

Midnighter commented 4 years ago

Thank you so much for quickly adding such a function. I'm adding some feedback here:

I got a number of messages that looked like:

No results found. Consider checking the spelling or alternative classification

it'd be helpful to know for which input ID or name this was the case.

I ran taxa_at on a vector of ~6900 NCBI IDs and compared it to classification. I know they don't quite do the same thing, but I thought the work done must be somewhat similar. taxa_at took ~140 seconds, whereas classification needed only ~30 seconds.

Overall taxa_at does what I asked for and I really like that you added the missing argument. Thank you very much.

sckott commented 3 years ago

Those messages should correspond to empty (zero-row) data.frames in the output list.
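
One way to recover which inputs came back empty is sketched below in base R. It assumes the output list is named by input ID, as the classification output earlier in this thread is; `res` is a made-up stand-in for a real result, and `is_empty` and `empty_ids` are illustrative names.

```r
# Stand-in for a result list keyed by input ID; the contents are made up.
res <- list(
  "111" = data.frame(name = "TaxonA", rank = "order", id = "1",
                     stringsAsFactors = FALSE),
  "222" = data.frame(name = character(0), rank = character(0),
                     id = character(0), stringsAsFactors = FALSE),
  "333" = NA
)

# Treat an entry as empty if it is not a data.frame or has zero rows.
is_empty <- vapply(
  res,
  function(x) !is.data.frame(x) || nrow(x) == 0,
  logical(1)
)
empty_ids <- names(res)[is_empty]
empty_ids
```

If the list turns out not to be named, `which(is_empty)` gives the positions instead, which can be mapped back to the input vector.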

sckott commented 3 years ago

@Midnighter Reinstall from the parent branch - remember to restart R. Try again, it should be faster.

Midnighter commented 3 years ago

Now taxa_at is three seconds faster than classification on my system. Awesome! :smiley: Thank you for such a quick implementation.

Midnighter commented 3 years ago

Haha, I just saw that your code uses classification internally so I guess 3 s are still within variance. I only measured once.

sckott commented 3 years ago

yes, it uses classification internally. glad it's useful