ropensci / taxize

A taxonomic toolbelt for R
https://docs.ropensci.org/taxize
Other
270 stars 61 forks source link

taxize::downstream() result is incomplete? #848

Closed oharac closed 4 years ago

oharac commented 4 years ago

A new bug using taxize with the WoRMS database. I noticed that many of the species for which I have traits data (independent of WoRMS) are not being found in the dataset I've created using downstream(). It appears that the downstream() result is incomplete, missing species names for which classification() returns valid results. EDIT: this seems to happen with children() as well.

Reproducible example:

library(taxize)

### Get upstream classification for octopus filosus
test_spp <- 'octopus filosus'
classification(test_spp, db = 'worms')[[1]]
# # A tibble: 11 x 3
#    name            rank            id
#    <chr>           <chr>        <int>
#  1 Animalia        Kingdom          2
#  2 Mollusca        Phylum          51
#  3 Cephalopoda     Class        11707
#  4 Coleoidea       Subclass     11709
#  5 Octopodiformes  Superorder  325343
#  6 Octopoda        Order        11718
#  7 Incirrata       Suborder     11733
#  8 Octopodoidea    Superfamily  14672
#  9 Octopodidae     Family       11782
# 10 Octopus         Genus       138268
# 11 Octopus filosus Species     341972

### get downstream using the genus ID returned for octopus filosus
x <- downstream(138268, db = 'worms', downto = 'species')
x <- x[[1]]

### Check whether the species name is in the downstream results from the genus name/id
test_spp %in% tolower(x$name)
# [1] FALSE

A quick inspection of the downstream() results (x) shows an alphabetical list of octopus species that abruptly ends after Octopus carolinensis and a handful of Octopus followed by a parenthetical (alternate genus). Not sure if that truncation is relevant... other examples didn't seem to truncate like that.

This is not just at the lowest level. Balaena mysticetus is class Mammalia (id 1837) and order Cetartiodactyla, but downstream(1837, db = 'worms', downto = 'order') returns only Didelphimorphia. Which makes me wonder whether there are marine opposums. But more problematic is that it seems like downstream() depends on successful results at each intermediate step downward, so downstream('mammalia', db = 'worms', downto = [anything]) will return an empty set.

test_spp2 <- 'balaena mysticetus'
classification(test_spp2, db = 'worms')[[1]]
# # A tibble: 14 x 3
#    name               rank            id
#    <chr>              <chr>        <int>
#  1 Animalia           Kingdom          2
#  2 Chordata           Phylum        1821
#  3 Vertebrata         Subphylum   146419
#  4 Gnathostomata      Superclass    1828
#  5 Tetrapoda          Superclass    1831
#  6 Mammalia           Class         1837
#  7 Theria             Subclass    380416
#  8 Cetartiodactyla    Order       370511
#  9 Cetancodonta       Suborder    370545
# 10 Cetacea            Infraorder    2688
# 11 Mysticeti          Superfamily 148724
# 12 Balaenidae         Family      136978
# 13 Balaena            Genus       137012
# 14 Balaena mysticetus Species     137086

### get downstream using the class ID returned
downstream(1837, db = 'worms', downto = 'order')[[1]]
#        id            name  rank
# 1 1451682 Didelphimorphia order

downstream('mammalia', db = 'worms', downto = 'family')[[1]]
# data frame with 0 columns and 0 rows

children('mammalia', db = 'worms')[[1]]
# A tibble: 1 x 3
#   childtaxa_id childtaxa_name  childtaxa_rank
#          <int> <chr>           <chr>         
# 1      1451682 Didelphimorphia Order   
Session Info ```r R version 3.6.3 (2020-02-29) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.5 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] taxize_0.9.98.92 here_0.1 RColorBrewer_1.1-2 forcats_0.5.0 stringr_1.4.0 [6] dplyr_1.0.2 purrr_0.3.4 readr_1.3.1 tidyr_1.1.0 tibble_3.0.3 [11] ggplot2_3.3.2 tidyverse_1.3.0 loaded via a namespace (and not attached): [1] httr_1.4.2 bit64_4.0.5 jsonlite_1.7.1 foreach_1.5.0 bold_1.1.0 [6] modelr_0.1.8 assertthat_0.2.1 triebeard_0.3.0 urltools_1.7.3 worrms_0.4.2 [11] blob_1.2.1 cellranger_1.1.0 yaml_2.2.1 pillar_1.4.6 RSQLite_2.2.0 [16] backports_1.1.8 lattice_0.20-41 glue_1.4.2 uuid_0.1-4 digest_0.6.25 [21] rvest_0.3.5 colorspace_1.4-1 htmltools_0.5.0 plyr_1.8.6 pkgconfig_2.0.3 [26] httpcode_0.3.0 broom_0.5.6 haven_2.3.1 scales_1.1.1 generics_0.0.2 [31] DT_0.13 ellipsis_0.3.1 withr_2.2.0 cli_2.0.2 magrittr_1.5 [36] crayon_1.3.4 readxl_1.3.1 memoise_1.1.0 evaluate_0.14 fs_1.4.1 [41] fansi_0.4.1 nlme_3.1-148 xml2_1.3.2 tools_3.6.3 data.table_1.13.0 [46] hms_0.5.3 lifecycle_0.2.0 munsell_0.5.0 reprex_0.3.0 compiler_3.6.3 [51] rlang_0.4.7 taxizedb_0.2.2.93 grid_3.6.3 conditionz_0.1.0 iterators_1.0.12 [56] rstudioapi_0.11 htmlwidgets_1.5.1 rappdirs_0.3.1 crosstalk_1.1.0.1 rmarkdown_2.3 [61] gtable_0.3.0 codetools_0.2-16 DBI_1.1.0 reshape_0.8.8 curl_4.3 [66] R6_2.4.1 zoo_1.8-8 lubridate_1.7.9 knitr_1.28 utf8_1.1.4 [71] bit_4.0.4 rprojroot_1.3-2 hoardr_0.5.2 ape_5.4-1 stringi_1.5.3 [76] parallel_3.6.3 crul_1.0.0 Rcpp_1.0.5 vctrs_0.3.4 dbplyr_1.4.4 [81] tidyselect_1.1.0 xfun_0.14 ```
oharac commented 4 years ago

A little more digging using the WoRMS API directly reveals that the above issue of dropped taxa is actually two different problems.

Problem 1: Truncated list of octopus spp:

The truncated list seems to be because the children API endpoint returns a max of 50 records. But trying the offset option in taxize::downstream doesn't seem to be working (nor does marine_only, see below) - perhaps I'm messing up the syntax?

> downstream(138268, db = 'worms', downto = 'species')
# $`138268`
#         id                              name    rank
# 1   341938                  Octopus abaculus species
# 2   341939                 Octopus aculeatus species
# 3   341940                    Octopus adamsi species

> downstream(138268, db = 'worms', downto = 'species', offset = 50)
# $`138268`
#         id                              name    rank
# 1   341938                  Octopus abaculus species
# 2   341939                 Octopus aculeatus species
# 3   341940                    Octopus adamsi species
# ... 50 more rows deleted for brevity

Also, oddly, children with this same ID shows no data:

> children(138268, db = 'worms')
Error: (204) No Content - 138268

It would be great if downstream (and children) in taxize would automatically iterate over the offsets to grab all records, perhaps as the default behavior. Failing that, I could use a while loop in my code but only if the offset option is working.

Problem 2: Dropping taxa downstream

Here's the tree from WoRMS from phylum down to order for the mammals:

Screen Shot 2020-09-30 at 11 53 08 AM

Looks like the children query on mammalia is missing the subclass Theria, as well as the oddball record of Rodentia at this level.

children('mammalia', db = 'worms')
# $mammalia
# # A tibble: 1 x 3
#  childtaxa_id childtaxa_name  childtaxa_rank
#          <int> <chr>           <chr>         
# 1      1451682 Didelphimorphia Order         

And children on Theria (id 380416) returns only two of the four orders on the tree.

children(380416, db = 'worms')
# $`380416`
# # A tibble: 2 x 3
#   childtaxa_id childtaxa_name  childtaxa_rank
#          <int> <chr>           <chr>         
# 1         2687 Carnivora       Order         
# 2       370511 Cetartiodactyla Order 

These same results are occurring from the WoRMS API directly: e.g. for mammals, id 1837, marine only = TRUE (default)

When setting marine only = FALSE, you get the full tree, but also a lot of non-marine taxa. This seems to be something that WoRMS needs to fix on its end, for the long term.

In the short term, I suppose I could get all the taxa and then somehow figure out which ones to drop later on. However, the marine_only option in taxize::children and taxize::downstream again seems not to be working for me:

children(1837, db = 'worms')
# $`1837`
# # A tibble: 1 x 3
#   childtaxa_id childtaxa_name  childtaxa_rank
#          <int> <chr>           <chr>         
# 1      1451682 Didelphimorphia Order   

children(1837, db = 'worms', marine_only = FALSE)
# $`1837`
# # A tibble: 1 x 3
#   childtaxa_id childtaxa_name  childtaxa_rank
#          <int> <chr>           <chr>         
# 1      1451682 Didelphimorphia Order  

downstream(1837, db = 'worms', downto = 'class')
# $`1837`
# data frame with 0 columns and 0 rows

downstream(1837, db = 'worms', downto = 'class', marine_only = FALSE)
# $`1837`
# data frame with 0 columns and 0 rows
oharac commented 4 years ago

Update: the dropped marine species appear to be due to incorrect environment flags (isMarine, isBrackish, isFreshwater, isTerrestrial) in the WoRMS database for subclass Theria, class Sirenia, and class Rodentia (I believe also for class Didelphimorphia, those pesky marine possums). I emailed info@marinespecies.org to let them know - here is a brief response:

We have been aware of this for several months. And since many people have reported the same issue recently, we are started working on this, this week. We need to do some proper check before we implement this, but hopefully we can put the fix online somewhere next week.

sckott commented 4 years ago

Thanks for the report @oharac

Thanks for reaching out to WORMS folks!

I added ... in two places so that one can pass through marine_only parameter now. The offset parameter is a bit complicated. In children() we use a while loop to get all results, so the user shouldn't need to adjust offset.

In downstream() with db=worms, we then use worms_downstream, which uses the parameter name start instead of offset. So with downstream() you'd need to use start i think. Unfortunately, worms pagination is quite painful since they don't provide total count of things found so you just have to keep requesting data with an offset until you don't get anymore

oharac commented 4 years ago

Thanks @sckott - until the WoRMS data is fixed, I'll try using the marine_only parameter to get all the mammals and then cut out non-marine branches manually.

I'll also give the start parameter a shot to build a while loop for the downstream calls... will that ensure intermediate steps also don't get truncated? e.g. if I call downstream(familyname, downto = 'species', db = 'worms') on a family that has 75 genera, should it work properly?

sckott commented 4 years ago

intermediate parameter only governs whether the intermediate taxa before getting to the target rank in downto are returned to you or not. The parameter doesn't affect the main work of getting your target taxa.

Just committed, i changed internals of children and downstream when db="worms" to use a new internal fxn worms_children_all to get all results for any one query.

try e.g., downstream(138268, db = "worms", downto = "species") or children(138268, db = "worms"),

note that with the children(138268, db = "worms") example, it gets 300 taxa, whereas the page says it has 306 children, not sure why its missing 6

oharac commented 4 years ago

@sckott Your fixes seem to be working great, thanks so much. However, now I'm seeing the same isMarine = null issue on the WoRMS side in Elasmobranchii as well, which (with default of marine_only = TRUE) is going to drop all but 7 species (out of 1200+) trying to go downstream from the class level (children() or downstream() in taxize), and finding this in such important taxa I can only imagine the same issue is occurring elsewhere in less charismatic taxa.

I emailed these new problems (class Elasmobranchii and some subclasses/infraclasses/orders below it) to the WoRMS folks as well, but seems like a pretty serious systemic issue with their database, so I wanted to post here to warn people to be very wary of these functions that access the WoRMS API until they can figure it out.

sckott commented 4 years ago

Thanks again for letting them know and posting here for others to see.

sckott commented 4 years ago

@oharac did you hear back on this yet?

oharac commented 4 years ago

Nothing yet, and still encountering the same error. I'll ping them again to see if there is an estimated timeframe.

sckott commented 4 years ago

okay, thanks

oharac commented 4 years ago

Received an email from someone on the WoRMS data management team saying they have executed the fix for this issue this morning. A quick check via their API page showed a promising result, though I have yet to do a more thorough investigation or try interfacing via the taxize package. I'll get to that this evening and report back, hopefully with all positive news!

oharac commented 4 years ago

The new WoRMS database fix seems to be playing nice with taxize, with spot checks on previously problematic taxa returning non-empty results. Now instead of empty results, I'm getting too many results - e.g., downstream('elasmobranchii', db = 'worms', downto = 'species', marine_only = TRUE)[[1]] returns 3300 elasmobranch species, though the true number should be closer to ~1250 - I assume many are synonyms or unaccepted names, etc, but need to dig further.

oharac commented 4 years ago

Confirmed a more reasonable number of valid elasmobranch spp (1240, instead of 3379) by comparing the status of each returned ID, looking only at those spp whose status is "accepted" (other possible statuses I see: "unaccepted", "alternate representation", "interim unpublished"). Here's a quick function to determine status (for any rank), accessing the WoRMS API AphiaRecordsByAphiaIDs endpoint:

check_status <- function(check_ids) {
  check_ids <- check_ids[!is.na(check_ids)]
  n_chunks <- ceiling(length(check_ids) / 50)
  records_list <- vector('list', length = n_chunks)
  for(i in 1:n_chunks) {
    indices <- ((i-1)*50 + 1):min(i*50, length(check_ids))
    ids_chunk <- check_ids[indices]
    ids_param <- paste0('aphiaids[]=', ids_chunk, collapse = '&')
    records_url <- paste0('https://www.marinespecies.org/rest/AphiaRecordsByAphiaIDs?', ids_param)
    records_list[[i]] <- jsonlite::fromJSON(records_url) %>%
      select(id = AphiaID, sciname = scientificname, status) %>%
      distinct()
  }
  records_df <- records_list %>% 
    bind_rows()
  return(records_df)
}

### example:
x <- taxize::downstream('elasmobranchii', db = 'worms', downto = 'species')[[1]]
y <- check_status(x$id) %>%
  filter(status == 'accepted')
z <- x %>%
  filter(id %in% y$id)
sckott commented 4 years ago

Thanks for your work on this! Lots of people will appreciate this work i'm sure.

So are there any outstanding issues?

oharac commented 4 years ago

Thanks for your work on this as well! I am not seeing any particular issues remaining on this related to the WoRMS API or the taxize functionality. I have found a few taxa that seem commonly used but do not show up in WoRMS - e.g. there is no record for family Aetobatidae, instead genus Aetobatus is found in Myliobatidae. I would expect that to perhaps be "unaccepted" but not entirely absent. I imagine all taxonomic databases will have a few issues like that, however, and I've been able to work around that problem easily.

sckott commented 4 years ago

Okay, great. closing