Closed oharac closed 4 years ago
A little more digging using the WoRMS API directly reveals that the above issue of dropped taxa is actually two different problems.
The truncated list seems to be because the children API endpoint returns a max of 50 records. But trying the offset
option in taxize::downstream doesn't seem to be working (nor does marine_only
, see below) - perhaps I'm messing up the syntax?
> downstream(138268, db = 'worms', downto = 'species')
# $`138268`
# id name rank
# 1 341938 Octopus abaculus species
# 2 341939 Octopus aculeatus species
# 3 341940 Octopus adamsi species
> downstream(138268, db = 'worms', downto = 'species', offset = 50)
# $`138268`
# id name rank
# 1 341938 Octopus abaculus species
# 2 341939 Octopus aculeatus species
# 3 341940 Octopus adamsi species
# ... 50 more rows deleted for brevity
Also, oddly, children
with this same ID shows no data:
> children(138268, db = 'worms')
Error: (204) No Content - 138268
It would be great if downstream
(and children
) in taxize
would automatically iterate over the offsets to grab all records, perhaps as the default behavior. Failing that, I could use a while
loop in my code but only if the offset
option is working.
Here's the tree from WoRMS from phylum down to order for the mammals:
Looks like the children
query on mammalia is missing the subclass Theria, as well as the oddball record of Rodentia at this level.
children('mammalia', db = 'worms')
# $mammalia
# # A tibble: 1 x 3
# childtaxa_id childtaxa_name childtaxa_rank
# <int> <chr> <chr>
# 1 1451682 Didelphimorphia Order
And children
on Theria (id 380416) returns only two of the four orders on the tree.
children(380416, db = 'worms')
# $`380416`
# # A tibble: 2 x 3
# childtaxa_id childtaxa_name childtaxa_rank
# <int> <chr> <chr>
# 1 2687 Carnivora Order
# 2 370511 Cetartiodactyla Order
These same results are occurring from the WoRMS API directly: e.g. for mammals, id 1837, marine only = TRUE (default)
When setting marine only = FALSE, you get the full tree, but also a lot of non-marine taxa. This seems to be something that WoRMS needs to fix on its end, for the long term.
In the short term, I suppose I could get all the taxa and then somehow figure out which ones to drop later on. However, the marine_only
option in taxize::children
and taxize::downstream
again seems not to be working for me:
children(1837, db = 'worms')
# $`1837`
# # A tibble: 1 x 3
# childtaxa_id childtaxa_name childtaxa_rank
# <int> <chr> <chr>
# 1 1451682 Didelphimorphia Order
children(1837, db = 'worms', marine_only = FALSE)
# $`1837`
# # A tibble: 1 x 3
# childtaxa_id childtaxa_name childtaxa_rank
# <int> <chr> <chr>
# 1 1451682 Didelphimorphia Order
downstream(1837, db = 'worms', downto = 'class')
# $`1837`
# data frame with 0 columns and 0 rows
downstream(1837, db = 'worms', downto = 'class', marine_only = FALSE)
# $`1837`
# data frame with 0 columns and 0 rows
Update: the dropped marine species appear to be due to incorrect environment flags (isMarine
, isBrackish
, isFreshwater
, isTerrestrial
) in the WoRMS database for subclass Theria, class Sirenia, and class Rodentia (I believe also for class Didelphimorphia, those pesky marine possums). I emailed info@marinespecies.org to let them know - here is a brief response:
We have been aware of this for several months. And since many people have reported the same issue recently, we are started working on this, this week. We need to do some proper check before we implement this, but hopefully we can put the fix online somewhere next week.
Thanks for the report @oharac
Thanks for reaching out to WORMS folks!
I added ...
in two places so that one can pass through marine_only
parameter now. The offset parameter is a bit complicated. In children()
we use a while loop to get all results, so the user shouldn't need to adjust offset.
In downstream()
with db=worms, we then use worms_downstream
, which uses the parameter name start
instead of offset
. So with downstream()
you'd need to use start
i think. Unfortunately, worms pagination is quite painful since they don't provide total count of things found so you just have to keep requesting data with an offset until you don't get anymore
Thanks @sckott - until the WoRMS data is fixed, I'll try using the marine_only
parameter to get all the mammals and then cut out non-marine branches manually.
I'll also give the start
parameter a shot to build a while
loop for the downstream calls... will that ensure intermediate steps also don't get truncated? e.g. if I call downstream(familyname, downto = 'species', db = 'worms')
on a family that has 75 genera, should it work properly?
intermediate
parameter only governs whether the intermediate taxa before getting to the target rank in downto
are returned to you or not. The parameter doesn't affect the main work of getting your target taxa.
Just committed, i changed internals of children
and downstream
when db="worms"
to use a new internal fxn worms_children_all
to get all results for any one query.
try e.g., downstream(138268, db = "worms", downto = "species")
or children(138268, db = "worms")
,
note that with the children(138268, db = "worms")
example, it gets 300 taxa, whereas the page says it has 306 children, not sure why its missing 6
@sckott Your fixes seem to be working great, thanks so much. However, now I'm seeing the same isMarine = null
issue on the WoRMS side in Elasmobranchii as well, which (with default of marine_only = TRUE
) is going to drop all but 7 species (out of 1200+) trying to go downstream from the class level (children()
or downstream()
in taxize
), and finding this in such important taxa I can only imagine the same issue is occurring elsewhere in less charismatic taxa.
I emailed these new problems (class Elasmobranchii and some subclasses/infraclasses/orders below it) to the WoRMS folks as well, but seems like a pretty serious systemic issue with their database, so I wanted to post here to warn people to be very wary of these functions that access the WoRMS API until they can figure it out.
Thanks again for letting them know and posting here for others to see.
@oharac did you hear back on this yet?
Nothing yet, and still encountering the same error. I'll ping them again to see if there is an estimated timeframe.
okay, thanks
Received an email from someone on the WoRMS data management team saying they have executed the fix for this issue this morning. A quick check via their API page showed a promising result, though I have yet to do a more thorough investigation or try interfacing via the taxize
package. I'll get to that this evening and report back, hopefully with all positive news!
The new WoRMS database fix seems to be playing nice with taxize
, with spot checks on previously problematic taxa returning non-empty results. Now instead of empty results, I'm getting too many results - e.g., downstream('elasmobranchii', db = 'worms', downto = 'species', marine_only = TRUE)[[1]]
returns 3300 elasmobranch species, though the true number should be closer to ~1250 - I assume many are synonyms or unaccepted names, etc, but need to dig further.
Confirmed a more reasonable number of valid elasmobranch spp (1240, instead of 3379) by comparing the status of each returned ID, looking only at those spp whose status is "accepted" (other possible statuses I see: "unaccepted", "alternate representation", "interim unpublished"). Here's a quick function to determine status (for any rank), accessing the WoRMS API AphiaRecordsByAphiaIDs
endpoint:
check_status <- function(check_ids) {
check_ids <- check_ids[!is.na(check_ids)]
n_chunks <- ceiling(length(check_ids) / 50)
records_list <- vector('list', length = n_chunks)
for(i in 1:n_chunks) {
indices <- ((i-1)*50 + 1):min(i*50, length(check_ids))
ids_chunk <- check_ids[indices]
ids_param <- paste0('aphiaids[]=', ids_chunk, collapse = '&')
records_url <- paste0('https://www.marinespecies.org/rest/AphiaRecordsByAphiaIDs?', ids_param)
records_list[[i]] <- jsonlite::fromJSON(records_url) %>%
select(id = AphiaID, sciname = scientificname, status) %>%
distinct()
}
records_df <- records_list %>%
bind_rows()
return(records_df)
}
### example:
x <- taxize::downstream('elasmobranchii', db = 'worms', downto = 'species')[[1]]
y <- check_status(x$id) %>%
filter(status == 'accepted')
z <- x %>%
filter(id %in% y$id)
Thanks for your work on this! Lots of people will appreciate this work i'm sure.
So are there any outstanding issues?
Thanks for your work on this as well! I am not seeing any particular issues remaining on this related to the WoRMS API or the taxize
functionality. I have found a few taxa that seem commonly used but do not show up in WoRMS - e.g. there is no record for family Aetobatidae, instead genus Aetobatus is found in Myliobatidae. I would expect that to perhaps be "unaccepted" but not entirely absent. I imagine all taxonomic databases will have a few issues like that, however, and I've been able to work around that problem easily.
Okay, great. closing
A new bug using
taxize
with the WoRMS database. I noticed that many of the species for which I have traits data (independent of WoRMS) are not being found in the dataset I've created usingdownstream()
. It appears that thedownstream()
result is incomplete, missing species names for whichclassification()
returns valid results. EDIT: this seems to happen withchildren()
as well.Reproducible example:
A quick inspection of the
downstream()
results (x
) shows an alphabetical list of octopus species that abruptly ends after Octopus carolinensis and a handful of Octopus followed by a parenthetical (alternate genus). Not sure if that truncation is relevant... other examples didn't seem to truncate like that.This is not just at the lowest level. Balaena mysticetus is class Mammalia (id 1837) and order Cetartiodactyla, but
downstream(1837, db = 'worms', downto = 'order')
returns only Didelphimorphia. Which makes me wonder whether there are marine opposums. But more problematic is that it seems likedownstream()
depends on successful results at each intermediate step downward, sodownstream('mammalia', db = 'worms', downto = [anything])
will return an empty set.Session Info
```r R version 3.6.3 (2020-02-29) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.5 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] taxize_0.9.98.92 here_0.1 RColorBrewer_1.1-2 forcats_0.5.0 stringr_1.4.0 [6] dplyr_1.0.2 purrr_0.3.4 readr_1.3.1 tidyr_1.1.0 tibble_3.0.3 [11] ggplot2_3.3.2 tidyverse_1.3.0 loaded via a namespace (and not attached): [1] httr_1.4.2 bit64_4.0.5 jsonlite_1.7.1 foreach_1.5.0 bold_1.1.0 [6] modelr_0.1.8 assertthat_0.2.1 triebeard_0.3.0 urltools_1.7.3 worrms_0.4.2 [11] blob_1.2.1 cellranger_1.1.0 yaml_2.2.1 pillar_1.4.6 RSQLite_2.2.0 [16] backports_1.1.8 lattice_0.20-41 glue_1.4.2 uuid_0.1-4 digest_0.6.25 [21] rvest_0.3.5 colorspace_1.4-1 htmltools_0.5.0 plyr_1.8.6 pkgconfig_2.0.3 [26] httpcode_0.3.0 broom_0.5.6 haven_2.3.1 scales_1.1.1 generics_0.0.2 [31] DT_0.13 ellipsis_0.3.1 withr_2.2.0 cli_2.0.2 magrittr_1.5 [36] crayon_1.3.4 readxl_1.3.1 memoise_1.1.0 evaluate_0.14 fs_1.4.1 [41] fansi_0.4.1 nlme_3.1-148 xml2_1.3.2 tools_3.6.3 data.table_1.13.0 [46] hms_0.5.3 lifecycle_0.2.0 munsell_0.5.0 reprex_0.3.0 compiler_3.6.3 [51] rlang_0.4.7 taxizedb_0.2.2.93 grid_3.6.3 conditionz_0.1.0 iterators_1.0.12 [56] rstudioapi_0.11 htmlwidgets_1.5.1 rappdirs_0.3.1 crosstalk_1.1.0.1 rmarkdown_2.3 [61] gtable_0.3.0 codetools_0.2-16 DBI_1.1.0 reshape_0.8.8 curl_4.3 [66] R6_2.4.1 zoo_1.8-8 lubridate_1.7.9 knitr_1.28 utf8_1.1.4 [71] bit_4.0.4 rprojroot_1.3-2 hoardr_0.5.2 ape_5.4-1 stringi_1.5.3 [76] parallel_3.6.3 crul_1.0.0 Rcpp_1.0.5 vctrs_0.3.4 dbplyr_1.4.4 [81] tidyselect_1.1.0 xfun_0.14 ```