ropensci / rgbif

Interface to the Global Biodiversity Information Facility API
https://docs.ropensci.org/rgbif

Taxa present in backbone not returned with `name_backbone` search #533

Closed mikeroswell closed 2 years ago

mikeroswell commented 2 years ago

Though I suspect this issue is upstream of rgbif, I'm posting this issue here since you all seem very savvy with this stuff. I was checking names in a table using rgbif and got some strange non-matches for taxa that appear to be in the backbone, e.g.

https://www.gbif.org/species/5329010
https://www.gbif.org/species/5372513
This one should be a synonym: https://www.gbif.org/species/8687391

3434 other plant names entered as binomials matched as either "ACCEPTED" or "SYNONYM"

Thanks for the work you do to maintain this very helpful package and interface with the GBIF API!

```r
rgbif::name_backbone(name = "Sagittaria australis") # matchType = NONE
# A tibble: 1 × 4
#  confidence matchType synonym verbatim_name       
#*      <int> <chr>     <lgl>   <chr>               
#1        100 NONE      FALSE   Sagittaria australis

rgbif::name_backbone(name = "Salix occidentalis") # Good
# A tibble: 1 × 4
#  confidence matchType synonym verbatim_name     
#*      <int> <chr>     <lgl>   <chr>             
#1        100 NONE      FALSE   Salix occidentalis
```

Session Info

```r
sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.9 purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.7 ggplot2_3.3.6
[9] tidyverse_1.3.1

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.2 haven_2.5.0 colorspace_2.0-3 vctrs_0.4.1 generics_0.1.2 utf8_1.2.2 rlang_1.0.2
 [8] pillar_1.7.0 httpcode_0.3.0 glue_1.6.2 withr_2.5.0 DBI_1.1.2 dbplyr_2.2.0 modelr_0.1.8
[15] readxl_1.4.0 uuid_1.1-0 lifecycle_1.0.1 plyr_1.8.7 munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0
[22] rvest_1.0.2 tzdb_0.3.0 curl_4.3.2 fansi_1.0.3 triebeard_0.3.0 urltools_1.7.3 broom_0.8.0
[29] Rcpp_1.0.8.3 scales_1.2.0 backports_1.4.1 oai_0.3.2 jsonlite_1.8.0 fs_1.5.2 hms_1.1.1
[36] stringi_1.7.6 grid_4.2.0 cli_3.3.0 tools_4.2.0 magrittr_2.0.3 lazyeval_0.2.2 crul_1.2.0
[43] crayon_1.5.1 whisker_0.4 pkgconfig_2.0.3 ellipsis_0.3.2 data.table_1.14.2 xml2_1.3.3 reprex_2.0.1
[50] lubridate_1.8.0 rstudioapi_0.13 assertthat_0.2.1 httr_1.4.3 rgbif_3.7.2 R6_2.5.1 conditionz_0.1.0
[57] compiler_4.2.0
```

jhnwllr commented 2 years ago

Hello @mikeroswell,

This is a known annoyance about GBIF name matching. I have called it the "too many choices" problem. https://data-blog.gbif.org/post/2022-03-24-reasons-why-names-don-t-match-to-the-gbif-backbone/

When there are two or more variants of a canonical name (e.g. "Sagittaria australis"), the GBIF name matcher will return matchType=NONE, meaning that GBIF could not decide between the variants and so returns no match. I personally think this behavior could be improved, but right now that is the situation.

If you try this, with verbose=TRUE, you should get back more choices.

```r
rgbif::name_backbone(name = "Sagittaria australis", verbose = TRUE)
```

NA Sagittaria australis (J.G.Sm.) Small
Sagittaria australis Pomp & Wilbert, 1988

So to work around the problem, use verbose=TRUE. You might need to filter the returned matches for quality afterwards, though.
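
A minimal sketch of what that filtering could look like, assuming the verbose result is a data frame with `matchType` and `confidence` columns. The helper `filter_candidates()` and its `min_confidence` cutoff of 90 are hypothetical, not part of rgbif, and the toy rows below are placeholders rather than real API output:

```r
# Hypothetical helper (not an rgbif function): keep only EXACT candidates
# whose confidence is at or above an arbitrary, user-chosen cutoff.
filter_candidates <- function(candidates, min_confidence = 90) {
  keep <- !is.na(candidates$matchType) &
    candidates$matchType == "EXACT" &
    !is.na(candidates$confidence) &
    candidates$confidence >= min_confidence
  candidates[keep, , drop = FALSE]
}

# Placeholder data standing in for a verbose result; names and confidence
# values are invented for illustration only.
candidates <- data.frame(
  scientificName = c("Genus species Author A",
                     "Genus species Author B",
                     "Genus speciosa Author C"),
  matchType = c("EXACT", "EXACT", "FUZZY"),
  confidence = c(98, 40, 95)
)

exact_candidates <- filter_candidates(candidates)
print(exact_candidates)

# With network access, the input would instead come from something like:
# candidates <- rgbif::name_backbone(name = "Sagittaria australis", verbose = TRUE)
```

Even after filtering, more than one row can remain, so choosing between homonyms is still up to the user.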

mikeroswell commented 2 years ago

Thanks so much for the explanation! I will try to remember that this happens. I agree that this is not a great situation and it would be helpful if there were different matchTypes for "too many choices" vs. bad names. Perhaps, in the rgbif universe, this is some kind of wrapper around the results of the "verbose" query that can parse when matching with homonyms occurs or the like. I'm not yet familiar enough with the kinds of output to know just what that would look like but if I end up writing any code to solve this problem I'll share it here or in a PR as appropriate. Thanks!

jhnwllr commented 2 years ago

@mikeroswell I think the main issue with always checking for too many choices from the start, or always using verbose=TRUE, is that it is going to be slower when dealing with a lot of names. So it is not really possible or desirable to add an extra call each time. Additionally, it doesn't really solve the issue, since rgbif still wouldn't know which taxon to match to.

mikeroswell commented 2 years ago

Yes, would be good to avoid defaulting to verbose for lots of reasons... what if it was conditional? I haven't done any benchmarking on this but my intuition is that, compared to dealing with the ambiguity of matchType = "NONE" being a bad name vs. a likely good one with homonymy, this would be a net time saver from the researcher perspective, if not from the API's.

```r
library(dplyr)

# Check whether matchType == "NONE" really means "too many options"
nbRobust <- function(name) {
  naive <- rgbif::name_backbone(name = name)
  if (naive$matchType == "NONE") {
    # Retry with verbose = TRUE to see the alternatives GBIF considered
    tooMany <- rgbif::name_backbone(name = name, verbose = TRUE)
    # match() returns NA (not NULL) when there is no EXACT alternative
    firstExact <- match("EXACT", tooMany$matchType)
    if (is.na(firstExact)) {
      return(data.frame(naive, matchNote = "matching problem"))
    }
    # Note: taxonomic status is not checked here, only the match type
    return(data.frame(tooMany[firstExact, ],
                      matchNote = "First exact match returned but other exact matches may exist"))
  }
  if (naive$matchType == "EXACT") {
    return(data.frame(naive, matchNote = "unambiguous"))
  }
  data.frame(naive, matchNote = "possible synonymy or misspelling")
}

nbRobust("Sagittaria australis")
nbRobust("Sagittarius australis")
nbRobust("Rudbeckia hirta")

chcklst <- c("Athyrium angustum", "Sagittaria australis", "Sagittarius australis",
             "Rudbeckia hirta", "Pyrola elliptica", "Rhus glabra", "Chamaecrista nictitans")
test.df <- purrr::map_dfr(chcklst, nbRobust)

# how many screwy names?
test.df %>%
  group_by(matchNote) %>%
  summarize(n = n())

# how many accepted vs. synonym?
test.df %>%
  filter(matchNote == "unambiguous") %>%
  group_by(status) %>%
  summarize(n = n())
```

jhnwllr commented 2 years ago

I will write a little warning about this here: https://docs.ropensci.org/rgbif/articles/taxonomic_names.html

In general, I think it is not worth a custom function, and it is better left to the user to sort out how to handle unmatched names. It might be possible in the future for GBIF to build this into their API.

https://github.com/ropensci/rgbif/issues/536

mikeroswell commented 2 years ago

Hi John, Thanks for thinking this through together!

I agree that exactly how people will want to deal with unmatched names is likely to be highly idiosyncratic, and I didn't think the little snippet I pasted above was going to be your fix.

But I think it would be helpful for rgbif to help users disambiguate the "too many choices" NONE from the "wtf" NONE. I totally understand that the GBIF API is a bit clunky and makes this hard, but I would make a feature request that rgbif could help in some way by at least flagging that EXACT matches exist. IMO, the problem with only adding a warning in the vignette is that a typical user needs to follow a long chain of problems to get there (unless they are really good at reading manuals and remembering what they say before they have an issue; that person is definitely not me), and I would say that the NONE when EXACT exists is not expected behavior for a naïve user.

One potential middle ground would be for the name_ functions to emit a specific warning when they return NONE: that this "too many choices" situation exists, that it is associated with homonymy, and maybe that users could try X (where X is some workflow that you think is a reasonable way to wrangle the GBIF API given this issue), rather than adding X to the actual source code of your package.
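
To make the idea concrete, here is a rough sketch of such a warning, written as a standalone check rather than as anything in rgbif itself. The function name `flag_too_many_choices()` and the wording of the message are hypothetical; it assumes the default result has `matchType` and `verbatim_name` columns and that the verbose result has a `matchType` column. The inputs below are placeholder data, not real API output:

```r
# Hypothetical helper (not part of rgbif): warn when a NONE result
# coexists with EXACT candidates in the verbose output.
# `naive` is the one-row default result; `candidates` is the verbose result.
flag_too_many_choices <- function(naive, candidates) {
  n_exact <- sum(candidates$matchType == "EXACT", na.rm = TRUE)
  ambiguous <- identical(as.character(naive$matchType[1]), "NONE") && n_exact > 0
  if (ambiguous) {
    warning(sprintf(
      "'%s' returned matchType NONE, but %d EXACT candidate(s) exist; this may be homonymy. Try verbose = TRUE.",
      naive$verbatim_name[1], n_exact
    ), call. = FALSE)
  }
  invisible(ambiguous)
}

# Placeholder inputs standing in for the two name_backbone() results.
naive <- data.frame(matchType = "NONE", verbatim_name = "Genus species")
candidates <- data.frame(matchType = c("NONE", "EXACT", "EXACT"))

# Capture the warning text instead of printing it to the console.
msg <- tryCatch(
  flag_too_many_choices(naive, candidates),
  warning = function(w) conditionMessage(w)
)
print(msg)
```

This still costs a second API call per unmatched name, so it only makes sense to run it on the names that came back as NONE.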

Thanks so much! It's amazing to have resources like rgbif at the click of a button and package maintainers willing to engage like this in near real-time!