ropensci / taxize

A taxonomic toolbelt for R
https://docs.ropensci.org/taxize

Unclear why genus is being output in matched_name2 when two data sources do not include the species. #889

Closed jtmiller28 closed 2 years ago

jtmiller28 commented 2 years ago
library(taxize)

# Look up the GNR data source IDs by title
sources <- gnr_datasources()

Get_Disc_Life <- sources$id[sources$title == 'Discover Life Bee Species Guide']

Get_Disc_Life_ITIS <- c(Get_Disc_Life, sources$id[sources$title == 'Integrated Taxonomic Information SystemITIS'])

# unique_names holds the names to resolve (not shown here)
Corrected_Names <- gnr_resolve(sci = unique_names$scientific_name, data_source_ids = Get_Disc_Life_ITIS, preferred_data_sources = Get_Disc_Life, best_match_only = TRUE, canonical = TRUE, resolve_once = TRUE)

When I run gnr_resolve(), some rows under matched_name2 include only the genus and not the specificEpithet. I am unsure why this is. Shouldn't the name be dropped entirely when it isn't resolved to the specificEpithet level? Example output: (image)

Thank you for your time!

Session Info:
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
 [1] taxize_0.9.99   sqldf_0.4-11    RSQLite_2.2.8   gsubfn_0.7     
 [5] proto_1.0.0     forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7    
 [9] purrr_0.3.4     readr_2.0.0     tidyr_1.1.3     tibble_3.1.3   
[13] ggplot2_3.3.5   tidyverse_1.3.1 leafgl_0.1.1    rgdal_1.5-27   
[17] sp_1.4-5        leaflet_2.0.4.1 sf_1.0-1       

loaded via a namespace (and not attached):
 [1] nlme_3.1-153            fs_1.5.0               
 [3] bold_1.2.0              lubridate_1.8.0        
 [5] bit64_4.0.5             httr_1.4.2             
 [7] tools_4.1.0             backports_1.2.1        
 [9] utf8_1.2.2              R6_2.5.1               
[11] KernSmooth_2.23-20      DBI_1.1.1              
[13] colorspace_2.0-2        withr_2.4.2            
[15] tidyselect_1.1.1        curl_4.3.2             
[17] bit_4.0.4               compiler_4.1.0         
[19] chron_2.3-56            cli_3.0.1              
[21] rvest_1.0.2             xml2_1.3.2             
[23] triebeard_0.3.0         scales_1.1.1           
[25] classInt_0.4-3          proxy_0.4-26           
[27] digest_0.6.27           rmarkdown_2.11         
[29] pkgconfig_2.0.3         htmltools_0.5.2        
[31] dbplyr_2.1.1            fastmap_1.1.0          
[33] htmlwidgets_1.5.4       rlang_0.4.11           
[35] readxl_1.3.1            httpcode_0.3.0         
[37] rstudioapi_0.13         generics_0.1.1         
[39] zoo_1.8-9               jsonlite_1.7.2         
[41] crosstalk_1.2.0         magrittr_2.0.1         
[43] Rcpp_1.0.7              munsell_0.5.0          
[45] fansi_0.5.0             ape_5.6-1              
[47] lifecycle_1.0.1         stringi_1.7.3          
[49] yaml_2.2.1              MASS_7.3-54            
[51] plyr_1.8.6              grid_4.1.0             
[53] blob_1.2.2              parallel_4.1.0         
[55] crayon_1.4.2            lattice_0.20-45        
[57] conditionz_0.1.0        haven_2.4.3            
[59] hms_1.1.1               knitr_1.36             
[61] pillar_1.6.4            uuid_1.0-2             
[63] tcltk_4.1.0             codetools_0.2-18       
[65] crul_1.2.0              reprex_2.0.1           
[67] glue_1.4.2              evaluate_0.14          
[69] leaflet.providers_1.9.0 data.table_1.14.0      
[71] modelr_0.1.8            urltools_1.7.3         
[73] foreach_1.5.1           vctrs_0.3.8            
[75] tzdb_0.1.2              cellranger_1.1.0       
[77] gtable_0.3.0            reshape_0.8.8          
[79] assertthat_0.2.1        cachem_1.0.6           
[81] xfun_0.29               mime_0.12              
[83] broom_0.7.10            e1071_1.7-7            
[85] class_7.3-19            iterators_1.0.13       
[87] memoise_2.0.0           units_0.7-2            
[89] ellipsis_0.3.2

zachary-foster commented 2 years ago

Thanks for the report. I will take a closer look soon. I reproduced the issue:

taxize::gnr_resolve(sci = "Bombus cascadensis", data_source_ids = 202, preferred_data_sources = 202, best_match_only = TRUE, canonical = TRUE, resolve_once = TRUE)
#> # A tibble: 1 × 5
#>   user_supplied_name submitted_name     data_source_title    score matched_name2
#> * <chr>              <chr>              <chr>                <dbl> <chr>        
#> 1 Bombus cascadensis Bombus cascadensis Discover Life Bee S…  0.75 Bombus

Created on 2022-03-09 by the reprex package (v2.0.1)

zachary-foster commented 2 years ago

It looks like it only shows the genus because there is no "Bombus cascadensis" in the database:

https://www.discoverlife.org/mp/20q?guide=Bee_genera

Apparently, that's what the GNR API returns in that case:

https://resolver.globalnames.org/name_resolvers.json?names=Bombus%20cascadensis&data_source_ids=202&resolve_once=true&best_match_only=true&preferred_data_sources=202

I am not sure it would be a good idea to change this output on the taxize side of things, since the function is primarily an interface for the GNR service and that is what is returned by that service. Although we can explore that option if there is demand for it.
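In the meantime, genus-only matches can be filtered out on the user side after the call returns. The following is a minimal sketch, not part of taxize: `drop_partial_matches()` is a hypothetical helper, and the data frame is a toy stand-in for the columns returned with canonical = TRUE (the "Apis mellifera" row is made up for illustration). It assumes a match is "partial" when matched_name2 has fewer words than submitted_name.

```r
# Hypothetical helper (not a taxize function): keep only rows whose matched
# name has at least as many words as the submitted name, so a genus-only
# match for a binomial is dropped.
drop_partial_matches <- function(res) {
  count_words <- function(x) lengths(strsplit(trimws(x), "\\s+"))
  keep <- count_words(res$matched_name2) >= count_words(res$submitted_name)
  res[keep, , drop = FALSE]
}

# Toy data mimicking the gnr_resolve() output columns; the second row is
# invented for illustration and is not a real query result.
res <- data.frame(
  user_supplied_name = c("Bombus cascadensis", "Apis mellifera"),
  submitted_name     = c("Bombus cascadensis", "Apis mellifera"),
  matched_name2      = c("Bombus", "Apis mellifera"),
  stringsAsFactors   = FALSE
)

filtered <- drop_partial_matches(res)
print(filtered)
```

Comparing word counts is a blunt heuristic: it would also keep a match to a different species in the same genus, so the filtered rows may still need a manual check.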

jtmiller28 commented 2 years ago

That makes sense. I agree it is more a matter of interpretation on the user's side. Sorry for forgetting to close the issue! Thanks for clearing it up!