matildabrown / rWCVP

Generating Summaries, Reports and Plots from the World Checklist of Vascular Plants
https://matildabrown.github.io/rWCVP/
GNU General Public License v3.0

How to resolve synonyms programmatically? #63

Open MarcRieraDominguez opened 8 months ago

MarcRieraDominguez commented 8 months ago

Hi! Congratulations on the great package! I am experimenting with it to recover species' distributions, and I came across a difficulty: how to resolve synonyms programmatically? The wcvp_match_names() function tells the user whether the supplied name is a synonym, but it does not provide the accepted name.

For example, Abelia triflora (https://powo.science.kew.org/taxon/urn:lsid:ipni.org:names:148232-1) is a synonym of Zabelia triflora (https://powo.science.kew.org/taxon/urn:lsid:ipni.org:names:150126-1). I can tell that Abelia triflora is a synonym by using wcvp_match_names(), but I lack the information to resolve the synonymy programmatically.

Moreover, wcvp_distribution() only accepts a species name as input, which means ignoring the wealth of information provided by wcvp_match_names(): author names, numerical ID codes. I suppose wcvp_distribution() resolves the supplied names to Accepted, or Artificial Hybrid, or the appropriate Synonym (I haven't looked at the source code, lazy me xd). Could wcvp_distribution() be enhanced to give the user more control over what is being searched?

I have used rWCVPdata_0.4.1, rWCVP_1.2.6.

Many thanks in advance!

rWCVP::wcvp_match_names(data.frame(species.test = "Abelia triflora"),
                 name_col = "species.test",
                 id_col = NULL, author_col = NULL, join_cols = NULL, fuzzy = TRUE, progress_bar = TRUE)
#> 
#> ── Matching names to WCVP ──────────────────────────────────────────────────────
#> ℹ Using the `species.test` column
#> ! No author information supplied - matching on taxon name only
#> 
#> ── Exact matching  names ──
#> 
#> ── Matching complete! ──
#> 
#> ✔ Matched 1 of  names
#> ℹ Exact (without author): 1
#> ! Names with multiple matches: 0
#>      species.test             match_type multiple_matches match_similarity
#> 1 Abelia triflora Exact (without author)            FALSE                1
#>   match_edit_distance wcvp_id       wcvp_name   wcvp_authors wcvp_rank
#> 1                   0 2609524 Abelia triflora R.Br. ex Wall.   Species
#>   wcvp_status wcvp_homotypic wcvp_ipni_id wcvp_accepted_id
#> 1     Synonym           TRUE     148232-1          2470477
rWCVP::wcvp_distribution("Abelia triflora",
                  taxon_rank = "species",
                  native = TRUE, introduced = TRUE,
                  extinct = FALSE, location_doubtful = FALSE)
#> Error in `rWCVP::wcvp_distribution()`:
#> ! No distribution for that taxon. Are the rank and spelling both
#>   correct?
#> Backtrace:
#>     ▆
#>  1. └─rWCVP::wcvp_distribution(...)
#>  2.   └─cli::cli_abort("No distribution for that taxon. Are the rank and spelling both correct?")
#>  3.     └─rlang::abort(...)

rWCVP::wcvp_match_names(data.frame(species.test = "Zabelia triflora"),
                 name_col = "species.test",
                 id_col = NULL, author_col = NULL, join_cols = NULL, fuzzy = TRUE, progress_bar = TRUE)
#> 
#> ── Matching names to WCVP ──────────────────────────────────────────────────────
#> ℹ Using the `species.test` column
#> ! No author information supplied - matching on taxon name only
#> 
#> ── Exact matching  names ──
#> 
#> ── Matching complete! ──
#> 
#> ✔ Matched 1 of  names
#> ℹ Exact (without author): 1
#> ! Names with multiple matches: 0
#>       species.test             match_type multiple_matches match_similarity
#> 1 Zabelia triflora Exact (without author)            FALSE                1
#>   match_edit_distance wcvp_id        wcvp_name
#> 1                   0 2470477 Zabelia triflora
#>                                  wcvp_authors wcvp_rank wcvp_status
#> 1 (R.Br. ex Wall.) Makino ex Hisauti & H.Hara   Species    Accepted
#>   wcvp_homotypic wcvp_ipni_id wcvp_accepted_id
#> 1             NA     150126-1          2470477
rWCVP::wcvp_distribution("Zabelia triflora",
                  taxon_rank = "species",
                  native = TRUE, introduced = TRUE,
                  extinct = FALSE, location_doubtful = FALSE)
#> Simple feature collection with 6 features and 5 fields
#> Geometry type: GEOMETRY
#> Dimension:     XY
#> Bounding box:  xmin: 60.50417 ymin: 21.13951 xmax: 116.1341 ymax: 38.47211
#> Geodetic CRS:  WGS 84
#>            LEVEL3_NAM LEVEL3_COD LEVEL2_COD LEVEL1_COD occurrence_type
#> 1         Afghanistan        AFG         34          3          native
#> 2 China South-Central        CHC         36          3          native
#> 3               Tibet        CHT         36          3          native
#> 4               Nepal        NEP         40          4          native
#> 5            Pakistan        PAK         40          4          native
#> 6       West Himalaya        WHM         40          4          native
#>                         geometry
#> 1 POLYGON ((71.26777 38.30211...
#> 2 POLYGON ((103.1547 34.07671...
#> 3 POLYGON ((90.51833 28.08, 9...
#> 4 POLYGON ((81.80307 30.36361...
#> 5 MULTIPOLYGON (((73.68497 36...
#> 6 POLYGON ((74.97414 36.98481...

sessionInfo()
#> R version 4.2.0 (2022-04-22 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22631)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.utf8  LC_CTYPE=Spanish_Spain.utf8   
#> [3] LC_MONETARY=Spanish_Spain.utf8 LC_NUMERIC=C                  
#> [5] LC_TIME=Spanish_Spain.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] httr_1.4.3             tidyr_1.3.0            bit64_4.0.5           
#>  [4] splines_4.2.0          prodlim_2019.11.13     highr_0.9             
#>  [7] selectr_0.4-2          blob_1.2.3             yaml_2.3.5            
#> [10] globals_0.16.2         ipred_0.9-12           pillar_1.9.0          
#> [13] RSQLite_2.3.1          lattice_0.20-45        glue_1.6.2            
#> [16] digest_0.6.33          rvest_1.0.2            colorspace_2.0-3      
#> [19] htmltools_0.5.6        Matrix_1.4-1           pkgconfig_2.0.3       
#> [22] phonics_1.3.10         listenv_0.9.0          purrr_1.0.1           
#> [25] xtable_1.8-4           scales_1.2.1           lava_1.6.10           
#> [28] ff_4.0.7               tibble_3.2.1           proxy_0.4-26          
#> [31] generics_0.1.2         ggplot2_3.4.0          cachem_1.0.6          
#> [34] withr_2.5.0            RecordLinkage_0.4-12.4 nnet_7.3-17           
#> [37] cli_3.6.1              ada_2.0-5              survival_3.3-1        
#> [40] magrittr_2.0.3         memoise_2.0.1          evaluate_0.15         
#> [43] fs_1.5.2               future_1.33.0          fansi_1.0.3           
#> [46] parallelly_1.36.0      MASS_7.3-57            xml2_1.3.3            
#> [49] class_7.3-20           tools_4.2.0            data.table_1.14.2     
#> [52] lifecycle_1.0.3        stringr_1.5.0          munsell_0.5.0         
#> [55] reprex_2.0.2           rWCVPdata_0.4.1        compiler_4.2.0        
#> [58] e1071_1.7-9            evd_2.3-6.1            rlang_1.1.1           
#> [61] units_0.8-0            classInt_0.4-3         grid_4.2.0            
#> [64] gt_0.9.0               rstudioapi_0.13        rmarkdown_2.14        
#> [67] gtable_0.3.0           codetools_0.2-18       curl_4.3.2            
#> [70] DBI_1.1.2              R6_2.5.1               knitr_1.39            
#> [73] dplyr_1.1.2            fastmap_1.1.0          future.apply_1.11.0   
#> [76] bit_4.0.4              utf8_1.2.2             KernSmooth_2.23-20    
#> [79] stringi_1.7.6          parallel_4.2.0         Rcpp_1.0.11           
#> [82] vctrs_0.6.3            sf_1.0-7               rpart_4.1.16          
#> [85] tidyselect_1.2.0       xfun_0.40              rWCVP_1.2.6

Created on 2024-03-14 with reprex v2.0.2

matildabrown commented 8 months ago

Hi Marc, for synonyms you can use the wcvp_accepted_id field in the output to link to the accepted names. For example, in this case you could use:

library(dplyr)  # provides %>%, select() and left_join()

rWCVP::wcvp_match_names(data.frame(species.test = "Abelia triflora"),
                        name_col = "species.test",
                        id_col = NULL, author_col = NULL, 
                        join_cols = NULL, fuzzy = TRUE, 
                        progress_bar = TRUE) %>%
  left_join(
    # rename the key so it lines up with wcvp_accepted_id in the match table
    rWCVPdata::wcvp_names %>% select(wcvp_accepted_id = plant_name_id,
                                     wcvp_accepted_name = taxon_name),
    by = "wcvp_accepted_id"
  )

which appends the accepted name to the dataframe:

     species.test             match_type multiple_matches match_similarity match_edit_distance wcvp_id
1 Abelia triflora Exact (without author)            FALSE                1                   0 2609524
        wcvp_name   wcvp_authors wcvp_rank wcvp_status wcvp_homotypic wcvp_ipni_id wcvp_accepted_id
1 Abelia triflora R.Br. ex Wall.   Species     Synonym           TRUE     148232-1          2470477
  wcvp_accepted_name
1   Zabelia triflora
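
For anyone who wants to see the lookup without downloading rWCVPdata, here is a minimal base-R sketch of the same idea on toy data. The two small data frames are stand-ins whose values are copied from the output in this thread, and add_accepted_name() is a hypothetical helper, not a function in rWCVP.

```r
# Toy stand-ins for the real tables; values copied from the output above.
matches <- data.frame(
  species.test     = c("Abelia triflora", "Zabelia triflora"),
  wcvp_status      = c("Synonym", "Accepted"),
  wcvp_accepted_id = c(2470477, 2470477)
)
names_tbl <- data.frame(
  plant_name_id = c(2609524, 2470477),
  taxon_name    = c("Abelia triflora", "Zabelia triflora")
)

# Hypothetical helper (not part of rWCVP): look up the accepted name
# for each matched row via its wcvp_accepted_id.
add_accepted_name <- function(matches, names_tbl) {
  idx <- match(matches$wcvp_accepted_id, names_tbl$plant_name_id)
  matches$wcvp_accepted_name <- names_tbl$taxon_name[idx]  # NA if no accepted ID is found
  matches
}

resolved <- add_accepted_name(matches, names_tbl)
unique(resolved$wcvp_accepted_name)
```

With the real data, names_tbl would presumably be rWCVPdata::wcvp_names, and each resulting accepted name could then be passed to wcvp_distribution().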

Regarding your other question, we do not support distribution information for synonyms, and do not automatically resolve synonyms as part of this process. The main reason for this is that treatment of synonyms is going to vary depending on what you are using the data for, and the type of synonym. In your example, A. triflora is a homotypic synonym of Z. triflora, which makes things easier, but for heterotypic synonyms, it becomes important to know what species concept you are dealing with. If Z. triflora had been 'lumped' into another species, should rWCVP automatically return the distribution of the old species concept, or the new one, which might be much larger? It gets even trickier when we think about splitting species... The WCVP only includes distribution information for Accepted (and some Unplaced) species, so there is no snapshot of the 'distribution at the time that this synonym was considered Accepted', if that makes sense?
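
One way to act on this distinction is to sort matched names into those that can be swapped for the accepted name and those that need a human decision. The sketch below is only an illustration on toy data: the column names follow the wcvp_match_names() output shown earlier, the 'Name one' row is invented, and treating any wcvp_homotypic value other than TRUE as "not homotypic" is an assumption.

```r
# Toy match table; the 'Name one' row is invented for illustration.
matches <- data.frame(
  wcvp_name      = c("Abelia triflora", "Name one", "Zabelia triflora"),
  wcvp_status    = c("Synonym", "Synonym", "Accepted"),
  wcvp_homotypic = c(TRUE, NA, NA)
)

# Assumption: anything other than TRUE (including NA) means "not homotypic".
is_homotypic <- matches$wcvp_homotypic %in% TRUE

# Accepted names need no action; homotypic synonyms share a type with the
# accepted name; everything else is left for manual review.
matches$resolution <- ifelse(
  matches$wcvp_status == "Accepted", "none needed",
  ifelse(matches$wcvp_status == "Synonym" & is_homotypic,
         "substitute accepted name", "review manually")
)

matches[, c("wcvp_name", "resolution")]
```

Whether substituting homotypic synonyms automatically is appropriate still depends on what the data are being used for.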

However, I think we can make this a more useful error message. For example:

! Distribution data not available for synonyms; please use the accepted taxon name.
ℹ The name 'Abelia triflora' is a homotypic synonym of 'Zabelia triflora' in this version of the WCVP (v12).

or, for a heterotypic synonym:

! Distribution data not available for synonyms; please use the accepted taxon name.
ℹ The name 'Name one' is a heterotypic synonym of 'Name two' in this version of the WCVP (v12). 
Note that for heterotypic synonyms, the distribution of 'Name two' might be different from the species 
concept represented by 'Name one'. 

Do you think that is more helpful, especially if we included the accompanying explanation in the Details section of the help pages?
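
As a rough sketch of how the two messages could be assembled, assuming the name, its accepted name and the synonym type are already known, something like the following would do. synonym_message() and its arguments are hypothetical and not part of rWCVP.

```r
# Hypothetical helper (not part of rWCVP): build the proposed message lines.
synonym_message <- function(name, accepted, homotypic, version = "v12") {
  type <- if (isTRUE(homotypic)) "homotypic" else "heterotypic"
  msg <- c(
    "Distribution data not available for synonyms; please use the accepted taxon name.",
    sprintf("The name '%s' is a %s synonym of '%s' in this version of the WCVP (%s).",
            name, type, accepted, version)
  )
  # Only heterotypic synonyms get the extra warning about species concepts.
  if (!isTRUE(homotypic)) {
    msg <- c(msg, sprintf(
      "Note that for heterotypic synonyms, the distribution of '%s' might be different from the species concept represented by '%s'.",
      accepted, name
    ))
  }
  msg
}

synonym_message("Abelia triflora", "Zabelia triflora", homotypic = TRUE)
synonym_message("Name one", "Name two", homotypic = FALSE)
```

The returned character vector could be collapsed and handed to stop(), or formatted as bullets for cli::cli_abort(), which the backtrace earlier in the thread shows wcvp_distribution() already calls.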

MarcRieraDominguez commented 8 months ago

Hi Matilda, Sorry for the late reply! Thank you for your suggestions, that join made my life easier :) I understand about the distribution data, and I think that more detailed error messages would be very helpful! Perhaps a third type of error message could be considered for when the user has supplied a name that is not in the WCVP (something like Treebeard sp.). Admittedly though, it's unlikely that someone would jump into requesting a distribution without having checked the names first. Happy Easter!
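
To illustrate the third case, here is a minimal sketch on toy data that classifies a name before a distribution is requested. name_status() is a hypothetical helper, and the two-row table (with taxon_name and taxon_status columns) is only a stand-in for the real names table.

```r
# Toy stand-in for the WCVP names table, using the two names from this thread.
names_tbl <- data.frame(
  taxon_name   = c("Abelia triflora", "Zabelia triflora"),
  taxon_status = c("Synonym", "Accepted")
)

# Hypothetical helper (not part of rWCVP): report a name's status,
# or flag it as absent from the checklist.
name_status <- function(name, names_tbl) {
  hit <- names_tbl$taxon_status[names_tbl$taxon_name == name]
  if (length(hit) == 0) return("not in checklist")
  as.character(hit[1])  # toy simplification: ignores names matching several records
}

name_status("Treebeard sp.", names_tbl)
name_status("Abelia triflora", names_tbl)
```

A check like this would let the error distinguish a misspelled or unknown name from a synonym.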