matildabrown / rWCVP

Generating Summaries, Reports and Plots from the World Checklist of Vascular Plants
https://matildabrown.github.io/rWCVP/
GNU General Public License v3.0

wcvp_match_names dropping species names? #54

Closed azizka closed 1 year ago

azizka commented 1 year ago

Great package! When provided with a list of names (538 in this case), the function reports matching only some of them (511). Why? The output data seems complete. I can provide the species list if necessary.

Using the scrubbed_species_binomial column

── Exact matching 538 names ──

✔ Found 508 of 538 names

── Fuzzy matching 3 names ──

✔ Found 3 of 3 names

── Matching complete! ──

✔ Matched 511 of 511 names
ℹ Exact (with author): 392
ℹ Exact (without author): 116
ℹ Fuzzy (edit distance): 2
ℹ Fuzzy (phonetic): 1
! Names with multiple matches: 7

barnabywalker commented 1 year ago

Thanks, Alex! I think I know what’s going on - there’s a filter before the fuzzy matching that removes, from the list of unmatched names, any name that shares an ID with one that has already been matched: https://github.com/matildabrown/rWCVP/blob/da4fabb3d5201cc4ca31b544aec1a4a02047e5b5/R/wcvp_match_names.R#L148

Did you pass in an ID column name to wcvp_match_names, e.g. wcvp_match_names(names, name_col = "spnames", id_col = "spid")?

azizka commented 1 year ago

Ah. No, no species ID:

wcvp_match_names(names_df = name_li, name_col = "scrubbed_species_binomial", author_col = "scrubbed_author")

barnabywalker commented 1 year ago

Ah, looks like it must be a bug then.

Don’t suppose you could share the input data?

azizka commented 1 year ago

Yes, no problem: names_list_matching_WCVP.csv

R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8  LC_MONETARY=German_Germany.utf8  LC_NUMERIC=C
[5] LC_TIME=German_Germany.utf8

attached base packages:
[1] stats  graphics  grDevices  datasets  utils  methods  base

other attached packages:
 [1] janitor_2.2.0  rWCVP_1.2.4  CoordinateCleaner_2.0-20  sf_1.0-11  readxl_1.4.2  countrycode_1.4.0
 [7] BIEN_1.2.6  RPostgreSQL_0.7-5  DBI_1.1.3  lubridate_1.9.2  forcats_1.0.0  stringr_1.5.0
[13] dplyr_1.1.0  purrr_1.0.1  readr_2.1.4  tidyr_1.3.0  tibble_3.2.1  ggplot2_3.4.1
[19] tidyverse_2.0.0

barnabywalker commented 1 year ago

Sorry it's taken so long to get back to you on this - the problem was entirely with the CLI summary of the matching, which wasn't taking into account the author_col you provided. So all the matching was fine, but the CLI summary reported the wrong counts.

I've made a pull request with the fix now.
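
For illustration, here is one plausible way a summary can come up short when it ignores the author column, sketched in base R. The data frame, its column names and the placeholder names in it are all made up, and this is not the actual rWCVP code or the fix in the pull request; it only shows the general effect of counting distinct names instead of distinct name + author pairs.

```r
# Toy input (hypothetical placeholder names): two rows share a name
# but differ in author.
names_df <- data.frame(
  name   = c("Aus bus", "Aus bus", "Cus dus"),
  author = c("Author1", "Author2", "Author1"),
  stringsAsFactors = FALSE
)

# Counting distinct names only collapses the two "Aus bus" rows into one.
n_by_name <- length(unique(names_df$name))

# Counting distinct name + author pairs keeps them apart.
n_by_name_author <- nrow(unique(names_df[, c("name", "author")]))

cat("Distinct names:", n_by_name, "\n")
cat("Distinct name + author pairs:", n_by_name_author, "\n")
```

If a summary is built from the first count while the matching itself works on the second, the reported totals can be lower than the number of rows that were actually matched.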