matildabrown / rWCVP

Generating Summaries, Reports and Plots from the World Checklist of Vascular Plants
https://matildabrown.github.io/rWCVP/
GNU General Public License v3.0

`match_similarity` low for fuzzy-matched hybrids #45

Open nlkinlock opened 1 year ago

nlkinlock commented 1 year ago

I've been continuing to work with the wcvp_match_names() function and I've come across something that is not an error, but is perhaps an undesirable aspect of the fuzzy matching process. When matching hybrid names that differ slightly from the names in WCVP, I noticed that the match_similarity is unusually low.

I think this is because hybrid symbols are removed from the submitted names before Levenshtein similarities are calculated, but not from the WCVP names they are compared against. I can see why the cleaning is necessary, but maybe it would be possible either to calculate the similarity both with and without the cleaning and take the higher value, or to remove hybrid symbols from the matched WCVP names as well when calculating similarities. Otherwise, hybrids that do not match exactly will always have relatively low similarities (see the example below).

Of course, I can work around this, but making a change at some point would be a huge help, at least for my work. Thanks in advance for your time, and my apologies for submitting two issues within a week!

df <- data.frame(TaxonName = c("Quercus × kinselae", "Sarracenia × readii", "Asplenium × waikomoi"),
                 Authority = c("(C.H.Mull.) Nixon & C.H.Mull.", "C.R.Bell", "W.H.Wagner & D.D.Palmer"))

wcvp.out <- rWCVP::wcvp_match_names(names_df = df, name_col = "TaxonName", author_col = "Authority")
#> 
#> ── Matching names to WCVP ──────────────────────────────────────────────────────
#> ℹ Using the `TaxonName` column
#> 
#> ── Exact matching 3 names ──
#> 
#> ✔ Found 0 of 3 names
#> 
#> ── Fuzzy matching 3 names ──
#> 
#> ✔ Found 3 of 3 names
#> 
#> ── Matching complete! ──
#> 
#> ✔ Matched 3 of 3 names
#> ℹ Fuzzy (phonetic): 3
#> ! Names with multiple matches: 0

as.data.frame(wcvp.out[, c("TaxonName", "match_similarity", "wcvp_name")])
#>              TaxonName match_similarity            wcvp_name
#> 1   Quercus × kinselae            0.789  Quercus × kinseliae
#> 2  Sarracenia × readii            0.789  Sarracenia × readei
#> 3 Asplenium × waikomoi            0.800 Asplenium × waikamoi

RecordLinkage::levenshteinSim(str1 = wcvp.out$TaxonName, str2 = wcvp.out$wcvp_name)
#> [1] 0.9473684 0.9473684 0.9500000

taxa.no.hyb <- gsub(pattern = " \u00d7", replacement = "", x = wcvp.out$TaxonName)

RecordLinkage::levenshteinSim(str1 = taxa.no.hyb, str2 = wcvp.out$wcvp_name)
#> [1] 0.7894737 0.7894737 0.8000000

Created on 2023-03-14 with reprex v2.0.2
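
To make the numbers in the reprex easier to read: assuming the formula given in the RecordLinkage documentation for `levenshteinSim()` (one minus the Levenshtein distance divided by the length of the longer string), the values for the first name, where the WCVP name is 19 characters long, work out as follows. The edit distances are inferred from the printed similarities, not taken from the package internals.

```latex
\mathrm{sim}(s_1, s_2) = 1 - \frac{d(s_1, s_2)}{\max(|s_1|, |s_2|)}

% Hybrid symbol kept on both sides: distance 1 (the missing "i")
1 - \tfrac{1}{19} \approx 0.9473684

% Hybrid symbol removed from the submitted name only: implied distance 4
1 - \tfrac{4}{19} \approx 0.7894737
```

The third name follows the same pattern with a 20-character WCVP name: 1 - 1/20 = 0.95 and 1 - 4/20 = 0.80. In each case the whole drop in similarity comes from the hybrid symbol being present on one side only, not from the spelling difference itself.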

matildabrown commented 1 year ago

Thanks @nlkinlock! I think your suggestion of removing the hybrid marks from the WCVP names during matching is the way to go; we'll try to implement it in the next release.
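
For anyone needing a workaround in the meantime, the two options discussed in this thread could be sketched roughly as below. This is only an illustration, not the rWCVP implementation: `strip_hybrid()` and `lev_sim()` are hypothetical helper names, and the similarity is computed with base R's `utils::adist()` rather than RecordLinkage, so the exact values may differ from those in `match_similarity`.

```r
# Hypothetical helpers for illustration; neither is part of rWCVP.

strip_hybrid <- function(x) {
  # Drop the hybrid sign (U+00D7), then collapse the whitespace left behind.
  x <- gsub("\u00d7", "", x, fixed = TRUE)
  trimws(gsub("\\s+", " ", x))
}

lev_sim <- function(a, b) {
  # Normalised Levenshtein similarity:
  # 1 - edit distance / length of the longer string, computed pairwise.
  d <- mapply(function(s, t) utils::adist(s, t)[1, 1], a, b, USE.NAMES = FALSE)
  1 - d / pmax(nchar(a), nchar(b))
}

# Names from the reprex above: submitted spelling vs. matched WCVP name.
submitted <- c("Quercus \u00d7 kinselae", "Sarracenia \u00d7 readii",
               "Asplenium \u00d7 waikomoi")
accepted  <- c("Quercus \u00d7 kinseliae", "Sarracenia \u00d7 readei",
               "Asplenium \u00d7 waikamoi")

# Situation described in the issue: symbol stripped from one side only.
one_side <- lev_sim(strip_hybrid(submitted), accepted)

# Option 1: strip the symbol from the WCVP name as well.
both_sides <- lev_sim(strip_hybrid(submitted), strip_hybrid(accepted))

# Option 2: compare with and without cleaning and keep the higher value.
best_of <- pmax(lev_sim(submitted, accepted), both_sides)

print(data.frame(submitted, one_side, both_sides, best_of))
```

With either option, the similarity reflects only the spelling difference in the epithet, so a single-character typo in a hybrid name scores about the same as it would in a non-hybrid name.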