matildabrown / rWCVP

Generating Summaries, Reports and Plots from the World Checklist of Vascular Plants
https://matildabrown.github.io/rWCVP/
GNU General Public License v3.0

Fuzzy matching doesn't deduplicate names before matching #32

Closed matildabrown closed 1 year ago

matildabrown commented 2 years ago

For a large dataset with repeated names (e.g. point occurrences), resolving just 10 fuzzy matches was going to take ~40 min, because the matching was actually being run >100 times for each name (once for every repeated row) rather than once per unique name.

Perhaps the function needs rearranging so that it internally builds a table of the unique rows of the matching columns only, matches against that, and then joins the result back onto the original data?

Or we could recommend doing this in the docs?
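
For illustration, the dedupe, match, join-back idea above can be sketched roughly as follows. This is not rWCVP code: it is a language-agnostic sketch written in Python, `difflib.get_close_matches` is only a stand-in for the package's fuzzy matcher, and the function name `match_unique_then_join`, the `cutoff` value and the sample names are all hypothetical.

```python
from difflib import get_close_matches


def match_unique_then_join(names, reference, cutoff=0.8):
    """Fuzzy-match each distinct name once, then map the result back to every row.

    Hypothetical helper for illustration only; difflib stands in for the
    real fuzzy matcher.
    """
    calls = 0
    lookup = {}
    # dict.fromkeys keeps one entry per distinct name, in first-seen order
    for name in dict.fromkeys(names):
        hits = get_close_matches(name, reference, n=1, cutoff=cutoff)
        calls += 1  # one (expensive) fuzzy match per unique name
        lookup[name] = hits[0] if hits else None
    # "Join back": every original row looks up the result for its name
    matched = [lookup[name] for name in names]
    return matched, calls


# Made-up occurrence records with repeated, misspelled names
occurrences = [
    "Quercus robur",
    "Quercus robar",
    "Quercus robar",
    "Fagus sylvatca",
    "Quercus robar",
]
checklist = ["Quercus robur", "Fagus sylvatica", "Betula pendula"]

matched, calls = match_unique_then_join(occurrences, checklist)
print(calls, "fuzzy matches for", len(occurrences), "rows")
```

With this shape the number of fuzzy matches scales with the number of unique names rather than the number of rows. In R the same pattern would presumably be something like `dplyr::distinct()` on the matching columns followed by a `left_join()` back onto the original data, whether done inside the function or by the user beforehand.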

matildabrown commented 2 years ago

Related: two identical name strings with different author strings are flagged as multiple matches

This might fix some of the weird numbers in the messaging too (I think I broke it again)

matildabrown commented 1 year ago

Closing for now - happy to reopen in future