wcvp_match_names dropping species names? #54

azizka commented 1 year ago

Great package! When provided with a list of names (538 in this case), the function reports matching only part of them (511). Why? The output data seems complete. I can provide the species list if necessary.

Using the scrubbed_species_binomial column

── Exact matching 538 names ──

✔ Found 508 of 538 names

── Fuzzy matching 3 names ──

✔ Found 3 of 3 names

── Matching complete! ──

✔ Matched 511 of 511 names
ℹ Exact (with author): 392
ℹ Exact (without author): 116
ℹ Fuzzy (edit distance): 2
ℹ Fuzzy (phonetic): 1
! Names with multiple matches: 7

barnabywalker commented 1 year ago

Thanks, Alex! I think I know what’s going on: there’s a filter before the fuzzy matching that removes, from the list of unmatched names, any name with the same ID as one that has already been matched.

Did you pass an ID column name to wcvp_match_names, e.g. wcvp_match_names(names, name_col = "spnames", id_col = "spid")?
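
To illustrate the kind of filter described above, here is a minimal sketch in base R. It is not the package's actual code: the toy data, the column names `id` and `name`, and the variable names are all hypothetical. It only shows how rows that share an ID with an already matched name drop out before fuzzy matching.

```r
# Toy input; `id` and `name` are hypothetical column names.
# Rows 1 and 2 share an ID (e.g. two spellings recorded for one taxon).
names_df <- data.frame(
  id   = c(1, 1, 2, 3),
  name = c("Poa annua", "Poa anua", "Quercus robur", "Quercus robrr"),
  stringsAsFactors = FALSE
)

# Pretend the exact-matching step found these two names.
exact_matched <- names_df[names_df$name %in% c("Poa annua", "Quercus robur"), ]

# Names still unmatched after the exact step.
unmatched <- names_df[!names_df$name %in% exact_matched$name, ]

# The filter: drop unmatched rows whose ID already has a match.
to_fuzzy <- unmatched[!unmatched$id %in% exact_matched$id, ]

nrow(unmatched)  # 2 names were left unmatched
to_fuzzy$name    # only "Quercus robrr" is passed on to fuzzy matching
```

In this toy case "Poa anua" never reaches the fuzzy step because its ID was already matched, so the number of names sent to fuzzy matching is smaller than the number left over from exact matching.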

azizka commented 1 year ago

ah. No, no species ID:

wcvp_match_names(names_df = name_li, name_col = "scrubbed_species_binomial", author_col = "scrubbed_author")

barnabywalker commented 1 year ago

Ah, looks like it must be a bug then.

Don’t suppose you could share the input data?

azizka commented 1 year ago

yes, no problem. names_list_matching_WCVP.csv

barnabywalker commented 1 year ago

Sorry it's taken so long to get back on this - the problem was entirely with the CLI summary of the matching, which wasn't taking into account the author_col you provided. So all the matching was fine but the CLI counted things wrong.
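
If you want to check the result independently of the printed summary, something like the following base R sketch may help. The data frames and the `matched` column are hypothetical stand-ins, not the package's real output format, and the name-versus-author counts only show one way a summary that ignores the author column could disagree with the input size.

```r
# Hypothetical input: the same name can appear with different authors.
input <- data.frame(
  name   = c("Poa annua", "Poa annua", "Quercus robur", "Abies alba"),
  author = c("L.", "Cham.", "L.", "Mill."),
  stringsAsFactors = FALSE
)

# Stand-in for the matching output: same rows plus a hypothetical `matched` flag.
output <- cbind(input, matched = c(TRUE, TRUE, TRUE, FALSE))

# Counting by name alone gives fewer entries than counting by name + author.
n_by_name        <- length(unique(input$name))                  # 3
n_by_name_author <- nrow(unique(input[, c("name", "author")]))  # 4

# Completeness check: every input name should appear in the output.
missing_names <- setdiff(input$name, output$name)
length(missing_names) == 0

# Count matches directly from the output rather than from the summary.
n_matched <- sum(output$matched)
```

Comparing `nrow(output)` and `n_matched` against the input is a quick way to confirm that no names were dropped, whatever the summary reports.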

