matildabrown / rWCVP

Generating Summaries, Reports and Plots from the World Checklist of Vascular Plants
https://matildabrown.github.io/rWCVP/
GNU General Public License v3.0

Function to avoid using Excel accepting/rejecting fuzzy matches #50

Open pgomba opened 1 year ago

pgomba commented 1 year ago

Hi again, I've written a simple function to avoid having to open Excel to decide whether a fuzzy match is valid or not (from the tutorial here: https://matildabrown.github.io/rWCVP/articles/redlist-name-matching.html).

Basically, the function asks you, species by species, to confirm the match (enter 1) or reject it (any other input), and fills in the column resolved_match_type accordingly: rejected rows are labelled "Fuzzy match rejected" and accepted rows are left as NA. It also drops the columns keep, match_edit_distance and match_similarity, as suggested in the tutorial. It uses cli:: for "pretty" headers, which is how I ran into issue #49.

# Requires dplyr (for %>%, select and mutate); cli is called via its namespace.
library(dplyr)

wcvp_manual_fuzzy_check <- function(df,
                                    original_name_col = "scientific_name",
                                    original_author_col = "authority") {

  # Output table: drop the helper columns and add an empty resolved_match_type.
  a <- df %>%
    select(-keep, -match_edit_distance, -match_similarity) %>%
    mutate(resolved_match_type = NA_character_)

  # Small two-row table shown at each prompt: original name vs. fuzzy match.
  show <- data.frame(Info = c("Original", "Fuzzy"), Species = NA, authors = NA)

  # seq_len() avoids iterating over c(1, 0) when df has no rows.
  for (i in seq_len(nrow(df))) {

    show[1, 2] <- df[[original_name_col]][i]
    show[1, 3] <- df[[original_author_col]][i]
    show[2, 2] <- df$wcvp_name[i]
    show[2, 3] <- df$wcvp_authors[i]

    cli::cli_h1("Item {i} of {nrow(df)}")
    print(show)

    manual <- readline("Manual input required: Accept (enter 1) / Reject (any other input): ")

    if (manual == "1") {
      # Accepted matches keep resolved_match_type as NA.
      print(paste("Fuzzy match accepted. ", i, "/", nrow(df)))
    } else {
      print(paste("Fuzzy match rejected. ", i, "/", nrow(df)))
      a$resolved_match_type[i] <- "Fuzzy match rejected"
    }
  }

  a
}

Example:

fuzzy_checked<- wcvp_manual_fuzzy_check(df=fuzzy_matches,
                                        original_name_col = "scientificName",
                                        original_author_col="scientificNameAuthorship")

Happy to send over a better example. Basically, df is the object fuzzy_matches from your tutorial. I use different column names for the species and author columns, but if original_name_col and original_author_col are left unset, I think they default to your standard column names.
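
In the meantime, here is a minimal sketch of the input the function expects and of the bookkeeping it does, without the interactive prompt. The data frame uses made-up placeholder names, not real WCVP output, and apply_fuzzy_decisions is a hypothetical helper that is not part of rWCVP. It takes the accept/reject answers as a character vector instead of calling readline(), and uses base R only:

```r
# Hypothetical toy input with placeholder names (not real WCVP output).
fuzzy_matches <- data.frame(
  scientificName = c("Aus bus", "Cus dus"),
  scientificNameAuthorship = c("L.", "Mill."),
  wcvp_name = c("Aus buus", "Cus duss"),
  wcvp_authors = c("L.", "Mill."),
  keep = NA,
  match_edit_distance = c(1, 1),
  match_similarity = c(0.9, 0.9),
  stringsAsFactors = FALSE
)

# Hypothetical helper: same column handling as the interactive function,
# but the decisions come from a vector ("1" = accept, anything else = reject).
apply_fuzzy_decisions <- function(df, decisions) {
  stopifnot(length(decisions) == nrow(df))
  drop_cols <- c("keep", "match_edit_distance", "match_similarity")
  out <- df[, setdiff(names(df), drop_cols), drop = FALSE]
  # Accepted rows stay NA; rejected rows get the rejection label.
  out$resolved_match_type <- ifelse(decisions == "1",
                                    NA_character_,
                                    "Fuzzy match rejected")
  out
}

# Accept the first match, reject the second.
checked <- apply_fuzzy_decisions(fuzzy_matches, c("1", "n"))
print(checked)
```

Keeping the decisions separate from the prompt would also make it possible to check the column handling without sitting through the questions.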

matildabrown commented 1 year ago

Nice, thanks!