jalapic / engsoccerdata

English and European soccer results 1871-2022

Function to match team names from other datasets with those in 'teamnames' dataframe #41

Open JoGall opened 7 years ago

A possible improvement that I've used in recent blog posts: a function that matches team names from another dataset to the team names used in engsoccerdata.

For each team name in a vector, it finds the most similar string in the name_other column of the teamnames dataframe (using the levenshteinSim function from the RecordLinkage package) and returns the corresponding entry from the name column.

Works well for me so far, but it is untested with non-English teams.
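
To illustrate the similarity score the matching relies on, here is a minimal base-R sketch. It assumes the score is 1 minus the edit distance divided by the length of the longer string, which is my understanding of levenshteinSim. The helper `lev_sim` and the candidate names are hypothetical stand-ins, not the real function or the real contents of teamnames.

```r
# Base-R stand-in for RecordLinkage::levenshteinSim (assumed formula:
# 1 - edit distance / length of the longer string). 'lev_sim' is a
# hypothetical helper, not part of any package.
lev_sim <- function(a, b) {
  d <- as.vector(utils::adist(a, b))  # Levenshtein distance to each candidate
  1 - d / pmax(nchar(a), nchar(b))    # scale to a 0-1 similarity
}

# Hypothetical candidate names, not taken from the real 'teamnames' data
candidates <- c("Manchester United", "Manchester City", "Newcastle United")

sims <- lev_sim("Manchester Utd", candidates)
best <- candidates[which.max(sims)]   # candidate with the highest similarity
print(data.frame(candidates, sims))
print(best)
```

Note how close the top two scores are for an abbreviated name; that is the reason the function below offers a way to inspect the matches.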

#---------------------------------------------------------------------------
# matchTeamnames()
#---------------------------------------------------------------------------
# Matches a vector of team names with names used by 'teamnames' dataframe in
# engsoccerdata package
#---------------------------------------------------------------------------
# * Takes a vector of team names and returns a vector of the same length
#   holding the matching engsoccerdata name for each input
# * 'min_dist' is the lowest similarity (0 to 1) accepted as a match; if every
#   candidate for a team scores below it, NA is returned for that team
# * Returns a vector by default; if 'checkResults' is TRUE, returns a
#   dataframe of old names, best matches and their similarity scores for
#   purposes of validation
#---------------------------------------------------------------------------
matchTeamnames <- function(teams, min_dist = 0.1, checkResults = FALSE) {
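  # library() stops with an error if a package is missing, whereas require()
  # only warns; plyr must also be installed because rbind.fill() is called
  # below via plyr::
  library(engsoccerdata)
  library(RecordLinkage)
  library(dplyr)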

  teams <- as.character(teams)

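  old_new_df <- lapply(unique(teams), function(x) {
    # similarity (0 to 1) between x and every alternative name in 'teamnames'
    similarity <- levenshteinSim(as.character(x), as.character(teamnames$name_other))
    best <- max(similarity, na.rm = TRUE)
    # accept the best match only if it reaches the threshold; otherwise return
    # a real missing value rather than the string "NA"
    new_name <- if (best >= min_dist) as.character(teamnames$name[which.max(similarity)]) else NA_character_

    # column is still called 'distance' although it holds a similarity score
    data.frame(old_name = x, new_name, distance = best, stringsAsFactors = FALSE)
  }) %>%
    plyr::rbind.fill()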

  if(checkResults) {
    return(old_new_df)
  } else {
    teams <- old_new_df$new_name[match(teams, old_new_df$old_name)]
    return(teams)
  }
}
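
The threshold and the final mapping step can be tried without installing any packages. The sketch below is a simplified base-R analogue, not the function above: `lev_sim`, `match_names`, `min_sim` and the small lookup table are all hypothetical, and the similarity formula is the same assumed one (1 minus edit distance over the longer string's length).

```r
# Hypothetical base-R stand-in for levenshteinSim (assumed formula)
lev_sim <- function(a, b) {
  d <- as.vector(utils::adist(a, b))
  1 - d / pmax(nchar(a), nchar(b))
}

# Simplified analogue of matchTeamnames(): score each unique input once,
# apply the threshold, then map results back onto the full input vector.
match_names <- function(teams, lookup_other, lookup_name, min_sim = 0.1) {
  teams <- as.character(teams)
  uniq <- unique(teams)
  best <- vapply(uniq, function(x) {
    s <- lev_sim(x, lookup_other)
    if (max(s) >= min_sim) lookup_name[which.max(s)] else NA_character_
  }, character(1), USE.NAMES = FALSE)
  best[match(teams, uniq)]  # restore original order and duplicates
}

# Made-up lookup table in the shape of 'teamnames' (name_other -> name)
lookup_other <- c("Man Utd", "Manchester United", "Spurs", "Tottenham Hotspur")
lookup_name  <- c("Manchester United", "Manchester United",
                  "Tottenham Hotspur", "Tottenham Hotspur")

result <- match_names(c("Man Utd", "Spurs", "Man Utd", "zzzz"),
                      lookup_other, lookup_name, min_sim = 0.5)
print(result)
```

With the default threshold of 0.1 almost any string will find some match, so it is worth inspecting the checkResults = TRUE output, or raising the threshold, before trusting the result on a new dataset.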