jalapic / engsoccerdata

English and European soccer results 1871-2022

Updated england_current() function #31

Closed. JoGall closed 7 years ago

JoGall commented 7 years ago

Just a few changes to this function, as NAs were being returned. There isn't long left in this season, but the function can easily be updated next season (i.e. update the .csv links and change 'Season' to 2017).

Changes: 'Date' is now Date class instead of character; 'division' and 'tier' now extract the number from the string so that NAs are not returned (e.g. E0 -> 1); and the teamnames dataframe is used to replace team name variants with the main name used in the england dataframe (e.g. "Man City" -> "Manchester City").

england_current <- function(){

  #*update each season*
  df1 <- rbind(read.csv("http://www.football-data.co.uk/mmz4281/1617/E0.csv"),
               read.csv("http://www.football-data.co.uk/mmz4281/1617/E1.csv"),
               read.csv("http://www.football-data.co.uk/mmz4281/1617/E2.csv"),
               read.csv("http://www.football-data.co.uk/mmz4281/1617/E3.csv")
  )

  #convert division names to numeric (e.g. "E0" -> 1); as.character() is needed
  #because strsplit() fails if 'Div' was read in as a factor
  tier <- as.numeric(sapply(strsplit(as.character(df1$Div), ""), "[[", 2)) + 1

  df2 <- data.frame("Date" = as.Date(df1$Date, "%d/%m/%y"),
                    "Season" = rep(2016, nrow(df1)), #*update each season*
                    "home" = df1$HomeTeam,
                    "visitor" = df1$AwayTeam,
                    "FT" = paste0(df1$FTHG, "-", df1$FTAG),
                    "hgoal" = df1$FTHG,
                    "vgoal" = df1$FTAG,
                    "division" = tier,
                    "tier" = tier,
                    "totgoal" = df1$FTHG + df1$FTAG,
                    "goaldif" = df1$FTHG - df1$FTAG,
                    "result" = df1$FTR
  )

  #replace any new team name variants with pre-existing names (e.g. "Man City" -> "Manchester City")
  df2$home <- teamnames$name[match(df2$home, teamnames$name_other)]
  df2$visitor <- teamnames$name[match(df2$visitor, teamnames$name_other)]

  return(df2)
}
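
One thing to watch with the match() lookup above: a team that is missing from name_other comes back as NA. Below is a minimal sketch of a fallback that keeps the original spelling in that case. The `lookup` table and the `standardise_names()` helper are hypothetical stand-ins for illustration, not part of the package.

```r
# Hypothetical lookup table standing in for the package's teamnames data frame.
lookup <- data.frame(
  name       = c("Manchester City", "Manchester United"),
  name_other = c("Man City", "Man United"),
  stringsAsFactors = FALSE
)

# Replace known variants; keep the original spelling when no variant matches.
standardise_names <- function(x, lookup) {
  x <- as.character(x)                    # works for factor or character input
  idx <- match(x, lookup$name_other)      # NA where the name is not in the table
  ifelse(is.na(idx), x, lookup$name[idx])
}

home <- c("Man City", "Arsenal", "Man United")
fixed <- standardise_names(home, lookup)
print(fixed)
# "Manchester City" "Arsenal" "Manchester United"
```

Whether a silent fallback is better than an NA that flags the unknown name is a design choice; the NA at least makes a missing lookup entry visible.
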
jalapic commented 7 years ago

The current version of this function on GitHub doesn't appear to return NAs. As far as I can see, all the improvements were already in that function except for ensuring that the Date is a Date class. Or did I miss something?

I think the issue of what to do with this function in the off-season is a puzzle. Probably best to just leave it as is and change it on the first day of the new season? Or maybe add a warning?

JoGall commented 7 years ago

Ah, something went wrong on my end. I didn't realise the current GitHub version already had the call to the teamnames dataframe.

I'm still getting NAs returned when I run the latest version though. I think it's because the function tries to convert the division and tier to numeric directly ("division" = as.numeric(df$Div), "tier" = as.numeric(df$Div)), but the variable 'Div' also contains a letter (i.e. E0, E1...). I think the numeric value has to be extracted first with strsplit, gsub, etc...

jalapic commented 7 years ago

interesting - I can't repeat that error, but I will look into it. I'll be overhauling the other functions tonight also, so hopefully can track down that error.

jalapic commented 7 years ago

OK, the reason this should work is that E0, E1, E2, E3 are brought in as factors, and as.numeric then returns the integer code of each factor level rather than parsing the label. To ensure it works, I will just wrap the variable in factor() - that ought to do it.
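
A small illustration of the difference, using made-up values rather than the downloaded data:

```r
div_chr <- c("E0", "E1", "E2", "E3", "E0")
div_fct <- factor(div_chr)

# Character input: "E0" cannot be parsed as a number, so every value is NA
# (R also emits a coercion warning, suppressed here).
chr_result <- suppressWarnings(as.numeric(div_chr))
print(chr_result)
# NA NA NA NA NA

# Factor input: the result is each level's integer code; levels are sorted,
# so E0 -> 1, E1 -> 2, E2 -> 3, E3 -> 4.
fct_result <- as.numeric(div_fct)
print(fct_result)
# 1 2 3 4 1

# Caveat: the codes depend on which levels are present in the data.
partial_result <- as.numeric(factor(c("E1", "E3")))
print(partial_result)
# 1 2
```

The last case shows that the factor approach only gives the right tier when all four divisions are present in the data.
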

JoGall commented 7 years ago

Ah, I've just realised why: I had options(stringsAsFactors = FALSE) in my .Rprofile! I'm going to remove that line from my .Rprofile to make sure my code is portable in future, but it's probably a good idea to explicitly make it a factor in the function for others who might have the option set.
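
As an alternative that does not depend on the stringsAsFactors setting at all, the digit can be parsed out of the label directly. This is only a sketch; `div_to_tier()` is a hypothetical helper name, and it assumes the labels always look like "E" followed by a digit.

```r
# Hypothetical helper: convert "E0".."E3" to tier 1..4 by parsing the digit.
# as.character() makes it behave the same for factor and character input.
div_to_tier <- function(div) {
  as.integer(sub("^E", "", as.character(div))) + 1L
}

print(div_to_tier(c("E0", "E1", "E2", "E3")))
# 1 2 3 4

# Unlike factor codes, the result does not depend on which divisions are present.
print(div_to_tier(factor(c("E1", "E3"))))
# 2 4
```
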

JoGall commented 7 years ago

Also, I hadn't thought of what to do with this function during the off-season... Could we maybe check whether the england dataframe is already up-to-date before running? Something like:

england_current <- function(){

  df <- rbind(read.csv("http://www.football-data.co.uk/mmz4281/1617/E0.csv"),
              read.csv("http://www.football-data.co.uk/mmz4281/1617/E1.csv"),
              read.csv("http://www.football-data.co.uk/mmz4281/1617/E2.csv"),
              read.csv("http://www.football-data.co.uk/mmz4281/1617/E3.csv")
  )

  #compare the latest downloaded match date with the latest date in england
  #(as.Date() on england$Date in case it is stored as character)
  if(identical(max(as.Date(df$Date, "%d/%m/%y"), na.rm = TRUE),
               max(as.Date(england$Date), na.rm = TRUE))) {
    #message about being up to date
  } else {
    #rest of function
  }

}
jalapic commented 7 years ago

I think lots of people have that in their .Rprofile, so it's a good idea to make sure that the code is robust to it.

jalapic commented 7 years ago

Good idea for the function date check - I will implement that.
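
For reference, a minimal sketch of how such a date check could look, written so that it can be tried without downloading anything. `is_up_to_date()` is a hypothetical helper, and the sketch assumes the downloaded dates are "dd/mm/yy" strings and the stored dates are either Date class or "yyyy-mm-dd" strings.

```r
# Hypothetical helper: TRUE when the download contains no match later than
# the latest match already stored.
is_up_to_date <- function(downloaded_dates, stored_dates) {
  latest_new <- max(as.Date(downloaded_dates, format = "%d/%m/%y"), na.rm = TRUE)
  latest_old <- max(as.Date(stored_dates), na.rm = TRUE)
  latest_new <= latest_old
}

# Stored data already includes the last downloaded match day.
print(is_up_to_date(c("13/05/17", "21/05/17"), c("2017-05-13", "2017-05-21")))
# TRUE

# Stored data is one match day behind the download.
print(is_up_to_date(c("13/05/17", "21/05/17"), "2017-05-13"))
# FALSE
```

Using `<=` rather than identical() is a deliberate variation here: it also treats the stored data as current if it happens to be newer than the download.
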