alecglen closed this 1 year ago
I don't use R or I'd open a PR, but FWIW here's the algorithm/regex I'm using locally with good results.
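import re

import numpy as np


def parse_precipitation(play):
    # Flag plays whose weather string mentions rain or snow.
    weather = play['weather']
    if not isinstance(weather, str):
        return False
    return (
        'Rain' in weather or
        'rain' in weather or
        'Snow' in weather or
        'snow' in weather
    )


def parse_temperature(play):
    # Prefer the existing temp value; otherwise parse it from the weather string.
    if not np.isnan(play['temp']):
        return play['temp']
    if isinstance(play['weather'], str):
        match = re.search(r'Temp: (-?\d+)°', play['weather'])
        if match:
            return int(match.group(1))


def parse_wind(play):
    # Prefer the existing wind value; otherwise take the last number after "Wind:".
    if not np.isnan(play['wind']):
        return play['wind']
    if isinstance(play['weather'], str):
        match = re.search(r'Wind:.* (\d+) ', play['weather'])
        if match:
            return int(match.group(1))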
This is example code to extract the parts of the weather string:
library(stringr)

df <- pbp |>
  dplyr::filter(!is.na(weather)) |>
  dplyr::distinct(season, game_id, weather) |>
  dplyr::mutate(
    # Indoor games report no usable outdoor temperature
    temp_f = dplyr::case_when(
      str_detect(weather, "Indoors") ~ NA_character_,
      TRUE ~ str_extract(weather, "(?<=Temp: )-?[:digit:]{1,3}")
    ),
    temp_f = as.numeric(temp_f),
    temp_c = (temp_f - 32) * 5 / 9,
    hum = str_extract(weather, "(?<=Humidity: )[:digit:]{1,3}"),
    hum = as.numeric(hum) / 100,
    wind = str_extract(weather, "(?<=Wind: ).+(?= mph)") |> str_trim()
  ) |>
  # Turn empty strings into NA column by column; newer dplyr releases
  # no longer accept a whole data frame in na_if()
  dplyr::mutate(dplyr::across(where(is.character), ~ dplyr::na_if(.x, ""))) |>
  dplyr::filter(!(is.na(temp_f) & is.na(hum) & is.na(wind)))
df
# A tibble: 2,723 x 7
game_id season weather temp_f temp_c hum wind
<chr> <int> <chr> <dbl> <dbl> <dbl> <chr>
1 2001_01_ATL_SF 2001 partly cloudy Temp: 68° F, Humidity: 63%, Wind: Southwest 12 MPH mph 68 20 0.63 Southwest 12~
2 2001_01_CAR_MIN 2001 Temp: 65° F, Wind: mph 65 18.3 NA NA
3 2001_01_CHI_BAL 2001 Mostly cloudy, highs in mid 80's Temp: 83° F, Humidity: 66%, Wind: Sou~ 83 28.3 0.66 South 10
4 2001_01_DET_GB 2001 Rain throughout game, heavy showers possible. Temp: 60° F, Humidity: 9~ 60 15.6 0.93 NW 5
5 2001_01_IND_NYJ 2001 Partly Sunny Temp: 81° F, Humidity: 81%, Wind: SW 6 mph mph 81 27.2 0.81 SW 6 mph
6 2001_01_MIA_TEN 2001 Partly Cloudy & Windy Temp: 81° F, Humidity: 69%, Wind: From the South~ 81 27.2 0.69 From the Sou~
7 2001_01_NE_CIN 2001 Partly clooudy, poossible showers/thunderstorms Temp: 79° F, Humidity:~ 79 26.1 0.87 S 8
8 2001_01_NO_BUF 2001 Sunny Temp: 87° F, Humidity: 52%, Wind: SW 10 mph 87 30.6 0.52 SW 10
9 2001_01_NYG_DEN 2001 Clear Temp: 75° F, Humidity: 18%, Wind: SE 9 mph 75 23.9 0.18 SE 9
10 2001_01_OAK_KC 2001 Mostly Sunny Temp: 64° F, Humidity: 78%, Wind: Northwest 12 mph 64 17.8 0.78 Northwest 12
# ... with 2,713 more rows
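For anyone working in Python rather than R, here is a rough port of the extraction above, using only the standard library. It is a sketch, not part of nflfastR: the name parse_weather and the returned dictionary keys are made up for illustration, and the regexes simply mirror the R patterns.

```python
import re

# Patterns mirror the R lookbehind expressions above.
TEMP_RE = re.compile(r"Temp: (-?\d{1,3})")
HUM_RE = re.compile(r"Humidity: (\d{1,3})")
WIND_RE = re.compile(r"Wind: (.+) mph")


def parse_weather(weather):
    """Split one weather string into temp_f, temp_c, hum and wind.

    Hypothetical helper: missing pieces come back as None.
    """
    temp_f = None
    if "Indoors" not in weather:  # same rule as the case_when() above
        m = TEMP_RE.search(weather)
        if m:
            temp_f = float(m.group(1))
    temp_c = None if temp_f is None else (temp_f - 32) * 5 / 9

    m = HUM_RE.search(weather)
    hum = int(m.group(1)) / 100 if m else None

    m = WIND_RE.search(weather)
    wind = m.group(1).strip() if m else None

    # Empty strings become None, like the na_if("") step in the R code.
    return {"temp_f": temp_f, "temp_c": temp_c, "hum": hum, "wind": wind or None}


row = parse_weather("Sunny Temp: 87° F, Humidity: 52%, Wind: SW 10 mph")
print(row["wind"])  # SW 10
```

Like the R version, this keeps the wind as a free-text string (direction plus speed), so pulling out a numeric speed would still need a second step.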
The pbp data set is already huge, and these additional variables would inflate it unnecessarily. I have provided code to extract the information from the weather string, so I am going to close this as not planned.
A lot of the play_by_play wind and temp data is missing, including all 2021 games. As @guga31bb mentioned in https://github.com/nflverse/nflfastR-data/issues/32, it's probably just due to what the NFL provides. However, for most of these cases, the data to populate the columns is readily available in weather. Just wanted to put it out there that these could be backfilled.