[FBref] NaNs found in 'standard' and 'playing_time' stat_types

spartanovo commented 2 years ago

Hello,

I have found a small bug when pulling data from FBref.com: NaN values appear in the MP columns for the stat_types standard and playing_time, including for players who played in the season.

I found this problem after I wrote a function to obtain multiple stat_types for multiple seasons and converted the DataFrames from a MultiIndex to a standard pandas DataFrame. At first I attributed the large number of NaNs to this transformation.

To troubleshoot, I did a single pull using the .read_player_season_stats(stat_type='standard') call on 2 seasons of data (1718 and 1819) and found NaN values in both the standalone MP column and the Playing Time MP column. Both players who played and players who did not play received NaN values in these columns. In the MP column under the "Playing Time" section I found 890 NaN values, and in the standalone 'MP' column I found 380 NaN values. I am transitioning from R to Python and have always used the flattened-style DataFrame in the past.
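The check described above can be reproduced without the attached file. The following is a minimal sketch on a small synthetic table: the column layout and the values are assumptions made up for illustration, not data pulled from FBref. It counts NaNs per column and flattens the two-level column index into single strings.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the pulled table (assumed layout): "MP" exists both
# as a standalone column and as a sub-column of "Playing Time".
columns = pd.MultiIndex.from_tuples(
    [("MP", ""), ("Playing Time", "MP"), ("Playing Time", "Min")]
)
demo = pd.DataFrame(
    [
        [38.0, np.nan, 3420.0],    # MP only in the standalone column
        [np.nan, 30.0, 2500.0],    # MP only under "Playing Time"
        [np.nan, np.nan, np.nan],  # no playing time recorded at all
    ],
    columns=columns,
)

# Number of NaN values in each (two-level) column.
nan_counts = demo.isna().sum()

# Flatten the column MultiIndex by joining the non-empty levels.
flat = demo.copy()
flat.columns = ["_".join(part for part in col if part) for col in demo.columns]

print(nan_counts)
print(list(flat.columns))
```

Counting NaNs before and after flattening gives the same totals, which helps separate a layout problem in the source data from a problem introduced by the flattening step.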

Attached is a csv file containing the aforementioned data.

Call:

import soccerdata as sd

fbref_test = sd.FBref(leagues=['ENG-Premier League'], seasons=['1718', '1819'])

hold = fbref_test.read_player_season_stats(stat_type='standard')
hold.head()

I greatly appreciate your assistance.

fbref_nan_bug_df.csv

probberechts commented 2 years ago

FBref uses a different layout for the 2017/18 and 2018/19 seasons. In the 2017/18 season, the "MP" column is a separate category, while in the 2018/19 season it is grouped under "Playing Time".

All you need is two lines of post-processing:

hold[("Playing Time", "MP")] = hold[("Playing Time", "MP")].fillna(hold["MP"])
hold = hold.drop(columns=["MP"])

I'll add this to the codebase later.
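To illustrate what the post-processing does, here is a minimal sketch on a synthetic table. The column layout (a standalone "MP" column with an empty second level next to "Playing Time" / "MP"), the row labels, and the values are assumptions for illustration; the actual frame returned by the library may be laid out differently. Note that drop returns a new DataFrame, so its result has to be assigned back.

```python
import numpy as np
import pandas as pd

# Synthetic two-season layout (assumed): one season stores MP standalone,
# the other stores it under "Playing Time".
columns = pd.MultiIndex.from_tuples([("MP", ""), ("Playing Time", "MP")])
hold_demo = pd.DataFrame(
    [[38.0, np.nan], [np.nan, 30.0], [np.nan, np.nan]],
    index=["player_1718", "player_1819", "player_unused"],  # hypothetical labels
    columns=columns,
)

# Fill gaps in the grouped column from the standalone column.
hold_demo[("Playing Time", "MP")] = hold_demo[("Playing Time", "MP")].fillna(
    hold_demo[("MP", "")]
)

# drop() is not in-place by default, so keep the returned frame.
hold_demo = hold_demo.drop(columns="MP", level=0)

print(hold_demo)
```

Rows with a value in either column end up with a single filled "Playing Time" / "MP" entry; rows that are NaN in both columns stay NaN.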

spartanovo commented 2 years ago

Awesome. That fixed the problem. Thank you @probberechts!

probberechts / soccerdata

[FBref] NaNs found in 'standard' and 'playing_time' stat_types #79