nflverse / nfl_data_py

Python code for working with NFL play by play data.
MIT License
259 stars 50 forks source link

Reduce memory usage of multi-year pbp dataframe #9

Closed sansbacon closed 3 years ago

sansbacon commented 3 years ago

If you load 20 years of data into one dataframe, you could start pushing up against memory limits on certain users' computers. You can reduce memory usage about 30% by converting float64 values to float32 when loading the yearly play-by-play data.

cols = df.select_dtypes(include=[np.float64]).columns
df.loc[:, cols] = df.loc[:, cols].astype(np.float32)

On my computer, this reduced the memory usage of a single year from 129.7 MB to 94.5 MB. I don't think the lost precision is going to matter for anything we are doing with football stats.

If you are interested, I can submit a pull request that implements this change. You could also make it optional with the default being to downcast but allow the user to override if they want np.float64 as the dtype.

cooperdff commented 3 years ago

That sounds great, please feel free to submit that request when you've got time.

cooperdff commented 3 years ago

Change has been implemented.