nflverse / nfl_data_py

Python code for working with NFL play by play data.
MIT License
252 stars, 48 forks

Added thread_requests parameter to import_pbp_data and import_weekly … #46

Closed bendominguez0111 closed 1 year ago

bendominguez0111 commented 1 year ago

…data functions.

Added an optional thread_requests parameter to import_pbp_data and import_weekly_data that uses threading to speed up requests for play-by-play and weekly data. I also tested async and multiprocessing, but threading posted the best results. Depending on the connection, it sped up loading PBP data from 1999 to 2022 by 25-50%.
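For anyone reading along, here is a minimal sketch of why threading helps with I/O-bound downloads like these. It is not the code in this PR: fetch_year is a hypothetical stand-in that sleeps instead of making an HTTP request, and the helper names, delay, and worker count are assumptions chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_year(year, delay=0.05):
    """Hypothetical stand-in for one HTTP download; sleeps to mimic network latency."""
    time.sleep(delay)
    return {"season": year, "rows": 10}


def load_sequential(years):
    # One request at a time: total time is roughly len(years) * delay.
    return [fetch_year(y) for y in years]


def load_threaded(years, max_workers=8):
    # Threads overlap the waiting time of I/O-bound requests.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_year, years))


def timed(fn, years):
    # Return the loader's result together with its wall-clock duration.
    start = time.perf_counter()
    result = fn(years)
    return result, time.perf_counter() - start


years = list(range(1999, 2007))
seq_result, seq_time = timed(load_sequential, years)
thr_result, thr_time = timed(load_threaded, years)
print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

The actual gain on real requests depends on the connection and the server, so the timings printed here say nothing about the 25-50% figure above.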

Also added associated tests. I added an __init__.py file to the tests folder because it was the only way I could get pytest to run (although I may have done something wrong there).

Did not open an issue for this beforehand, but spoke to @cooperdff on Twitter about the idea and thought it was a good idea.

Test results below:

[image: test results]

bendominguez0111 commented 1 year ago

Cool! I will make some of these changes. I actually thought up another optimization for this but never got to it: it removes the need to sort the CSVs at the end, which takes a considerable amount of time if you're pulling a lot of data. I'll add that in as well, along with the requested changes.

bendominguez0111 commented 1 year ago

Made those changes. I didn't move the if thread_requests block all the way down to the # load data comment because it didn't quite work, but the new code no longer bypasses caching. Also set engine = auto instead of explicitly setting pyarrow.

bendominguez0111 commented 1 year ago

Oh, and I added some logic compared to last time to avoid sorting years at the end. Since the HTTP requests can resolve out of order, the original version needed a sort at the end, which took some time. Now I just create a fixed-size list and insert each response into it as its thread resolves.
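A rough sketch of that fixed-size-list idea, under stated assumptions: fetch_year and load_in_order are hypothetical names, and the random sleep only imitates requests finishing out of order. The real functions in this PR download and parse data rather than returning strings.

```python
import random
import threading
import time


def fetch_year(year):
    """Hypothetical stand-in for one HTTP request; the random delay makes completions arrive out of order."""
    time.sleep(random.uniform(0.0, 0.05))
    return f"pbp_{year}"


def load_in_order(years):
    # Preallocate one slot per requested year, in request order.
    results = [None] * len(years)

    def worker(slot, year):
        # Each thread writes only to its own slot, so no final sort is needed.
        results[slot] = fetch_year(year)

    threads = [
        threading.Thread(target=worker, args=(slot, year))
        for slot, year in enumerate(years)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every slot has been filled
    return results


print(load_in_order([1999, 2000, 2001, 2002]))
```

Because each response lands at the index of its request, the output order matches the input order no matter which thread finishes first.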

bendominguez0111 commented 1 year ago

Committed those suggestions. Good call, those lines were a bit cluttered