Open temiwale88 opened 2 days ago
@temiwale88 thank you for using this library and for taking the time to work through this. I will definitely study your proposal for speeding it up, and if everything goes well, we will add it to the project. Would you be willing to open the PR yourself, or should I do it?
I'd be happy to make a PR. But let me know if you'd like me to totally replace the `left_join_nbastats` method or make another one. This will definitely need you to test out my code, as it's not a perfect match for `left_join_nbastats` as is.
@temiwale88, as soon as I do the check, I'll let you know. For now I'm thinking of creating a new function, for example `fast_left_join_nbastats`.
Noted. I'm game for any approach you'd like. Thanks again!
Thanks @shufinskiy for this project. It truly makes working with the nba stats datasets easier. I appreciate how you leverage some of the top libraries and repos for this work.
With that said -
Use case: To capture every play that has a video available and join that to the `nbastats` dataset.
Problem: Running `left_join_nbastats` on a single season (the 2023 season) took about 8 hours and 58 minutes on my PC. I really appreciate the logic behind it and the detailed explanation in @shufinskiy's Colab tutorial.
My solution: After examining the data for possible links, preferring direct matches over fuzzy linkages, I settled on performing the matching in stages.
TLDR, the steps are:
1. Find direct matches.
2. Perform fuzzy matches using the `rapidfuzz` library, which is "mostly written in C++ and on top of this comes with a lot of Algorithmic improvements to make string matching even faster".
3. Concatenate the remainder data (from `nbastats`) to the resulting merged dataset.
Rough code below, with help from the trusty friend ChatGPT. Make any modifications as needed. Happy to integrate this somehow into the code if needed @shufinskiy
Step 1: Find direct matches
This relies on `GAMEID`, `PERIOD`, and a transformed column, `NEW_DESCRIPTION` (e.g. `return f"{row['HOMEDESCRIPTION']}: {row['VISITORDESCRIPTION']}"`), that combines `HOMEDESCRIPTION`, `VISITORDESCRIPTION`, and `NEUTRALDESCRIPTION`.
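For readers following along, here is a minimal sketch of what a direct-match stage could look like with pandas. It is not the code from this issue: the exact way the three description columns are combined, the `payload_cols` parameter, and the `OTHER_ID` column are assumptions made for illustration, and the second dataset is assumed to already carry a `NEW_DESCRIPTION` key built the same way.

```python
import pandas as pd

KEYS = ["GAMEID", "PERIOD", "NEW_DESCRIPTION"]
DESC_COLS = ["HOMEDESCRIPTION", "VISITORDESCRIPTION", "NEUTRALDESCRIPTION"]


def add_new_description(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a NEW_DESCRIPTION column joining the three descriptions."""
    out = df.copy()
    # Missing descriptions become empty strings so the key is always a plain string.
    texts = out[DESC_COLS].fillna("").astype(str)
    out["NEW_DESCRIPTION"] = [": ".join(vals) for vals in texts.itertuples(index=False)]
    return out


def direct_match(nbastats: pd.DataFrame, other: pd.DataFrame, payload_cols: list):
    """Exact join on KEYS; returns (matched rows, unmatched nbastats rows)."""
    # Only the keys and the payload are taken from `other`, which avoids
    # _x/_y suffixes on columns that exist in both frames.
    merged = nbastats.merge(
        other[KEYS + payload_cols], on=KEYS, how="left", indicator=True
    )
    matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
    unmatched = merged.loc[merged["_merge"] == "left_only", list(nbastats.columns)]
    return matched, unmatched


# Tiny illustrative input (made-up rows, not real play-by-play data).
nbastats = add_new_description(pd.DataFrame({
    "GAMEID": [1, 1, 1],
    "PERIOD": [1, 1, 2],
    "HOMEDESCRIPTION": ["Smith 3PT Jump Shot", None, "Jones REBOUND"],
    "VISITORDESCRIPTION": [None, "Lee Turnover", None],
    "NEUTRALDESCRIPTION": [None, None, None],
}))
other = pd.DataFrame({
    "GAMEID": [1],
    "PERIOD": [1],
    "NEW_DESCRIPTION": ["Smith 3PT Jump Shot: : "],
    "OTHER_ID": [10],  # hypothetical payload column
})
matched, unmatched = direct_match(nbastats, other, ["OTHER_ID"])
print(len(matched), "direct matches;", len(unmatched), "rows left for fuzzy matching")
```

One caveat with this approach: if the same description occurs more than once within a game and period, an exact merge multiplies rows, so duplicates on the key columns may need to be handled before merging.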
Step 2: Perform fuzzy matches recursively
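The original snippet for this step is not reproduced here, so the following is only a sketch of one possible reading of "recursively": repeated passes over the still-unmatched rows, each with a lower similarity cutoff, comparing only rows from the same `GAMEID` and `PERIOD`. The function names, the threshold values, and the list-of-dicts input are assumptions. It uses `rapidfuzz.fuzz.ratio` when the library is installed and falls back to the standard library's `difflib` otherwise, so scores can differ slightly between the two.

```python
from collections import defaultdict

try:
    from rapidfuzz import fuzz

    def similarity(a: str, b: str) -> float:
        """Similarity score in the range 0..100."""
        return fuzz.ratio(a, b)
except ImportError:  # fallback so the sketch still runs without rapidfuzz
    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        """Similarity score in the range 0..100."""
        return 100.0 * SequenceMatcher(None, a, b).ratio()


def fuzzy_match(left, right, thresholds=(90, 80, 70)):
    """Greedy staged matching.

    left, right: lists of dicts with GAMEID, PERIOD and NEW_DESCRIPTION.
    Returns (pairs, remaining): pairs maps a left index to a right index,
    remaining lists the left indices that never reached any cutoff.
    """
    # Bucket candidates so a row is only compared within its own game and period.
    buckets = defaultdict(list)
    for j, row in enumerate(right):
        buckets[(row["GAMEID"], row["PERIOD"])].append(j)

    pairs, used = {}, set()
    remaining = list(range(len(left)))
    for cutoff in thresholds:  # strictest pass first, then progressively looser
        still_unmatched = []
        for i in remaining:
            row = left[i]
            best_j, best_score = None, cutoff
            for j in buckets.get((row["GAMEID"], row["PERIOD"]), []):
                if j in used:  # each right-hand row is matched at most once
                    continue
                score = similarity(row["NEW_DESCRIPTION"], right[j]["NEW_DESCRIPTION"])
                if score >= best_score:
                    best_j, best_score = j, score
            if best_j is None:
                still_unmatched.append(i)
            else:
                pairs[i] = best_j
                used.add(best_j)
        remaining = still_unmatched
    return pairs, remaining
```

Restricting comparisons to a single game and period is what keeps the number of string comparisons small; the matching itself is greedy, so an early high-confidence pair can take a candidate that a later row would have matched better.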
Step 3: Concatenate the remainder data from `nba_stats`
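As a rough illustration of this last step, the sketch below appends the rows that neither matching stage linked to the merged result with `pandas.concat`. The `ROW_ID` column is a hypothetical unique row identifier introduced for this example (the issue does not say which column identifies a play), and `OTHER_ID` is a made-up payload column.

```python
import pandas as pd


def append_remainder(merged: pd.DataFrame, nbastats: pd.DataFrame,
                     id_col: str = "ROW_ID") -> pd.DataFrame:
    """Add nbastats rows that are absent from `merged`, restoring original order."""
    # Rows whose identifier never made it into the merged result.
    leftover = nbastats[~nbastats[id_col].isin(merged[id_col])]
    # concat aligns on column names; columns that only exist in `merged`
    # are filled with NaN for the leftover rows.
    combined = pd.concat([merged, leftover], ignore_index=True, sort=False)
    return combined.sort_values(id_col).reset_index(drop=True)


# Made-up rows for illustration only.
nbastats = pd.DataFrame({
    "ROW_ID": [0, 1, 2, 3],
    "NEW_DESCRIPTION": ["a", "b", "c", "d"],
})
merged = pd.DataFrame({
    "ROW_ID": [0, 2],
    "NEW_DESCRIPTION": ["a", "c"],
    "OTHER_ID": [10, 11],
})
full = append_remainder(merged, nbastats)
print(full)
```

The result has one row per `nbastats` play, like a left join would, with the extra columns left empty where no match was found.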
Ciao. Thank you again for this library!