shufinskiy / nba-on-court

Python package for filling in information about players on court in NBA play-by-play data.

Reducing left_join_nbastats execution time from approximately 9 hours to 8 secs #1

temiwale88 opened this issue 2 days ago

temiwale88 commented 2 days ago

Thanks @shufinskiy for this project. It truly makes working with the nba stats datasets easier. I appreciate how you leverage some of the top libraries and repos for this work.

With that said -

Use case: To capture every play that has a video available and join that to the nbastats dataset.

Problem: Running left_join_nbastats on a single season (the 2023 season) took about 8 hours and 58 minutes on my PC. I really appreciate the logic behind it and the detailed explanation in @shufinskiy's Colab tutorial.

My solution: After examining the data for possible links, preferring direct matches over fuzzy linkages, I settled on performing the matching in stages.

TL;DR, the steps are:

  1. Find direct matches
  2. For the unmatched rows, perform fuzzy matches, recursively lowering the matching threshold until a set minimum threshold is reached
    • This part requires the rapidfuzz library, which is "mostly written in C++ and on top of this comes with a lot of Algorithmic improvements to make string matching even faster" (see the brief sketch after this list)
  3. Finally, concatenate the remaining unmatched rows from the dataset (nbastats) onto the resulting merged dataset.
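
For reference, here is a minimal standalone sketch of the two rapidfuzz calls used in Step 2 below (the strings are made up purely for illustration):

from rapidfuzz import fuzz, process

# Similarity between two event descriptions, as a score from 0 to 100
fuzz.ratio("jokic 12 driving floating jump shot", "jokic 12ft driving floating jump shot 2 pts")

# Best match from a list of candidates: returns a (match, score, index) tuple
candidates = [
    "murray rebound off jokic miss",
    "jokic 12ft driving floating jump shot 2 pts",
    "nuggets timeout regular",
]
process.extractOne("jokic 12 driving floating jump shot", candidates, scorer=fuzz.ratio)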

Rough code below, with help from my trusty friend ChatGPT. Make any modifications as needed. Happy to integrate this somehow into the code if needed @shufinskiy

Step 1: Find direct matches

This relies on GAME_ID (GAMEID in pbpstats), PERIOD, and a transformed column, NEW_DESCRIPTION (e.g. return f"{row['HOMEDESCRIPTION']}: {row['VISITORDESCRIPTION']}"), that combines HOMEDESCRIPTION, VISITORDESCRIPTION, and NEUTRALDESCRIPTION.

import re
import time
import pandas as pd

# Start timing
start_time = time.time()

nba_stats = nbastats.copy()
pbp_stats = pbpstats.copy()
# game_id = 22300001

# Create `NEW_DESCRIPTION` field in nba_stats
def create_description(row):
    if pd.notna(row['HOMEDESCRIPTION']) and pd.isna(row['VISITORDESCRIPTION']):
        return row['HOMEDESCRIPTION']
    elif pd.isna(row['HOMEDESCRIPTION']) and pd.notna(row['VISITORDESCRIPTION']):
        return row['VISITORDESCRIPTION']
    elif pd.notna(row['HOMEDESCRIPTION']) and pd.notna(row['VISITORDESCRIPTION']):
        return f"{row['HOMEDESCRIPTION']}: {row['VISITORDESCRIPTION']}"
    else:
        return row['NEUTRALDESCRIPTION']

# Normalize descriptions
def normalize_text(text):
    if pd.isna(text):
        return None
    # Convert to lowercase
    text = text.lower()
    # Remove special characters and extra spaces
    text = re.sub(r'[^a-z0-9\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

nba_stats = nba_stats.reset_index(drop=True)
pbp_stats = pbp_stats.reset_index(drop=True)

nba_stats['NEW_DESCRIPTION'] = nba_stats.apply(create_description, axis=1)

# Apply normalization to both descriptions
# (the .apply(normalize_text) calls are left commented out: results appear to be the same
#  without normalization, and skipping it reduces execution time)
nba_stats['normalized_description'] = nba_stats['NEW_DESCRIPTION']  # .apply(normalize_text)
pbp_stats['normalized_description'] = pbp_stats['DESCRIPTION']  # .apply(normalize_text)

print("done normalization")
# Append ':00' so pandas parses the clock strings into timedeltas;
# both frames get the same transform, so the comparisons below stay consistent
nba_stats['PCTIMESTRING'] = pd.to_timedelta(nba_stats['PCTIMESTRING'] + ':00')
pbp_stats['STARTTIME_TEMP'] = pd.to_timedelta(pbp_stats['STARTTIME'] + ':00')
pbp_stats['ENDTIME_TEMP'] = pd.to_timedelta(pbp_stats['ENDTIME'] + ':00')

# Join datasets on GAME_ID, PERIOD and the normalized description
merged_data = pd.merge(nba_stats, pbp_stats, 
                       left_on=['GAME_ID', 'PERIOD', 'normalized_description'], 
                       right_on=['GAMEID', 'PERIOD', 'normalized_description'])

# Filter events to match possession times
merged_data = merged_data[
    (merged_data['PCTIMESTRING'] >= merged_data['ENDTIME_TEMP']) &
    (merged_data['PCTIMESTRING'] <= merged_data['STARTTIME_TEMP'])
]

# Drop duplicate rows based on the 'URL' column, keeping the first occurrence
merged_data = merged_data.drop_duplicates(subset='URL', keep='first')

# End timing
end_time = time.time()

# Calculate and print elapsed time
print(f"Join operation took {end_time - start_time:.2f} seconds.")

merged_data
# --execution time: approx 5 secs
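
A quick sanity check (assuming the frames built above) shows how much is left for the fuzzy stage:

# Count distinct clip URLs matched directly vs. still unmatched
n_total = pbp_stats['URL'].nunique()
n_direct = merged_data['URL'].nunique()
print(f"Direct matches: {n_direct} of {n_total} clips; {n_total - n_direct} left for fuzzy matching.")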

Step 2: Perform fuzzy matches recursively

from rapidfuzz import fuzz, process
from datetime import timedelta

# Function to find the best fuzzy match for each unmatched description within a time window

def find_fuzzy_matches(row, nba_stats, time_window=10, initial_threshold=70, min_threshold=60, max_retries=5, debug=False):
    """
    Find fuzzy matches with a retry mechanism to progressively lower the threshold.

    Args:
        row (pd.Series): The row from `pbp_stats` containing match details.
        nba_stats (pd.DataFrame): The dataset to search for matches.
        time_window (int): Minutes to adjust STARTTIME and ENDTIME.
        initial_threshold (int): Starting fuzzy match threshold.
        min_threshold (int): Minimum allowable fuzzy match threshold.
        max_retries (int): Maximum number of retries with a lower threshold.
        debug (bool): If True, print diagnostic information.

    Returns:
        str or None: Best match description if found, otherwise None.
    """
    if pd.isna(row['normalized_description']):
        return None

    time_start = row['STARTTIME_TEMP'] + timedelta(minutes=time_window)
    time_end = row['ENDTIME_TEMP'] - timedelta(minutes=time_window)

    # Filter nba_stats by period and game. The commented-out PCTIMESTRING filter
    # would shrink the search space to the adjusted time window, but leaving it out
    # is more exhaustive and, in practice, more accurate.
    match_search_space = nba_stats[
        (nba_stats['PERIOD'] == row['PERIOD']) &
        (nba_stats['GAME_ID'] == row['GAMEID'])
        # & (nba_stats['PCTIMESTRING'] <= time_start)
        # & (nba_stats['PCTIMESTRING'] >= time_end)
    ]

    # Debug: Check if we have potential matches
    if match_search_space.empty:
        if debug:
            print(f"No potential matches for row: {row['normalized_description']}")
        return None

    # Debug: Inspect normalized_description of the row and potential matches
    if debug:
        print(f"Searching for match:")
        print(f"  Description: {row['normalized_description']}")
        # print(f"  Time Window: {time_start} to {time_end}")
        print(f"  Search Space Count: {match_search_space.shape[0]}")
        print("  Example Search Space:")
        for idx, desc in enumerate(match_search_space['normalized_description'].head(5), start=1):
            print(f"    {idx}. {desc}")
        print("-" * 50)  # Separator for better readability

    current_threshold = initial_threshold
    for attempt in range(max_retries):
        # Apply fuzzy matching on the descriptions
        # (the best candidate itself does not change between attempts;
        # only the acceptance threshold is lowered on each retry)
        match = process.extractOne(
            row['normalized_description'], 
            match_search_space['normalized_description'].tolist(), 
            scorer=fuzz.ratio
        )

        if match:
            best_match, score, _ = match
            if score >= current_threshold:
                if debug:
                    print("Fuzzy Match Found:")
                    print(f"  Attempt: {attempt + 1}")
                    print(f"  Potential Match Found: {match[0]} at threshold {current_threshold}")
                    print(f"  Match Score: {score}")
                    print(f"  Matched With: {best_match}")
                    print("-" * 50)  # Separator for better readability
                return best_match

        # Lower the threshold and retry
        current_threshold -= (initial_threshold - min_threshold) // max_retries
        current_threshold = max(current_threshold, min_threshold)
        if debug:
            print("Fuzzy Matching Retry:")
            print(f"  Attempt: {attempt + 1}")
            print(f"  Current Threshold: {current_threshold}")
            print("-" * 50)  # Separator for better readability

    if debug:
        print(f"No match found for: {row['normalized_description']} after {max_retries} retries.")
    return None

# Apply fuzzy matching to unmatched rows in pbp_stats
# (.copy() avoids pandas' SettingWithCopyWarning when the fuzzy_match column is added below)
unmatched_pbp_stats = pbp_stats[~pbp_stats['URL'].isin(merged_data['URL'].unique())].copy()

unmatched_pbp_stats['fuzzy_match'] = unmatched_pbp_stats.apply(
    lambda row: find_fuzzy_matches(row, nba_stats, time_window=10, initial_threshold=70, min_threshold=60, max_retries=3, debug=False), 
    axis=1
)

# Filter for rows with successful fuzzy matches
fuzzy_matched = unmatched_pbp_stats[~unmatched_pbp_stats['fuzzy_match'].isna()]

# Merge the fuzzy-matched rows into the nba_stats dataset
fuzzy_matched_merged = fuzzy_matched.merge(
    nba_stats,
    left_on=['fuzzy_match', 'PERIOD', 'GAMEID'],
    right_on=['normalized_description', 'PERIOD', 'GAME_ID'],
    suffixes=('_pbp', '_nba')
)

# Concatenate the newly matched rows back into the main merged_data
merged_data = pd.concat([merged_data, fuzzy_matched_merged], ignore_index=True)
merged_data
# --execution time: approx 0.4 secs

Step 3: Concatenate the remaining rows from nba_stats

# Identify the existing GAME_ID and EVENTNUM combinations in merged_data
existing_combinations = set(merged_data[['GAME_ID', 'EVENTNUM']].itertuples(index=False, name=None))
remaining_rows = nba_stats[~nba_stats[['GAME_ID', 'EVENTNUM']].apply(tuple, axis=1).isin(existing_combinations)]
merged_data = pd.concat([merged_data, remaining_rows], ignore_index=True)
# --execution time: approx 2 secs
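
As a final consistency check (assuming the frames built above), the staged result should still contain every (GAME_ID, EVENTNUM) pair from nba_stats, mirroring the left-join behavior of left_join_nbastats:

# Every nbastats event should survive the staged join
final_events = set(merged_data[['GAME_ID', 'EVENTNUM']].itertuples(index=False, name=None))
missing = [ev for ev in nba_stats[['GAME_ID', 'EVENTNUM']].itertuples(index=False, name=None)
           if ev not in final_events]
print(f"nbastats events missing from the final frame: {len(missing)}")  # expected: 0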

Ciao. Thank you again for this library!

shufinskiy commented 1 day ago

@temiwale88 thank you for using this library and for taking the time to work through it. I will definitely study your proposal for speeding it up and, if everything goes well, we will add it to the project. Will you be ready to make a PR on your own, or will I need to do it myself?

temiwale88 commented 1 day ago

I'd be happy to make a PR. But let me know if you'd like me to totally replace the left_join_nbastats method or make another one. This will definitely need you to test out my code as it's not a perfect match for left_join_nbastats as is.

shufinskiy commented 13 hours ago

@temiwale88, as soon as I finish the check, I'll let you know. For now, I'm thinking of creating a new function, for example fast_left_join_nbastats.

temiwale88 commented 8 hours ago

Noted. I'm game for any approach you'd like. Thanks again!