rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

String Similarity Library / Algorithm #17184

Open teskuteyi opened 4 weeks ago

teskuteyi commented 4 weeks ago

I am trying to run this anonymized script using cuDF or cudf.pandas with GPU acceleration. It currently runs for over 5 hours on 2 million rows of data.

import pandas as pd
from rapidfuzz import fuzz, process

def fuzzy_merge(df1, df2, key1, key2, threshold=95, limit=1):
    """
    Perform a fuzzy merge and return both the matched and unmatched DataFrames.
    """
    s = df2[key2].tolist()
    m = df1[key1].apply(lambda x: process.extract(x, s, scorer=fuzz.partial_ratio, limit=limit))
    df1['matches'] = m

    m2 = pd.DataFrame(df1['matches'].tolist(), index=df1.index)
    df1['match'] = m2[0].apply(lambda x: x[0] if x[1] >= threshold else None)

    # Drop the 'matches' column as it's no longer needed
    df1 = df1.drop(columns=['matches'])

    # Merge based on the 'match' column to include all rows from df1
    df_final = pd.merge(df1, df2, left_on='match', right_on=key2, how='left')

    # Identify matching and unmatching rows
    matching_rows = df_final.dropna(subset=[key2])  # Rows where the merge found a match
    unmatching_rows = df_final[df_final[key2].isna()]  # Rows without a match

    return matching_rows, unmatching_rows.drop(columns=df2.columns)

# File paths
input_file1 = 'csvfile1.csv'
input_file2 = 'csvfile2.csv'
output_file_matched = 'csvfile3.csv'
output_file_unmatched = 'csvfile4.csv'

# Specify the key columns in both DataFrames
key1 = 'column_name_in_df1'  # Replace with actual column name from df1
key2 = 'column_name_in_df2'  # Replace with actual column name from df2

try:
    # Load your datasets
    df1 = pd.read_csv(input_file1)
    df2 = pd.read_csv(input_file2)

    # Perform the fuzzy merge and separate the results
    matched_df, unmatched_df = fuzzy_merge(df1, df2, key1, key2, threshold=95)

    # Save the matching and unmatching rows to their respective CSV files
    matched_df.to_csv(output_file_matched, index=False)
    unmatched_df.to_csv(output_file_unmatched, index=False)

    print(f"Matched rows saved to {output_file_matched}")
    print(f"Unmatched rows saved to {output_file_unmatched}")
bdice commented 3 weeks ago

With cudf.pandas, the line df1[key1].apply(lambda x: process.extract(x, s, scorer=fuzz.partial_ratio, limit=limit)) will not run on the GPU because of limitations on what is supported in user-defined apply functions: the lambda calls an external library that Numba cannot JIT-compile for the GPU. However, I think we have the tools needed to accelerate this computation.

From StackOverflow, it sounds like the partial ratio is computed using Levenshtein distances: https://stackoverflow.com/questions/53755558/need-more-understanding-on-python-fuzz-partial-ratio

cuDF has an edit_distance function, which computes the Levenshtein distance. https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/api/cudf.core.column.string.stringmethods.edit_distance/

Can you try using that and let us know?
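
Roughly, the pairwise comparison could look like the sketch below (the column names name1/name2 are made up for illustration; a full cross join of 2 million rows against the second table would be far too large, so in practice you would batch df1 or pre-block candidate pairs first):

import cudf

# Toy frames with made-up column names; real data would come from read_csv.
df1 = cudf.DataFrame({"name1": ["puppy", "doggy", "kitten"]})
df2 = cudf.DataFrame({"name2": ["pup", "dog", "mitten"]})

# Cross join so every df1 string is paired with every df2 string.
# This produces len(df1) * len(df2) pairs -- batch or block for large data.
pairs = df1.assign(_k=1).merge(df2.assign(_k=1), on="_k").drop(columns="_k")

# Levenshtein edit distance for each pair; lower means more similar.
pairs["dist"] = pairs["name1"].str.edit_distance(pairs["name2"])

# Keep the lowest-distance df2 candidate for each df1 string.
best = pairs.sort_values("dist").drop_duplicates(subset="name1", keep="first")
print(best)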

davidwendt commented 3 weeks ago

There is also a jaccard_index function. It looks like it is not documented, but it should be. https://github.com/rapidsai/cudf/blob/4c04b7c8790263dc68c5753609f3cb867806359f/python/cudf/cudf/core/column/string.py#L5463

Here is the unit test, which can serve as example usage: https://github.com/rapidsai/cudf/blob/4c04b7c8790263dc68c5753609f3cb867806359f/python/cudf/cudf/tests/text/test_text_methods.py#L1009
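
For reference, a minimal sketch of its usage (it compares two same-length string columns using character substrings of the given width; higher values mean more similar):

import cudf

str1 = cudf.Series(["the brown dog", "jumped about"])
str2 = cudf.Series(["the black cat", "jumped around"])

# Jaccard similarity over 5-character substrings of corresponding rows.
print(str1.str.jaccard_index(str2, 5))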

wence- commented 2 days ago

@teskuteyi is there anything else you need here? For example, guidance on how to use the suggested features?