Open teskuteyi opened 4 weeks ago
With `cudf.pandas`, the line `df1[key1].apply(lambda x: process.extract(x, s, scorer=fuzz.partial_ratio, limit=limit))` will not be able to run on the GPU because of limitations on what is supported in user-defined `apply` functions (this calls an external library that Numba cannot JIT-compile for the GPU). However, I think we have the tools needed to accelerate this computation.
From StackOverflow, it sounds like the partial ratio is computed using Levenshtein distances: https://stackoverflow.com/questions/53755558/need-more-understanding-on-python-fuzz-partial-ratio
cuDF has an `edit_distance` function, which computes the Levenshtein distance: https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/api/cudf.core.column.string.stringmethods.edit_distance/
Can you try using that and let us know?
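For reference, here is a minimal sketch of what that could look like. The `levenshtein` helper is a plain-Python CPU reference, included only so results can be sanity-checked on small inputs. `gpu_edit_distances` is a hypothetical wrapper name; it assumes `Series.str.edit_distance` accepts a single target string, as the linked docs describe. Note that this yields raw edit distances rather than `fuzz.partial_ratio` scores, so the ranking of matches may differ from the original script.

```python
def levenshtein(a: str, b: str) -> int:
    """CPU reference: classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def gpu_edit_distances(names, target):
    """Hypothetical wrapper: same distances on the GPU.

    Requires cudf and a CUDA device, so the import is deferred
    until the function is actually called.
    """
    import cudf

    return cudf.Series(names).str.edit_distance(target)


# Small CPU check: levenshtein("kitten", "sitting") == 3
```

On a machine with cuDF installed, `gpu_edit_distances(["kitten", "flaw"], "sitting")` should return one integer per input string, which can be compared against `levenshtein` on a small sample before scaling up to the full data.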
There is also a `jaccard_index` function. It looks like it is not documented, but it should be.
https://github.com/rapidsai/cudf/blob/4c04b7c8790263dc68c5753609f3cb867806359f/python/cudf/cudf/core/column/string.py#L5463
Here is the unit test, which can serve as a usage example: https://github.com/rapidsai/cudf/blob/4c04b7c8790263dc68c5753609f3cb867806359f/python/cudf/cudf/tests/text/test_text_methods.py#L1009
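A rough sketch of how `jaccard_index` might be used, based on my reading of the linked source and unit test: it appears to take a second strings Series plus a `width` (the number of characters per sliding window). The `char_ngrams` and `jaccard` helpers below are a plain-Python reference under the assumption that the index is computed over sets of character n-grams; `gpu_jaccard` is a hypothetical wrapper name, not part of cuDF.

```python
def char_ngrams(s: str, width: int) -> set:
    """All distinct substrings of length `width` (empty if s is too short)."""
    return {s[i:i + width] for i in range(len(s) - width + 1)}


def jaccard(a: str, b: str, width: int) -> float:
    """CPU reference: |A & B| / |A | B| over character n-gram sets."""
    grams_a, grams_b = char_ngrams(a, width), char_ngrams(b, width)
    union = grams_a | grams_b
    if not union:
        return 0.0  # assumption: treat two too-short strings as dissimilar
    return len(grams_a & grams_b) / len(union)


def gpu_jaccard(left, right, width):
    """Hypothetical wrapper: row-wise Jaccard index on the GPU.

    Assumes the signature jaccard_index(input, width) from the linked
    source; requires cudf and a CUDA device.
    """
    import cudf

    return cudf.Series(left).str.jaccard_index(cudf.Series(right), width)
```

The CPU helpers are only meant for checking a handful of rows; the GPU call compares the two Series row by row, so both inputs need the same length.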
@teskuteyi is there anything else you need here? For example, guidance on how to use the suggested features?
I am trying to run this anonymized script using cuDF or `cudf.pandas` with GPU acceleration. It currently runs for over 5 hours on 2 million rows of data.