rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.24k stars, 884 forks

[FEA] Jaro-Winkler algorithm for cudf.core.column.string.StringMethods.edit_distance #6503

Open paulhendricks opened 3 years ago

paulhendricks commented 3 years ago

Is your feature request related to a problem? Please describe.

Add Jaro-Winkler algorithm for cudf.core.column.string.StringMethods.edit_distance.

Documentation: https://docs.rapids.ai/api/cudf/stable/api.html?highlight=tokenizer#cudf.core.column.string.StringMethods.edit_distance

Describe the solution you'd like

def edit_distance(targets, algorithm='levenshtein', **kwargs):
    ...

Parameters

targets : array-like, Sequence or Series or str
    The string(s) to measure against each string.
algorithm : str
    The algorithm to use: either 'levenshtein' (the default) or 'jarowinkler'.

Returns

Series or Index of int32.

Examples

Usage:

>>> import cudf
>>> sr = cudf.Series(["puppy", "doggy", "kitty"])
>>> targets = cudf.Series(["pup", "dogie", "kitten"])
>>> sr.str.edit_distance(targets=targets, algorithm='jarowinkler')
0    2
1    2
2    2
dtype: int32
>>> sr.str.edit_distance("puppy")
0    0
1    4
2    4
dtype: int32
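
For reference, here is a minimal CPU-only sketch of Jaro-Winkler in plain Python, useful for checking what a GPU implementation would need to return. The function names `jaro` and `jaro_winkler` and the parameters `prefix_scale` and `max_prefix` are illustrative assumptions, not part of the cuDF API; the values 0.1 and 4 are the conventional Winkler settings. Note that Jaro-Winkler yields a floating-point similarity in [0, 1] rather than an integer edit count, so the int32 outputs shown above would likely become floats for this algorithm.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Walk both strings' matched characters in order; each out-of-order
    # pair contributes half a transposition.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2

    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings sharing a common prefix."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * prefix_scale * (1.0 - sim)


if __name__ == "__main__":
    # Same string pairs as the example above.
    pairs = [("puppy", "pup"), ("doggy", "dogie"), ("kitty", "kitten")]
    for left, right in pairs:
        print(left, right, round(jaro_winkler(left, right), 4))
```

A GPU version would run the same per-pair logic with one thread (or warp) per string pair; this sketch only pins down the expected values.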

Describe alternatives you've considered

cuDF UDFs? Open to ideas.

Additional context

cc @beckernick @kkraus14

github-actions[bot] commented 3 years ago

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

ynandal99 commented 1 month ago

@paulhendricks Hi, I was wondering whether you ended up finding a workaround for Jaro-Winkler (JW) similarity in cudf? It'd be great to have this distance in cudf. @GregoryKimball