jamesturk / jellyfish

🪼 a python library for doing approximate and phonetic matching of strings.
https://jamesturk.github.io/jellyfish/
MIT License
2.04k stars, 157 forks

Add jaccard similarity #186

Closed (RossKen closed this 1 month ago)

RossKen commented 1 year ago

Thanks for a great package! I am planning to use it for some of my work on the record linkage package Splink.

It would be really great to add jaccard similarity as an option within jellyfish.

I can give a PR a shot, but I haven't done any Rust before so I can't guarantee how well (or quickly) I would do it 😅

jamesturk commented 1 year ago

I'd definitely be open to this one. It's a little unorthodox for strings, but I think it's simple & well-defined enough to be useful. Would you compute it between n-grams? (Presumably with a tunable n?)

NiklasvonM commented 1 month ago

+1 for this feature request. I'd suggest not using n-grams by default, but enabling them when the parameter n is set. Example signature:

def jaccard_similarity(str1: str, str2: str, ngram_size: int | None = None) -> float:
    ...
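
To make the proposal concrete, here is a minimal sketch of how a function with that signature might behave. It is not the jellyfish implementation. The name `jaccard_sketch` and the helper `char_ngrams` are hypothetical, and two behaviors are assumptions the thread does not settle: with `ngram_size=None` the strings are compared as sets of whitespace-separated words, and two inputs that both produce an empty set score 0.0.

```python
from __future__ import annotations


def char_ngrams(text: str, n: int) -> set[str]:
    """Return the set of character n-grams of length n found in text."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A string shorter than n yields an empty set.
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_sketch(str1: str, str2: str, ngram_size: int | None = None) -> float:
    """Jaccard similarity |A & B| / |A | B| between two token sets."""
    if ngram_size is None:
        # Assumption: the default tokenization is whitespace-separated words.
        a, b = set(str1.split()), set(str2.split())
    else:
        a = char_ngrams(str1, ngram_size)
        b = char_ngrams(str2, ngram_size)
    union = a | b
    if not union:
        # Assumption: two empty token sets are scored as 0.0 rather than 1.0.
        return 0.0
    return len(a & b) / len(union)
```

For instance, `jaccard_sketch("a b c", "b c d")` compares the word sets {a, b, c} and {b, c, d}, which share two of four distinct words, while `jaccard_sketch("night", "nacht", 2)` compares character bigrams, of which only "ht" is common to both.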

jamesturk commented 1 month ago

released in 1.1.0! thanks @NiklasvonM