jamesturk / jellyfish

🪼 a python library for doing approximate and phonetic matching of strings.
https://jamesturk.github.io/jellyfish/
MIT License

Idea for a further feature: similarity matrices #130

Open trichie opened 4 years ago

trichie commented 4 years ago

Dear James,

One of the bottlenecks of fuzzy comparison between two large lists of sizes n and m is the O(n*m) runtime: the comparison function -- e.g. jellyfish.jaro_winkler -- has to be called n*m times. So far I have not been able to vectorize this operation, i.e. to push the looping over the lists down into the much faster compiled cjellyfish implementation, as the jellyfish package does not yet seem to support vectorized operations, e.g. on NumPy arrays (as far as I was able to figure out). Just-in-time compilers such as Numba also seem to have trouble with the jellyfish function calls, so that route does not appear to work either.
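For concreteness, here is a minimal sketch of the kind of all-pairs loop I mean (the list contents are made up; the function name follows the current jellyfish API):

    import jellyfish

    names_a = ["martha", "jonathan", "catherine"]   # n entries
    names_b = ["marhta", "jon", "katherine"]        # m entries

    # Every pair is compared individually in Python, so the comparison
    # function is called n * m times -- this loop is the bottleneck.
    scores = [
        [jellyfish.jaro_winkler_similarity(a, b) for b in names_b]
        for a in names_a
    ]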

I am aware that this is no general cure: O(n*m) simply explodes with growing size, and beyond a certain point one needs to rethink the whole approach, e.g. by somehow making an intelligent pre-split of the lists. However, a speed gain of, say, factor 16 (I see roughly factor 22 on my machine when %timeit-ing the C-compiled jaro_winkler against the pure-Python jaro_winkler) would still make it possible to match 4 times larger lists in the same runtime, since growing both lists by a factor of 4 multiplies the work by 16. That would already help me a lot for the kind of lists I normally have to deal with.
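A rough sketch of the timing comparison I mean (the import path jellyfish._jellyfish for the pure-Python implementation is an assumption about how the package is laid out; on installs without the C extension both calls would hit the same code):

    import timeit
    import jellyfish                              # C-backed functions if available
    import jellyfish._jellyfish as py_jellyfish   # pure-Python implementation (assumed module path)

    # Time the same single comparison with both backends; the ratio is the per-call speed-up.
    c_time = timeit.timeit(lambda: jellyfish.jaro_winkler_similarity("martha", "marhta"), number=100_000)
    py_time = timeit.timeit(lambda: py_jellyfish.jaro_winkler_similarity("martha", "marhta"), number=100_000)
    print(f"pure Python / C runtime ratio: {py_time / c_time:.1f}")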

Below you find an idea for what could be added to the precompiled cjellyfish part of the package for a significant speed gain when searching for matches in lists. The return object would be an n*m matrix containing all the individual distances, which can then be processed further. Please note that if a list of strings is compared against itself (e.g. when scanning that list for duplicates), only the upper triangle of the matrix needs to be calculated, for symmetry reasons. This is achieved by setting flag = 1, which makes the inner loop start at i + 1 instead of 0. That takes only half the runtime, which is valuable since the whole point of this exercise is runtime optimization ;-)

Please see the attached .txt file (I was not allowed to upload a .py file) for a Python code suggestion of how this might be implemented. I am no professional programmer, just an actuary-turned-data-scientist with less than half a year of Python experience, so I am sure the implementation still has some flaws (e.g. no type checking, no error handling) and it also depends on numpy (which you might not want). It is just meant to show you the basic idea.

Regards Thomas / trichie

similarity_matrix.txt

jamesturk commented 4 years ago

Hi, thanks for taking the time to write-

I'm definitely open to a patch for this as long as numpy stays an optional dependency :) I'm not a numpy user myself, so someone with experience there would have to do that.
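One hedged sketch of what "optional numpy" could look like at the Python level (the function name similarity_matrix and the fallback behaviour are illustrative, not part of jellyfish):

    import jellyfish

    try:
        import numpy as np           # optional dependency
    except ImportError:              # numpy not installed: fall back to plain lists
        np = None

    def similarity_matrix(lst, oth, sim_func=jellyfish.jaro_winkler_similarity):
        """Return all pairwise similarities, as a numpy array if numpy is
        available, otherwise as a nested list."""
        rows = [[sim_func(a, b) for b in oth] for a in lst]
        return np.array(rows) if np is not None else rows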

trichie commented 4 years ago

Hi James, in the meantime I have become a bit more accustomed to Python ;-) This should actually do the trick without numpy, returning a 2D list.

    import jellyfish as jf

    def similarity_matrix(lst: list, oth=None, sim_func=jf.jaro_winkler_similarity,
                          ignore_case=True, **kwargs):
        # Optionally lower-case all strings so the comparison is case-insensitive.
        lst = [x.lower() if isinstance(x, str) else x for x in lst] if ignore_case else lst
        if oth is None:
            oth = lst
        elif ignore_case:
            oth = [x.lower() if isinstance(x, str) else x for x in oth]
        # When a list is compared against itself, only the upper triangle is needed:
        # flag = 1 starts the inner loop at i + 1, so all other entries stay 0.
        flag = 1 if oth == lst else 0
        sim_mat = [[sim_func(lst[i], oth[j], **kwargs) if j >= flag * (i + 1) else 0
                    for j in range(len(oth))]
                   for i in range(len(lst))]
        return sim_mat
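A quick usage sketch (values are only spelled out where they are unambiguous; identical strings have a Jaro-Winkler similarity of 1.0):

    names = ["Martha", "martha", "Jones"]

    # Self-comparison: flag = 1, so only the upper triangle is filled in.
    mat = similarity_matrix(names)
    # mat[0][1] == 1.0 because both entries become "martha" after lower-casing;
    # mat[1][0] and the diagonal stay 0 thanks to the upper-triangle shortcut.

    # Comparing two different lists fills the full matrix.
    mat2 = similarity_matrix(["Martha"], ["marhta", "jones"], ignore_case=True)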