rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License

Extend process module #188

Open maxbachmann opened 2 years ago

maxbachmann commented 2 years ago

Currently the process module has the following functions:

| function | kind | explanation |
| --- | --- | --- |
| `extractOne` | one x many | returns the best match as `(choice, score, index/key)` |
| `extract` | one x many | returns the best matches up to `limit` as `list[(choice, score, index/key)]` |
| `extract_iter` | one x many | generator yielding `(choice, score, index/key)`; not really recommended, since it is far slower than the others |
| `cdist` | many x many | returns all results as a numpy matrix |

It would be nice to have equivalents of extractOne / extract for many x many. They would need less memory than cdist, which can take a large amount of memory when len(queries) and len(choices) are large.

| function | kind | explanation |
| --- | --- | --- |
| - | many x many | returns the best matches as `list[(choice, score, index)]` |
| - | many x many | returns the best matches up to `limit` as `list[list[(choice, score, index)]]` |
| - | one x many | returns all results without any sorting, like `cdist` |

A first thought might be to overload the existing extractOne / extract on the type passed as query / queries. However, this is not possible, since the following is a valid usage of these methods:

```python
extractOne(["hello", "world"], [["hello", "world"]])
```

which cannot be distinguished from a many x many call. For this reason these functions need a new API.

Beside this, in many cases users are not actually interested in the exact scores, but only care about finding elements whose score is better than score_cutoff. These cases could potentially be implemented more efficiently, since the implementation could quit as soon as it is known that a score is better than score_cutoff. Possible cases:

| function | kind | explanation |
| --- | --- | --- |
| - | many x many | returns a matrix of bool |
| - | one x many | returns a list of bool indicating which choices match (e.g. https://stackoverflow.com/questions/70770842/matching-strings-within-two-lists/70780527#70780527) |

This could be done automatically when the user passes `dtype=bool`.
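Until then, the bool case can be approximated with the existing cdist, since scores below score_cutoff are set to 0 in the result matrix. A minimal sketch (the data here is made up):

```python
# Minimal sketch: emulate the proposed dtype=bool behaviour with the
# existing cdist API (made-up data).
from rapidfuzz import fuzz, process

queries = ["hello world", "foo"]
choices = ["hello word", "bar", "fuu"]

# scores below score_cutoff come back as 0 in the result matrix
scores = process.cdist(queries, choices, scorer=fuzz.ratio, score_cutoff=80)
matches = scores > 0  # bool matrix: True where the score reached the cutoff
```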

Any suggestions on the naming of these new APIs are welcome.

maxbachmann commented 2 years ago

The API will follow the following scheme:

```python
class Compare:
    def __init__(self, scorer, *): ...  # '*' marks keyword-only options, elided in this sketch

    def set_seq1(self, a): ...       # single
    def set_seq1_list(self, a): ...  # many

    def set_seq2(self, b): ...       # single
    def set_seq2_list(self, b): ...  # many

    def all(self): ...
    def max(self, axis=None, initial=..., keepdims=False): ...
    def sorted(self, axis=None, keepdims=False, limit=None): ...
```
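To make this concrete, hypothetical usage might look like the following. This is a sketch only; the Compare class and these method names come from the proposal above and do not exist in rapidfuzz:

```python
# Hypothetical usage of the proposed API; none of this exists yet.
from rapidfuzz import fuzz

cmp = Compare(fuzz.ratio)
cmp.set_seq1_list(["hello", "world"])        # queries (many)
cmp.set_seq2_list(["hallo", "word", "dog"])  # choices (many)

scores = cmp.all()                  # full score matrix, like cdist
best = cmp.max(axis=1)              # best match per query
top2 = cmp.sorted(axis=1, limit=2)  # best 2 matches per query
```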
ghost commented 2 years ago

How about calling it extractMany? I would love to have this feature soon :)

ghost commented 2 years ago

Is there any chance it'll be in one of the upcoming releases, or are you not yet interested in implementing this? Thank you :)

maxbachmann commented 2 years ago

I am still interested in adding this feature. However I do not expect that I will have the time to implement this until later this year.

maxbachmann commented 2 years ago

For the max/sorted functions it would make sense to store both the score and the corresponding index in the results. However I am unsure how this should be stored in numpy/pandas/xarray/... to make further usage simpler.

I am considering a few options, but I do not have a lot of experience with the whole scientific Python stack. Is there a specific format for the results that would simplify further processing? @Unc3nZureD do you have any preference for the result format?

ghost commented 2 years ago

Sorry for the huge delay. Sadly I have no preference as I'm not familiar with them at all.

Currently it could help me with the case where I've got, say, 100 patterns and a "database" of, say, 1000 patterns. At the moment I'm looping over the 100 patterns in a single Python thread and checking the DB each time. It would be a lot more optimal if some multithreaded C/C++ code could do it and just return an array of bool or float (depending on whether I need the actual value or not).

Honestly speaking I'm not even sure why scientific libs would be necessary, but that may be because I'm a beginner.

Thank you!

maxbachmann commented 1 year ago

Some things about this could be simplified:

1. Only make this many x many. When users pass in a list with a single element, that is a good enough representation of 1 x many. The default of keepdims=False takes care of this anyway.

2. Allow dumping to a byte array and loading from it. This would allow generating optimized models ahead of time.

3. Similar to the changes in the process module, scorer kwargs should be passed via scorer_kwargs.

```python
class Compare:
    def __init__(self, choices, *, scorer, processor=None, scorer_kwargs=None): ...

    def all(self, queries): ...
    def max(self, queries, axis=None, initial=..., keepdims=False): ...
    def sorted(self, queries, axis=None, keepdims=False, limit=None): ...
```
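Hypothetical usage of this revised API (again, a sketch; none of these names exist in rapidfuzz yet):

```python
# Hypothetical usage of the revised proposal; still not implemented.
from rapidfuzz import fuzz

cmp = Compare(["hallo", "word", "dog"], scorer=fuzz.ratio,
              scorer_kwargs={"score_cutoff": 80})

queries = ["hello", "world"]
scores = cmp.all(queries)        # full score matrix, like cdist
best = cmp.max(queries, axis=1)  # best match per query

# 1 x many is just a single-element list; the keepdims=False default
# squeezes the extra dimension away
one = cmp.max(["hello"], axis=1)
```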
silenceOfTheLambda commented 2 months ago

I think multi-string inputs (i.e. many-to-many matches) would be a great addition to extract, extractIter, and extractOne:

Finding the best N matching strings for a single input string can already be done via extract with the limit argument. But finding the best N matching strings for M input strings currently requires calling extract() M times (once separately for each string). Alternatively this could be done via batch input matching using process.cdist and manual thresholding/sorting of the similarity matrix.
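For reference, the two current workarounds look roughly like this (made-up data; fuzz.ratio chosen arbitrarily):

```python
# The two workarounds described above, sketched with made-up data.
import numpy as np
from rapidfuzz import fuzz, process

queries = ["hello world", "foo bar"]       # M input strings
choices = ["hello word", "foobar", "baz"]

# 1) one extract() call per input string
per_query = [process.extract(q, choices, scorer=fuzz.ratio, limit=2)
             for q in queries]

# 2) one cdist() call plus manual sorting of the similarity matrix
scores = process.cdist(queries, choices, scorer=fuzz.ratio, workers=-1)
top2 = np.argsort(scores, axis=1)[:, ::-1][:, :2]  # best 2 choice indices per query
```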

However, allowing single-call extraction for multi-string inputs via the extract() functions (e.g. via list-like input, similar to process.cdist; not sure about the overloading issue mentioned above) would improve convenience and might also yield performance improvements compared to individual function calls.

The use case I can imagine (and currently have) is to identify, based on customer names, whether there are relations between a set of customers and which other customers any given input customer may be related to.

Side question: Aren't extractIter and extractOne() identical to extract() with limit=1, apart from maybe the return behavior for multiple same-similarity matches?

silenceOfTheLambda commented 2 months ago

Question about extract(): The table at the beginning of this issue mentions that extractIter() should not really be used due to its slowness.

However, in the source code (https://github.com/rapidfuzz/RapidFuzz/blob/main/src/rapidfuzz/process_py.py), extract() seems to fall back on extractOne (for limit == 1) and extract_iter (for limit != 1). Does that mean that extract() shouldn't really be used either? Testing a single string against ~161k strings, extract(limit=1) took ~8 µs, whereas extract(limit=2) took ~400 ms, i.e. ~50,000 times longer.

maxbachmann commented 2 months ago

Yes, I absolutely see why this class of APIs would be useful. While, as you described, it's possible to either call extract multiple times or use cdist, both approaches have their downsides: repeated extract calls add per-call overhead, while cdist can require a large amount of memory.

> Side question: Aren't extractIter and extractOne() identical to extract() with limit=1, apart from maybe the return behavior for multiple same-similarity matches?

Historically extractOne could be faster and use less memory. Nowadays, however, extract will call extractOne if limit=1 is passed, so extractOne is the same as extract(..., limit=1)[0]. extract_iter is a generator and performs the calculation on the fly. This doesn't really have any advantage, but is quite a bit slower. It only exists because a similar API existed in fuzzywuzzy, which this library was based on.
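In code, the equivalence reads (assuming a match exists; with a score_cutoff set, both can come back empty):

```python
# extractOne is the same as taking the first element of extract(limit=1)
from rapidfuzz import process

choices = ["hallo", "hello word", "dog"]
assert (process.extractOne("hello world", choices)
        == process.extract("hello world", choices, limit=1)[0])
```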

> However, in the source code (https://github.com/rapidfuzz/RapidFuzz/blob/main/src/rapidfuzz/process_py.py), extract() seems to fall back on extractOne (for limit == 1) and extract_iter (for limit != 1). Does that mean that extract() shouldn't really be used either? Testing a single string against ~161k strings, extract(limit=1) took ~8 µs, whereas extract(limit=2) took ~400 ms, i.e. ~50,000 times longer.

This is the pure Python fallback implementation, where it doesn't really make much of a difference and which is going to be significantly slower anyway. The C++ implementation in https://github.com/rapidfuzz/RapidFuzz/blob/main/src/rapidfuzz/process_cpp_impl.pyx doesn't do this.

The performance difference between extract(limit=1) and extract(limit=2) highly depends on the data used to test this: extract(limit=1) uses the best score found so far as score_cutoff for further calls. This can allow it to skip bad matches quite a bit faster, and in addition it can exit early once it finds a perfect match. extract(limit=2) calculates all scores, then sorts them and returns the best ones. It would probably make sense to include a mechanism similar to extract(limit=1) to find a score_cutoff. However this is a bit of a balance: while this score_cutoff can make things faster, it is not guaranteed to do so, and finding it would require maintaining a sorted list of the results, which certainly comes at a cost. For small limits it would probably still be worth it though.
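As an illustration, a pure-Python sketch of that extract(limit=1) strategy (illustrative only; the real implementation is in C++):

```python
# Illustrative pure-Python version of the extract(limit=1) strategy:
# feed the best score seen so far back in as score_cutoff, and stop
# early on a perfect match.
from rapidfuzz import fuzz

def extract_one_sketch(query, choices, scorer=fuzz.ratio):
    best = None
    best_score = 0
    for idx, choice in enumerate(choices):
        # scores below score_cutoff come back as 0, and the scorer can
        # often bail out early once the cutoff is out of reach
        score = scorer(query, choice, score_cutoff=best_score)
        if score > best_score or best is None:
            best, best_score = (choice, score, idx), score
            if best_score == 100:  # perfect match: stop immediately
                break
    return best
```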

silenceOfTheLambda commented 2 months ago

> The performance difference between extract(limit=1) and extract(limit=2) highly depends on the data used to test this: extract(limit=1) uses the best score found so far as score_cutoff for further calls. [...]

Turns out the cause was mainly the best-score short-circuit (which I hadn't known about), since I had the query string itself at the beginning of choices. With that string removed, the timings become much more comparable: ~250 ms for extractOne / extract(..., limit=1) and ~410 ms for extract(..., limit=2).

That being said, for matching 161k strings among themselves by repeatedly calling extractOne(), those timings suggest something like 161,000 × 250 ms ≈ 40,000 s ≈ 11 h.

On the other hand, a self-written argmax-based wrapper around cdist, run in batches of 1000 × 161,000 matchings (due to disk space limitations), takes on the order of ~100 s, i.e. a factor of ~400 faster. Not sure whether that is solely due to multiprocessing (workers=-1).
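For context, such a wrapper might look roughly like this (the fuzzy_match helper itself isn't shown in the issue, so this shape, including the function name and scorer, is an assumption):

```python
# Sketch of an argmax-based, batched wrapper around cdist; assumed shape,
# not the actual fuzzy_match code from this thread.
import numpy as np
from rapidfuzz import fuzz, process

def best_match_batched(queries, choices, batch_size=1000):
    out = []
    for start in range(0, len(queries), batch_size):
        batch = queries[start:start + batch_size]
        scores = process.cdist(batch, choices, scorer=fuzz.WRatio, workers=-1)
        best_idx = np.argmax(scores, axis=1)  # best choice per query row
        out.extend(
            (choices[j], scores[i, j], j) for i, j in enumerate(best_idx)
        )
    return out  # one (choice, score, index) triple per query
```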

maxbachmann commented 2 months ago

> On the other hand, a self-written argmax-based wrapper around cdist, run in batches of 1000 × 161,000 matchings (due to disk space limitations), takes on the order of ~100 s, i.e. a factor of ~400 faster. Not sure whether that is solely due to multiprocessing (workers=-1).

Which scorer function did you use for the test, approximately how long were your strings, and did you use a preprocessing function?

silenceOfTheLambda commented 2 months ago

@maxbachmann: I installed rapidfuzz 3.9.7 today and later switched back to 3.9.3. Unfortunately I can't seem to reproduce the timings from yesterday (which were also in 3.9.3), for whatever reason.

However, I created a new timing script, which (apart from loading the strings) is shown below. There are two test cases: test case A uses the loaded real-world strings, while test case B uses synthetic repeated strings ('tree' * 7 vs. 'house' * 3).

In both test cases, the performance of extractOne() and extract() is compared to a custom cdist-based wrapper fuzzy_match() (with workers=-1), extracting either the best-matching string (mode='extract_best') or all matching strings above the (default) similarity threshold of 95 (mode='extract_all'). The timing output is provided below the code.

Code:


```python
# Timing script (run in IPython/Jupyter; %timeit is an IPython magic).
# `strings` is loaded beforehand; fuzzy_match() is the custom
# cdist-based wrapper (not shown in the issue).
import time

import numpy as np
import pandas as pd

import rapidfuzz
from rapidfuzz import process

# ----------- #
# test case A #
# ----------- #

test_string = strings[0]
other_strings = strings[1:]
string_lengths = [len(elem) for elem in other_strings]
test_series = pd.Series([test_string] * len(other_strings))

print(rapidfuzz.__version__)

print('------------')
print('Test case A:')
print('------------')
print(f'# of strings: {len(other_strings)}')
print(f'length of first elem.: {len(test_string)}')
print(f'avg. / median of string lengths: {np.mean(string_lengths)} / {np.median(string_lengths)}')

%timeit process.extractOne(test_string, other_strings)
%timeit process.extract(test_string, other_strings, limit=1)
%timeit process.extract(test_string, other_strings, limit=2)

t0 = time.time()
t_dict, _, _ = fuzzy_match(test_series, strings[1:], mode='extract_best')
t1 = time.time()
print(t1 - t0)
t0 = time.time()
t_dict, _ = fuzzy_match(test_series, strings[1:], mode='extract_all')
t1 = time.time()
print(t1 - t0)

# all strings against one another
t0 = time.time()
t_dict, _, _ = fuzzy_match(strings, mode='extract_best')
t1 = time.time()
print(t1 - t0)
t0 = time.time()
t_dict, _ = fuzzy_match(strings, mode='extract_all')
t1 = time.time()
print(t1 - t0)

# ----------- #
# test case B #
# ----------- #

test_string = 'tree' * 7
test_series = pd.Series([test_string] * len(other_strings))
other_strings = pd.Series(['house' * 3] * len(other_strings))

string_lengths = [len(elem) for elem in other_strings]

print('------------')
print('Test case B:')
print('------------')
print(f'# of strings: {len(other_strings)}')
print(f'length of first elem.: {len(test_string)}')
print(f'avg. / median of string lengths: {np.mean(string_lengths)} / {np.median(string_lengths)}')

%timeit process.extractOne(test_string, other_strings)
%timeit process.extract(test_string, other_strings, limit=1)
%timeit process.extract(test_string, other_strings, limit=2)

t0 = time.time()
t_dict, _, _ = fuzzy_match(test_series, strings[1:], mode='extract_best')
t1 = time.time()
print(t1 - t0)
t0 = time.time()
t_dict, _ = fuzzy_match(test_series, strings[1:], mode='extract_all')
t1 = time.time()
print(t1 - t0)
```

Terminal output: (screenshot in the original issue; the relevant timings are summarized below)

Test case A: The timings of extractOne() and extract(..., limit=1) are very comparable, at ~1 ms, whereas extract(..., limit=2) takes ~800× as long (between 0.5 and 1 s). For 161k identical runs, we would thus expect extractOne() and extract(..., limit=1) to take ~161 s, whereas extract(..., limit=2) would take approx. 130,000 s ≈ 36 h. Compare this to fuzzy_match(), which takes only slightly longer to extract all matching strings (~151 s) than to extract the best-matching string (~134 s). When running all strings against one another, fuzzy_match() needs a bit less time for both best- and all-string extraction (~120 s), which is to be expected, since the average/median string length of approx. 17/18 is shorter than the 29 characters of the test string.

Test case B: Here extractOne() and extract(..., limit=1) take much longer than in test case A, coming much closer to extract(..., limit=2), whereas the performance of fuzzy_match() is very much comparable to its timing in test case A.

Summary: In test case A, extract(..., limit=2) appears to take orders of magnitude longer than extractOne() and extract(..., limit=1). I don't think that should be the case. In test case B, even extractOne() and extract(..., limit=1) take far longer than they probably should.