rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License

Feature request: Time limit on long-running functions #338

Open shirakaba opened 1 year ago

shirakaba commented 1 year ago

Background

I'm using fuzz.partial_ratio_alignment() to align each paragraph of a novel to a timestamped transcription from the novel's corresponding audiobook. This will allow me to know how far into the book I've "read" whilst listening to its audiobook.

Approach

For each paragraph in the novel, I take a substring of ~128 characters to query against the full audio transcription (around 200,000 characters).

Here's the general idea, in code:

from rapidfuzz import fuzz

for paragraph in charlesDickensNovel:
    # Align the first ~128 characters of each paragraph against the
    # full audiobook transcription.
    alignment = fuzz.partial_ratio_alignment(
        paragraph[0:128],
        charlesDickensFullAudioTranscription,
    )

Problem

Most of the lines are aligned successfully within milliseconds (often far less), but perhaps 1% of these queries hang for several seconds.

I suspect that a smaller query would produce a quicker alignment, but I need to query at least about 128 characters to reduce false positives. There is also no way around the fact that I have to match against the whole audio transcription.

Requested solution

I don't actually need to match up every line; aligning a large proportion of them would be good enough for me to build a progress map. So I'd be happy to cancel any match that takes longer than 100 milliseconds and move on to the next one. Would it be possible to add an optional time limit to each of the RapidFuzz APIs? Something like this timeout parameter:

fuzz.partial_ratio_alignment(
    s1=line[0:128],
    s2=charlesDickensFullAudioTranscription,
    timeout=0.1,  # proposed parameter: give up after 100 ms
)

If the timeout is reached, it could simply return None without explanation, or (if it feels more Pythonic) raise an exception.

shirakaba commented 1 year ago

Follow-up: I found that the reason RapidFuzz was hanging was that I had initialised the substring incorrectly, so I was querying with a larger and larger string (thousands of characters instead of just 128). I therefore no longer need this feature myself.

However, it still seems like a sensible feature to implement, as protection against denial-of-service in applications where users can pass arbitrary search queries. For perspective, in the Node.js package ecosystem (npm), regular-expression denial-of-service (ReDoS) vulnerabilities are treated as severe.

maxbachmann commented 1 year ago

I am still not sure what to do in these cases. One basic issue right now is that when processing very long sequences there is no way to stop execution at all. When processing multiple sequences using e.g. process.cdist I do check for KeyboardInterrupt from time to time, but this is not yet done when processing very long sequences, since the calculation is handed off to C++. One of the challenges here is that the C++ implementation is used both standalone and in the Python wrapper, so I have a hard time extending it with features that are not needed in C++. That said, I have at least a rough idea of how I could implement the KeyboardInterrupt check in a compatible manner.
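
In the meantime, one user-side workaround is to run the scorer in a child process and abandon it on timeout: since the blocking C++ call cannot be interrupted in place, terminating the whole worker process is the only lever available. A minimal sketch (the helpers align_with_timeout and _align are illustrative, not part of RapidFuzz):

import multiprocessing as mp

from rapidfuzz import fuzz

def _align(s1, s2, queue):
    res = fuzz.partial_ratio_alignment(s1, s2)
    # Send plain data back to the parent; it only needs the score and offsets.
    queue.put(None if res is None else
              (res.score, res.src_start, res.src_end, res.dest_start, res.dest_end))

def align_with_timeout(s1, s2, timeout):
    queue = mp.Queue()
    proc = mp.Process(target=_align, args=(s1, s2, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        # The C++ computation cannot be interrupted from Python,
        # so abandon the whole worker process instead.
        proc.terminate()
        proc.join()
        return None
    return queue.get()

Note that spawning a process per query adds overhead, and on platforms using the spawn start method the call site must be guarded by if __name__ == "__main__".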

> If the timeout is reached, it could simply return None without explanation, or (if it feels more Pythonic) raise an exception.

Probably an exception would be the best option. Some scorers can already return None when no good-enough match is found, which would make it impossible to differentiate a timeout from a failed match. In addition, an exception would work better with the functions in the process module.
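
To illustrate the ambiguity with the existing score_cutoff behaviour:

from rapidfuzz import fuzz

# With score_cutoff set, partial_ratio_alignment already returns None
# when no good-enough match is found ...
res = fuzz.partial_ratio_alignment("hello", "goodbye", score_cutoff=90)
print(res)  # None
# ... so returning None on timeout as well would make the two cases
# indistinguishable. Raising an exception keeps them separate.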

I will keep this in the back of my mind, but I am still unsure whether I will add this feature. I would need to add the argument to every function, which adds complexity for all users. One compromise might be a global timeout value that the user can set for all internal scorer functions.
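
A hypothetical shape for that compromise might look like the following (nothing here exists in RapidFuzz today; set_global_timeout and the TimeoutError type are purely illustrative):

import rapidfuzz
from rapidfuzz import fuzz

# Hypothetical: one global knob instead of a timeout= argument on
# every single scorer function.
rapidfuzz.set_global_timeout(0.1)  # seconds, applied to all internal scorers

try:
    alignment = fuzz.partial_ratio_alignment(s1, s2)
except rapidfuzz.TimeoutError:  # hypothetical exception type
    alignment = None  # give up on this paragraph and move on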