taleinat / fuzzysearch

Find parts of long text or data, allowing for some changes/typos.
MIT License
301 stars 26 forks source link

Need to implement time limit for find_near_matches() #44

Open levitation opened 1 year ago

levitation commented 1 year ago

Need to implement time limit for find_near_matches() else the process may just suddenly hang with certain values of max_l_dist and not even debugger can pause it.

levitation commented 1 year ago

Currently I am using my own time limit handler. You can find the code here https://github.com/levitation-opensource/Manipulative-Expression-Recognition/blob/main/TimeLimit.py

But having a built-in time limit argument in the find_near_matches() function would make the behaviour of the function less surprising and thus error prone. People might otherwise not foresee that the find_near_matches() function may go into a long running loop / result in a hang.

The hang appears very suddenly. For example, on the data I was testing on, max_l_dist = 21 returned just as quickly as with smaller max_l_dist values, but max_l_dist = 22 computed indefinitely.

holub008 commented 9 months ago

For any application taking variable inputs (in the searched text or pattern, both in my case), it's essential to be able to give up and move on for "bad" inputs where there are exponentially many close matches per edit distance parameters.

I'm not terribly python proficient, but it seems challenging to implement a timeout in user space (hard to interrupt C callouts, platform dependent, thread safety, etc.). It would be awesome if this library could internally handle this! I'd be happy to take this on if there's interest from maintainers.

taleinat commented 9 months ago

Hi, maintainer here. You're welcome to give this a shot!

The issue is that this codebase currently includes quite a few different search functions, and find_near_matches() chooses the most appropriate among them. The implementation methods very a lot.

I highly recommend starting with one or two or the search methods, getting something working, and getting my feedback on it before progressing further.