taleinat / fuzzysearch

Find parts of long text or data, allowing for some changes/typos.
MIT License
301 stars, 26 forks

Possible to exclude characters from search? #14

Closed: nodice73 closed this issue 5 years ago

nodice73 commented 5 years ago

Hi there,

Does fuzzysearch.find_near_matches exclude any characters before searching? If not, can it be made to do so?

I ask because I'm searching for many sub-strings in a much larger string that I made from joining a list of strings with '|' characters. I don't want the '|' characters to be included when it looks for differences. I'm doing this to avoid looping through both lists for every search.

If there is a better way to do it, please let me know.

Thank you!

taleinat commented 5 years ago

Hi Adam,

There is no option to exclude or ignore certain characters in fuzzysearch.

I recommend simply looping over both lists. Your approach has several disadvantages, such as making it rather difficult to figure out which string a match was found in and where that match was.
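A minimal sketch of the nested-loop approach, for illustration only. It assumes fuzzysearch's `find_near_matches(subsequence, sequence, max_l_dist=...)` and the `start`, `end` and `dist` attributes of its match objects; the `Hit` record, the `search_all` helper and its `finder` parameter are hypothetical names introduced here, not part of the library.

```python
from collections import namedtuple

# Hypothetical record type for this sketch; not part of fuzzysearch.
Hit = namedtuple("Hit", ["pattern", "text_index", "start", "end", "dist"])


def search_all(patterns, texts, max_l_dist=1, finder=None):
    """Search for every pattern in every text, recording where each match was found."""
    if finder is None:
        # Imported lazily so a stand-in finder can be supplied instead.
        from fuzzysearch import find_near_matches as finder
    hits = []
    for text_index, text in enumerate(texts):
        for pattern in patterns:
            for match in finder(pattern, text, max_l_dist=max_l_dist):
                # Offsets are relative to the individual text, so no remapping is needed.
                hits.append(Hit(pattern, text_index, match.start, match.end, match.dist))
    return hits
```

Because each search runs against a single string, every hit already says which string it came from and where in that string it starts and ends.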

If you want to keep trying your approach with the concatenated strings, try joining them with a multi-character separator (e.g. '|||') with more characters than the maximum allowed Levenshtein distance. This will ensure that you don't receive fuzzy matches which cross a separator. As long as the separator character doesn't appear in the searched sub-strings, this should do what you want.
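The separator idea can be sketched with the standard library alone. The helpers below (`join_with_offsets` and `locate`, both hypothetical names) build a separator one character longer than the maximum Levenshtein distance and record where each original string starts, so that an offset in the joined string can be mapped back to its source string.

```python
from bisect import bisect_right


def join_with_offsets(texts, max_l_dist, sep_char="|"):
    """Join texts with a separator longer than max_l_dist.

    Returns the joined string and the start offset of each text within it.
    """
    sep = sep_char * (max_l_dist + 1)
    starts = []
    pos = 0
    for text in texts:
        starts.append(pos)
        pos += len(text) + len(sep)
    return sep.join(texts), starts


def locate(starts, position):
    """Map an offset in the joined string to (text index, offset within that text)."""
    index = bisect_right(starts, position) - 1
    return index, position - starts[index]
```

The start offset of a match found in the joined string can then be passed to `locate`. Note that a fuzzy match may still begin or end on a separator character at the edge of a string, in which case the local offset can fall at or past the end of that string; the sketch does not trim such matches.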

I am considering adding special support for searching for multiple sub-sequences in multiple sequences, which is precisely what you would have liked to use. There are different algorithms which can be used in such cases to make the search much more efficient.