taleinat / fuzzysearch

Find parts of long text or data, allowing for some changes/typos.
MIT License

Feature Request: Report fuzzy matched chars #21

Closed georgh closed 4 years ago

georgh commented 4 years ago

I think it would be handy to have a field describing the fuzzy part of the match. You already report the distance, but in some cases it is a bit cumbersome to find the exact positions where the fuzzy part happened.

So for example: find_near_matches('I love you', 'I luve yuu XXXXXXX', max_l_dist=5) returns Match(start=0, end=10, dist=2, matched='I luve yuu'). It would be nice to also know the fuzzy positions: [3, 8] in this case.

It becomes a bit tricky if parts are missing or inserted, so maybe those could be reported separately? For example: find_near_matches('I love you', 'I luve yu XXXXXXX', max_l_dist=5) -> fuzzy_match = [3], fuzzy_missing = [(o,8)]

What do you think about it?

taleinat commented 4 years ago

Hi @georgh,

Comparing a matched sub-string with the original search string is just comparing two similar strings in terms of Levenshtein distance. That is already solved well by other libraries. Here are two examples:

Considering the above, I don't think it's worth the effort to add this to fuzzysearch. However, I think a good example showing how to do this would be a great addition to the docs - example code or a PR would be very welcome!
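For anyone landing here later, this is a minimal sketch of the kind of example the reply above asks for. It is not part of fuzzysearch: `edit_ops` is a hypothetical helper that aligns the search string with an already matched sub-string (such as the `matched` field of a returned Match) using a plain Levenshtein table and a backtrace. When several alignments have the same distance, it reports only one of them.

```python
def edit_ops(pattern, matched):
    """Align two strings and list where they differ.

    Hypothetical helper, not a fuzzysearch API. Returns a list of
    (operation, pattern_index, matched_index) tuples.
    """
    n, m = len(pattern), len(matched)

    # dist[i][j] is the Levenshtein distance between
    # pattern[:i] and matched[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if pattern[i - 1] == matched[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # keep or substitute
                dist[i - 1][j] + 1,         # pattern char missing from text
                dist[i][j - 1] + 1,         # extra char inserted in text
            )

    # Walk back from the bottom-right corner, preferring the diagonal.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if pattern[i - 1] == matched[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                if cost:
                    ops.append(("substitute", i - 1, j - 1))
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("missing", i - 1, j))
            i -= 1
        else:
            ops.append(("inserted", i, j - 1))
            j -= 1
    ops.reverse()
    return ops


print(edit_ops("I love you", "I luve yuu"))
# [('substitute', 3, 3), ('substitute', 8, 8)]
print(edit_ops("I love you", "I luve yu"))
# [('substitute', 3, 3), ('missing', 8, 8)]
```

The indices are relative to the matched sub-string, so adding the match's start offset gives positions in the full text.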

taleinat commented 4 years ago

I'm closing this since I currently don't see a real need to add such a feature, and there's been no further response from the poster (@georgh). Feel free to continue the discussion if needed, and I'll re-open the issue if necessary.