taleinat / fuzzysearch

Find parts of long text or data, allowing for some changes/typos.
MIT License
301 stars, 26 forks

Can we get "dist" value in float instead of integer #26

Closed sasi143 closed 4 years ago

sasi143 commented 4 years ago

[Match(start=21, end=28, dist=6, matched=':228621'), Match(start=22, end=28, dist=5, matched='228621'), Match(start=23, end=28, dist=6, matched='28621'), Match(start=21, end=28, dist=6, matched=':228621'), Match(start=22, end=28, dist=5, matched='228621'), Match(start=23, end=28, dist=6, matched='28621'), Match(start=21, end=28, dist=6, matched=':228621'), Match(start=22, end=28, dist=5, matched='228621'), Match(start=23, end=28, dist=6, matched='28621'), Match(start=21, end=28, dist=6, matched=':228621'), Match(start=22, end=28, dist=5, matched='228621'), Match(start=23, end=28, dist=6, matched='28621'), Match(start=21, end=28, dist=6, matched=':228621'), Match(start=22, end=28, dist=5, matched='228621'), Match(start=23, end=28, dist=6, matched='28621'), Match(start=21, end=28, dist=6, matched=':228621'), Match(start=22, end=28, dist=5, matched='228621'), Match(start=23, end=28, dist=6, matched='28621'), Match(start=21, end=28, dist=6, matched=':228621'), Match(start=22, end=28, dist=5, matched='228621'), Match(start=23, end=28, dist=6, matched='28621')]

We are getting the dist value as an integer, which makes it hard to pick the best match: in the output above, 5 is the lowest dist value, but many matches come back with dist=5.

It would help if the dist value were a float, e.g. dist = 5.1, 5.4, or 6.1.

Please let me know what the possibilities are.

taleinat commented 4 years ago

Hi @sasi143,

Which function are you running to get that output? find_near_matches() makes sure not to return such overlapping matches, returning the best result (lowest distance, and the longest amongst those).

If you're running one of the internal functions, have you tried running the list of matches through fuzzysearch.common.consolidate_overlapping_matches()? You can see an example of doing this on the usage page of the docs.
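
To illustrate the selection rule described above (lowest distance, then the longest match), here is a minimal pure-Python sketch applied to the matches from the original post. It is not fuzzysearch's implementation: the `Match` namedtuple is a local stand-in for the library's match objects, and `pick_best_per_overlap_group` is a hypothetical helper name. In real code, the library's own `consolidate_overlapping_matches()` would be the thing to call.

```python
from collections import namedtuple

# Local stand-in for the library's match objects (an assumption for this sketch).
Match = namedtuple("Match", ["start", "end", "dist", "matched"])


def pick_best_per_overlap_group(matches):
    """Keep one match per group of overlapping matches.

    Within a group, prefer the lowest dist; break ties by the longest span.
    Hypothetical helper, not part of fuzzysearch.
    """
    def best_of(group):
        return min(group, key=lambda m: (m.dist, -(m.end - m.start)))

    result = []
    group, group_end = [], None
    # Drop exact duplicates, then sweep left to right over the spans.
    for m in sorted(set(matches), key=lambda m: (m.start, m.end)):
        if group and m.start < group_end:
            # Overlaps the current group: extend it.
            group.append(m)
            group_end = max(group_end, m.end)
        else:
            # Starts a new group: flush the previous one first.
            if group:
                result.append(best_of(group))
            group, group_end = [m], m.end
    if group:
        result.append(best_of(group))
    return result


# The three distinct matches from the post, repeated as in the reported output.
reported = [
    Match(21, 28, 6, ":228621"),
    Match(22, 28, 5, "228621"),
    Match(23, 28, 6, "28621"),
] * 7

best = pick_best_per_overlap_group(reported)
print(best)
```

With the reported output, all matches overlap and only one has dist=5, so a single match (`'228621'`) remains and no fractional distance is needed to choose between them.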

taleinat commented 4 years ago

@sasi143, in addition to the above, fuzzysearch consistently uses the Levenshtein distance, which is by definition integral. If you'd like to suggest supporting different distance metrics, that could be interesting, but please provide a reference to which specific distance metric you mean.
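
As a rough illustration of why the distance is always a whole number, here is a textbook dynamic-programming sketch of the Levenshtein distance. It counts single-character insertions, deletions, and substitutions, each costing exactly 1, so the total can only be an integer. This is a generic version written for this example (the function name `levenshtein` is just a label here), not fuzzysearch's own code.

```python
def levenshtein(a, b):
    """Return the Levenshtein (edit) distance between sequences a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("228621", ":228621"))  # 1
```

Every step adds 0 or 1 to an integer, so a value such as 5.1 or 5.4 cannot arise from this metric; a fractional score would need a different, explicitly specified metric.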

taleinat commented 4 years ago

I'm closing this due to the suggested behavior of returning a fractional distance being very unclear, and a lack of further response from the poster (@sasi143).