rapidfuzz / Levenshtein

The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity
https://rapidfuzz.github.io/Levenshtein
GNU General Public License v2.0

From 19.0.3 to 20.0.1 matching blocks now return empty list if no match found #32

Closed Otterpatsch closed 2 years ago

Otterpatsch commented 2 years ago

Hello there. First of all, thanks for your work, but I currently have an issue that forces me to pin your package version. In earlier versions, matching blocks returned something like the following:

matching_blocks=[(0, 0, 1), (3, 61, 0)]
matching_blocks=[(0, 0, 1), (3, 27, 0)]
matching_blocks=[(3, 34, 0)]  # no match found

now:

matching_blocks=[MatchingBlock(a=1, b=12, size=1), MatchingBlock(a=2, b=17, size=1), MatchingBlock(a=6, b=21, size=0)]
matching_blocks=[]  # I assume this is what is returned when no matching block is found at all

While one solution on my side would be to just append a tuple (len(a), len(b), 0), I thought I should rather open an issue and ask whether that change was intended.
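The workaround described above could be sketched roughly as follows. This is only an illustration: `ensure_sentinel` is a hypothetical helper name, not part of the library, and it assumes each block can be iterated like a 3-tuple.

```python
from typing import Sequence


def ensure_sentinel(
    blocks: Sequence[Sequence[int]], len_a: int, len_b: int
) -> list[tuple[int, ...]]:
    """Return the blocks as plain tuples, ending with (len_a, len_b, 0).

    Hypothetical helper: it only restores the zero-size terminating block
    when the input list does not already end with it.
    """
    result = [tuple(block) for block in blocks]
    sentinel = (len_a, len_b, 0)
    if not result or result[-1] != sentinel:
        result.append(sentinel)
    return result


# An empty result gets the terminating block back.
print(ensure_sentinel([], 3, 34))  # [(3, 34, 0)]

# A result that already ends with the terminating block is left unchanged.
print(ensure_sentinel([(0, 0, 1), (3, 61, 0)], 3, 61))  # [(0, 0, 1), (3, 61, 0)]
```

Here `len_a` and `len_b` stand for `len(a)` and `len(b)` of the two compared strings.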

maxbachmann commented 2 years ago

I hope MatchingBlock behaves similarly to a tuple. If some tuple feature is missing, that would be an issue.
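To make "behaves similarly to a tuple" concrete, here is a small sketch of the tuple features that calling code typically relies on. The `Block` class below is a hypothetical stand-in built on `typing.NamedTuple`; it is not the library's `MatchingBlock`, and whether `MatchingBlock` supports every one of these operations is an assumption to verify against the installed version.

```python
from typing import NamedTuple


class Block(NamedTuple):
    """Hypothetical stand-in, NOT the library's MatchingBlock class."""

    a: int
    b: int
    size: int


block = Block(a=1, b=12, size=1)

# Tuple features that calling code commonly relies on:
first, second, size = block        # unpacking
assert block[2] == size == 1       # integer indexing, as in match[2]
assert len(block) == 3             # length
assert block == (1, 12, 1)         # equality with a plain tuple
assert tuple(block) == (1, 12, 1)  # conversion to a plain tuple
```

Running the same checks against a real `MatchingBlock` instance would be one way to find out which tuple feature, if any, is missing.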

maxbachmann commented 2 years ago

Ah, you mean the empty case. I think this could be a bug. I will look into it tomorrow.

maxbachmann commented 2 years ago

I failed to reproduce this on my phone. Do you have an example?

Otterpatsch commented 2 years ago

Hmm, I don't really have an example I can share. So you don't get an empty list when there are no matching blocks? Then my assumption is wrong. I don't want to alarm you, but I updated to the new version and fixed the code on my side (it depended on the last element for information about the strings), and now our pipelines that use this code are failing (nothing else changed). So what gets matched in a lot of string comparisons changed for some reason. So yeah, I will swap back to 19.0.3.

Here is the usage of your functionality in our code

import Levenshtein


def _get_matching_blocks(query: str, text: str) -> list[tuple[int, int, int]]:
    """Get matching blocks between two strings."""
    edits = Levenshtein.editops(query, text)
    return Levenshtein.matching_blocks(edits, query, text)


def _get_matching_ratio(query: str, text: str) -> float:
    """Calculate how much of query is also contained in text."""
    matching_blocks = _get_matching_blocks(query=query, text=text)
    fragmentation_penalty = -len(matching_blocks) + 1
    return (
        2 * sum(match[2] for match in matching_blocks)
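A small sketch of why the missing terminating block matters for the snippet above: the `fragmentation_penalty` expression counts list entries, so it changes when the zero-size block disappears. Only that one expression is reproduced here; the rest of the ratio formula is not shown in the post and is not guessed at.

```python
def fragmentation_penalty(matching_blocks: list) -> int:
    # Same expression as in the snippet above.
    return -len(matching_blocks) + 1


# Old behaviour for "no match": only the zero-size terminating block.
with_sentinel = [(3, 34, 0)]
# Behaviour reported in this issue for "no match": an empty list.
without_sentinel = []

print(fragmentation_penalty(with_sentinel))     # 0
print(fragmentation_penalty(without_sentinel))  # 1
```

The sum over `match[2]` is unaffected, since the terminating block has size 0, but the penalty term shifts by one whenever that block is absent.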

maxbachmann commented 2 years ago

Ah, I had only tested rapidfuzz, since the Levenshtein library currently fails to install in Termux (I have not yet applied the fix I implemented in rapidfuzz here). Looking at the code, it appears the following line causes this issue: https://github.com/maxbachmann/Levenshtein/blob/aa4711fc2963f8a9947c69d7fe0210abbb30cb35/src/Levenshtein/__init__.py#L158

Otterpatsch commented 2 years ago

Alright, makes sense.

maxbachmann commented 2 years ago

This is fixed in v0.20.2