taleinat / fuzzysearch

Find parts of long text or data, allowing for some changes/typos.
MIT License
301 stars 26 forks source link

Function find_near_matches is alternating results for the same situation #40

Open RagnarCris opened 2 years ago

RagnarCris commented 2 years ago

Hi, my name is Cristiano and I'm using the library to determine how far someone has read, by comparing a transcript to a text to be read. And strangely, for the same parameters and strings, the function find_near_matches returned two different results, in 5 to 6 times that i runned the script.

I'm using the version 0.7.3 of fuzzysearch. I'm leaving an example of a script in which this case is happening. If you want, just run around 7 times to see the results changing at least two times.

Thanks in advance!

Example:

from fuzzysearch import find_near_matches

def fuzzy_extract(qs, ls, threshold):
    for match in find_near_matches(qs, ls, max_l_dist=10):
        match = ls[match.start:match.end]
        index = ls.find(match)
        yield (match, index)

ref = "bruna ficava perto da casa de fátima e brincava de bola perto da lagoa"

ret = list(fuzzy_extract("d dê néká m",ref,0))
print("fuzzy_search: "+str(ret))

P.S.: The text in the script is in Portuguese.

The output that i get by running 7 times this script is: output_strange_fuzz

taleinat commented 2 years ago

Hi Cristiano, thanks for this report!

That is indeed very strange. I have been able to reproduce this locally using your script; many thanks for that!

I'll look into this.

jcsilva commented 1 year ago

Hi @taleinat!

I'm facing a similar issue here. Do you have any suggestion on this?

taleinat commented 1 year ago

Hi,

Unfortunately I haven't had a chance to pin this down.

I will mention, though, that the original question is searching for a pattern of length 11 ("d dê néká m") with max_l_dist=10 (the threshold argument seems to be unused!). The results seem like an edge-case, and do seem to indicate a bug, but I also think that @RagnarCris was not running the fuzzy search they actually intended to run, and that's the main reason the results were not as expected.

@jcsilva, I suggest opening a separate issue with the relevant details, and I'll try to see what can be done.

taleinat commented 1 year ago

@RagnarCris, apologies for not having followed up your issue.

Is this still relevant for you?

jcsilva commented 1 year ago

Hi @taleinat ,

sorry for my late response. To provide a bit more of context, I'm trying to replace @RagnarCris on this topic, as he is now working in other subject here. So, I can open a new issue, but it would be basically the same described here.

As you mentioned, the threshold is not used in the example provided. Actually it is just a small snippet of a major function ... this example could be reduced to:

rom fuzzysearch import find_near_matches

def fuzzy_extract(qs, ls):
    for match in find_near_matches(qs, ls, max_l_dist=10):
        match = ls[match.start:match.end]
        index = ls.find(match)
        yield (match, index)

ref = "bruna ficava perto da casa de fátima e brincava de bola perto da lagoa"

ret = list(fuzzy_extract("d dê néká m",ref))
print("fuzzy_search: "+str(ret))

Using this example, we would like to find a pattern of length 11 with max_l_dis=10. Actually we need to use this "weird" configuration, where pattern length and max_l_dist are close because the text and the pattern usually are a bit different (the text comes from one source of information and the pattern from another one, and they produce data with different qualities).

The issue we didn't understand is that the index return by find_near_matches varies between consecutive executions even using the same input. We thought the algorithm would be deterministic, right? But we observed those unexpected results and we would like to ask you help to try to understand it.