taleinat / fuzzysearch

Find parts of long text or data, allowing for some changes/typos.
MIT License

how to get best match without passing max_l_dist? #22

Closed: sasi143 closed this issue 4 years ago

taleinat commented 4 years ago

Hi @sasi143, I will need more information to understand your question before I can help.

An example showing what you are trying to do would be best.

sasi143 commented 4 years ago

@taleinat Thanks for your reply. When I check for a close match of my string within a given sentence, I get multiple results with max_l_dist set.

Here is the sample output I am getting:

[Match(start=26, end=32, dist=1, matched='4480-5')]
[Match(start=24, end=29, dist=1, matched='22448')]
[Match(start=26, end=30, dist=1, matched='4480')]
[Match(start=25, end=29, dist=1, matched='2448')]
[Match(start=26, end=30, dist=1, matched='4480')]
[Match(start=24, end=29, dist=1, matched='22448')]
[Match(start=25, end=29, dist=1, matched='2448')]
[Match(start=24, end=28, dist=1, matched='2244')]
[Match(start=24, end=28, dist=1, matched='2244')]
[Match(start=26, end=31, dist=1, matched='4480-')]
[Match(start=25, end=29, dist=1, matched='2448')]
[Match(start=26, end=30, dist=1, matched='4480')]
[Match(start=26, end=30, dist=1, matched='4480')]
[Match(start=24, end=29, dist=1, matched='22448')]
[Match(start=25, end=29, dist=1, matched='2448')]
[Match(start=24, end=32, dist=1, matched='224480-5')]
[Match(start=24, end=29, dist=1, matched='22448')]
[Match(start=25, end=29, dist=1, matched='2448')]

My expected output:

[Match(start=24, end=32, dist=1, matched='224480-5')]

My questions:

  1. How can I get one close match instead of multiple?
  2. Can we get the dist value as a float?
  3. Can we pass the max_l_dist value dynamically, without hardcoding values like 0, 1, 2?

taleinat commented 4 years ago

Hi @sasi143,

I am still unsure about how you are receiving such output. fuzzysearch has special code to avoid returning such overlapping results. Also, a single call to find_near_matches() will return a single list of results, but the output you've supplied includes multiple lists (each containing a single Match object).

Are you calling find_near_matches() multiple times, perhaps in a loop? Could you post the piece of code that generated this output?

sasi143 commented 4 years ago

@taleinat, yes, you are correct. I am calling find_near_matches() in a loop.

Here is my code:

from fuzzysearch import find_near_matches

for i in item_number:
    score = find_near_matches(i, "chemicals nitrogen code-224480-5g", max_l_dist=1)
    if len(score) != 0:
        print(score)

sasi143 commented 4 years ago

@taleinat The results vary as I change the max_l_dist value, but I am not sure what the right value to pass is. Could you please help me with this?

taleinat commented 4 years ago

@sasi143, for a single search, if you call find_near_matches() with a high value for max_l_dist, it will return all potential matches and you can choose the one with the lowest distance (dist) as the best match.
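For questions 2 and 3 above, one possible approach is to derive max_l_dist from the length of the search string and to rescale dist into a float afterwards. The sketch below is only an illustration: distance_budget(), similarity() and the 0.25 ratio are hypothetical names and values chosen for this example, not part of fuzzysearch.

```python
def distance_budget(pattern, ratio=0.25):
    """Hypothetical helper: allow about one edit per four pattern characters.

    The 0.25 ratio is an arbitrary assumption; tune it for your data.
    Always allows at least one edit.
    """
    return max(1, int(len(pattern) * ratio))


def similarity(dist, pattern):
    """Hypothetical helper: turn an integer edit distance into a float score.

    Returns 1.0 for an exact match and moves towards 0.0 as dist grows.
    """
    if not pattern:
        return 0.0
    return max(0.0, 1.0 - dist / len(pattern))


pattern = "224480-5"
print(distance_budget(pattern))   # edit budget derived from the pattern length
print(similarity(1, pattern))     # float score for a match with dist=1
```

With these helpers, a call could look like find_near_matches(i, text, max_l_dist=distance_budget(i)), followed by similarity(match.dist, i) for each returned match.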

In your case, you're running multiple fuzzy searches and appear to want to choose the best result. Have you tried something like this?

from fuzzysearch import find_near_matches

# Collect the matches from every search into one flat list.
matches = [
    match
    for i in item_number
    for match in find_near_matches(i, "chemicals nitrogen code-224480-5g", max_l_dist=1)
]
# Select the match with the lowest Levenshtein distance,
# and of those the one with the longest matched string.
best_result = max(matches, key=lambda match: (-match.dist, len(match.matched)))
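
To see how this selection rule behaves on the output posted earlier in this thread, here is a small self-contained sketch. It assumes nothing beyond the start, end, dist and matched fields shown in that output. The namedtuple is only a stand-in so the example runs without fuzzysearch installed, and best_match() is a hypothetical helper name.

```python
from collections import namedtuple

# Stand-in for fuzzysearch's Match objects, using the fields printed above.
Match = namedtuple("Match", ["start", "end", "dist", "matched"])


def best_match(matches):
    """Return the match with the lowest dist; ties go to the longest matched text.

    Returns None when there are no matches at all.
    """
    return max(matches, key=lambda m: (-m.dist, len(m.matched)), default=None)


# A few of the matches from the sample output in this thread.
sample = [
    Match(start=26, end=32, dist=1, matched="4480-5"),
    Match(start=24, end=29, dist=1, matched="22448"),
    Match(start=26, end=30, dist=1, matched="4480"),
    Match(start=24, end=32, dist=1, matched="224480-5"),
    Match(start=25, end=29, dist=1, matched="2448"),
]

print(best_match(sample))
# Match(start=24, end=32, dist=1, matched='224480-5')
```

All sample matches have dist=1, so the tie-break on matched length is what picks '224480-5', which is the expected output given above.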

This is a rather general programming question not directly related to fuzzysearch, and not relevant as a bug or enhancement suggestion, so I'm closing this issue.

In the future, I highly recommend getting programming help in more appropriate forums, such as the Stack Overflow Q&A website, the #python IRC channel or the python-tutor mailing list.

sasi143 commented 4 years ago

@taleinat Thank you very much for your time