Closed kevinrue closed 7 years ago
Hi Kevin,
I'll do as you ask soon! I was waiting to see if there is any interest before investing more time in this library.
BTW posting this as an issue is great :)
Great! Thanks a million. The code seems to do what you described so I'm happy, but a short explanation and example of why to use each of these functions would be perfect.
Also, I don't want to put too much work on you, but in fact I was originally looking for a function which, instead of the edit distance, is able to differentiate between insertions, deletions, and substitutions. For instance, I want to know if a sequence has a match in another with a maximum of I insertions, D deletions, and S substitutions allowed. I am not sure how to optimise such a function, and I will be using it millions of times on biology data, so the faster the better. There is a Perl package which does that, if you are curious (http://search.cpan.org/~jhi/String-Approx-3.26/Approx.pm), but I cannot understand everything they do, so I can't translate it into Python myself. I haven't found any existing Python equivalent (asked on Stack Overflow: http://stackoverflow.com/questions/22328884/python-equivalent-of-the-perl-stringapprox-amatch-function).
Are you interested in that? Would you agree to my posting this as an enhancement request for your package? I understand if you have other priorities, or if you think this package is not the right place for this feature. Thanks in advance, no matter the answer!
Hmm, interesting :)
For now, you could just use fuzzysearch with max_l_dist set to something appropriate, and then filter the results using one of the existing Levenshtein distance libraries, some of which return the number of required replacements, deletions and insertions. If you do this, I'd be interested to see how you did it, and perhaps add such code as a utility in this library.
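The filtering step suggested above could look something like the following sketch. It counts substitutions, insertions and deletions along one optimal alignment using the standard Wagner-Fischer dynamic-programming table plus a backtrace; the function name is illustrative and not part of fuzzysearch or any other library.

```python
def edit_operation_counts(source, target):
    """Return (substitutions, insertions, deletions) along one optimal
    alignment of source into target (illustrative helper, not a library API)."""
    m, n = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match / substitution
    # Walk back from the corner, counting each type of edit operation.
    subs = ins = dels = 0
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and source[i - 1] == target[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            i, j = i - 1, j - 1          # exact match, no edit
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            subs += 1
            i, j = i - 1, j - 1
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            ins += 1
            j -= 1
        else:
            dels += 1
            i -= 1
    return subs, ins, dels
```

One could then keep only those fuzzysearch matches whose counts satisfy subs <= S, ins <= I and dels <= D, which is exactly the original request.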
Feel free to add this as another enhancement request!
I did integrate your fuzzysearch into my pipeline as a temporary fix for now. But since your only example was using find_near_matches_with_ngrams(), I used that one :)
As you suggest, I will use find_near_matches() with max_l_dist.
Oh... I hadn't realised that some libraries return the number of replacements and indels. I will look for one of those packages. (If it works, it means I'll probably stop using yours... sorry ^^) But I will let you know how I did it if it works that way!
I'll add the enhancement issue now, thanks!
I wrote this library because, as far as I could tell, all other libraries allow fuzzy comparison, but not fuzzy search. The distinction is important in several ways, especially when searching through long sequences.
If what you need is just to compare strings, then by all means use one of the other existing libraries! But if you are actually searching through sequences, then use fuzzysearch for the initial search and then compare the results to the pattern you searched for using an existing library.
P.S. It depends on the length of the pattern you're searching for and the maximum allowed distance, but you should usually use the _with_ngrams variant. I'm working on a utility function which chooses a suitable implementation based on the given parameters; it's already working and tested, so it will be up very soon. Then just use fuzzysearch.find_near_matches as in the example (also to be added soon).
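To make the search-versus-comparison distinction concrete, here is a deliberately naive sketch of what fuzzy search does: slide over the sequence and keep every window within max_l_dist of the pattern. This is only to illustrate the idea; the function names are made up, and fuzzysearch's actual implementations are far more efficient.

```python
def levenshtein(a, b):
    """Plain edit distance between two strings (rolling-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # match / substitution
        prev = cur
    return prev[-1]

def naive_find_near_matches(pattern, sequence, max_l_dist):
    """Brute-force fuzzy search: every (start, end, text) window whose
    edit distance to the pattern is at most max_l_dist."""
    matches = []
    for start in range(len(sequence)):
        # Matching windows may be shorter or longer than the pattern
        # by up to max_l_dist characters.
        for length in range(max(0, len(pattern) - max_l_dist),
                            len(pattern) + max_l_dist + 1):
            window = sequence[start:start + length]
            if levenshtein(pattern, window) <= max_l_dist:
                matches.append((start, start + length, window))
    return matches
```

A fuzzy-comparison library only gives you the distance between two whole strings; the search problem additionally has to locate candidate windows, which is what makes it worth a dedicated library.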
I've added some documentation in v0.2.0. There's still a lot of undocumented code here, so I'm leaving this open.
To answer your first comment above, I want to point out that your package is indeed the closest to what I want. I don't want the edit distance between string1 and string2; I want to know whether there are matches with fewer changes than a threshold. As you say, I believe your code is the only one that takes this approach.
Re your latest comment: thanks for that! For the remaining undocumented code, I will try to understand the tough bits by running it on sample variables from the command line and seeing what they do ;)
For the record, fuzzysearch now allows setting specific limits on the number of allowed insertions, deletions and/or substitutions, in addition to the maximum Levenshtein distance. It also uses various optimizations in suitable scenarios, e.g. when only substitutions are allowed.
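The substitutions-only case mentioned above reduces to a sliding Hamming-distance scan over fixed-length windows, which is why it can be so much faster. The following is an illustrative sketch of that kind of special-case optimization, not fuzzysearch's actual implementation:

```python
def find_matches_substitutions_only(pattern, sequence, max_substitutions):
    """All (start, end, mismatches) windows of len(pattern) that differ
    from the pattern in at most max_substitutions positions."""
    k = len(pattern)
    matches = []
    for start in range(len(sequence) - k + 1):
        mismatches = 0
        for pc, sc in zip(pattern, sequence[start:start + k]):
            mismatches += pc != sc
            if mismatches > max_substitutions:
                break  # bail out early once the window cannot match
        else:
            matches.append((start, start + k, mismatches))
    return matches
```

Because window length is fixed and the inner loop can abort at the first excess mismatch, this avoids the full dynamic-programming table that general Levenshtein matching requires.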
I'm still leaving this open since the documentation still needs to be improved.
Closing as the documentation has been improved to the point where the original request has been addressed.
I very much appreciate the effort of coding this package. However, rather than spending hours guessing from the code, I wish there were a minimal explanation of the difference between find_near_matches and find_near_matches_with_ngrams.
I wanted to post this as a comment somewhere, but I couldn't find where. However, it works well as an issue too!
Many thanks!