taleinat / fuzzysearch

Find parts of long text or data, allowing for some changes/typos.
MIT License

Better documentation please? #1

Closed kevinrue closed 7 years ago

kevinrue commented 10 years ago

I very much appreciate the effort of coding this package. However, rather than spending hours guessing from the code, I wish there were a minimal explanation of the difference between find_near_matches and find_near_matches_with_ngrams.

I wanted to post this as a comment somewhere, but I couldn't find where. However, it works well as an issue too!

Many thanks!

taleinat commented 10 years ago

Hi Kevin,

I'll do as you ask soon! I was waiting to see if there is any interest before investing more time in this library.

BTW posting this as an issue is great :)

kevinrue commented 10 years ago

Great! Thanks a million. The code seems to do what you described, so I'm happy, but a short explanation and example of when to use each of these functions would be perfect.

Also, I don't want to put too much work on you, but in fact I was originally looking for a function which, instead of the edit distance, is able to differentiate between insertions, deletions, and substitutions. For instance, I want to know if a sequence has a match in another with at most I insertions, D deletions, and S substitutions allowed. I am not sure how to optimise such a function, and I will be using it millions of times on biology data, so the faster the better. There is a Perl package which does this, if you are curious (http://search.cpan.org/~jhi/String-Approx-3.26/Approx.pm), but I cannot understand everything it does, so I can't translate it into Python myself. I haven't found any existing Python equivalent (posted on Stack Overflow: http://stackoverflow.com/questions/22328884/python-equivalent-of-the-perl-stringapprox-amatch-function).

Are you interested in that? Would you agree to my posting this as an enhancement request for your package? I can understand if you have other priorities, or if you think this package is not the right place for this feature. Thanks for your reply, no matter the answer!

taleinat commented 10 years ago

Hmm, interesting :)

For now, you could just use fuzzysearch with max_l_dist set to something appropriate, and then filter the results using one of the existing Levenshtein distance libraries, some of which return the number of required replacements, deletions and insertions. If you do this, I'd be interested to see how you did it and perhaps add such code as a utility in this library.
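A minimal sketch of that two-step approach, assuming the python-Levenshtein package (its editops() call reports each edit's type); the helper name find_matches_with_limits is hypothetical, not part of either library:

    from collections import Counter

    import Levenshtein  # pip install python-Levenshtein
    from fuzzysearch import find_near_matches

    def find_matches_with_limits(pattern, text, max_subs, max_ins, max_dels):
        # Any match within the per-type limits has a Levenshtein distance
        # of at most the sum of the limits, so search with that bound first.
        candidates = find_near_matches(pattern, text,
                                       max_l_dist=max_subs + max_ins + max_dels)
        results = []
        for m in candidates:
            # Classify the edits between the pattern and the matched span.
            ops = Counter(op for op, _, _ in
                          Levenshtein.editops(pattern, text[m.start:m.end]))
            if (ops['replace'] <= max_subs and
                    ops['insert'] <= max_ins and
                    ops['delete'] <= max_dels):
                results.append(m)
        return results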

Feel free to add this as another enhancement request!

kevinrue commented 10 years ago

I did integrate your fuzzysearch into my pipeline as a temporary fix for now. But since your only example used find_near_matches_with_ngrams(), I used that one :) As you suggest, I will use find_near_matches() with max_l_dist.

Oh, I hadn't seen that some libraries return the number of replacements and indels. I will look for one of those packages. (If it works, it means I'll probably stop using yours... sorry ^^) But I will let you know how I did it if it works that way!

I'll add the enhancement issue now, thanks!

taleinat commented 10 years ago

I wrote this library because, as far as I could tell, all other libraries allow fuzzy comparison, but not fuzzy search. The distinction is important in several ways, especially when searching through long sequences.

If what you need is just to compare strings, then by all means use one of the other existing libraries! But if you are actually searching through sequences, then use fuzzysearch for the initial search and then compare the results to the pattern you searched for using an existing library.

P.S. It depends on the length of the pattern you're searching for and the maximum allowed distance, but you should usually use the _with_ngrams variant. I'm working on a utility function which chooses a suitable implementation based on the given parameters, and it's already working and tested, so it will be up very soon. Then just use fuzzysearch.find_near_matches as in the example (also to be added soon).
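Once that lands, usage should look roughly like this (a sketch; the exact repr of the returned Match objects may vary by version):

    from fuzzysearch import find_near_matches

    # One fuzzy-search entry point; a suitable implementation is
    # chosen based on the pattern length and the allowed distance.
    matches = find_near_matches('PATTERN', 'aaaPATERNaaa', max_l_dist=1)
    print(matches)  # e.g. [Match(start=3, end=9, dist=1)]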

taleinat commented 10 years ago

I've added some documentation in v0.2.0. There's still a lot of undocumented code here, so I'm leaving this open.

kevinrue commented 10 years ago

To answer your first comment above, I want to point out that your package is indeed the closest to what I want. I don't want the edit distance between string1 and string2; I want to know whether there are matches with fewer changes than a threshold. As you say, I believe your code is the only one that takes this different approach.

Regarding your latest comment: thanks for that! For the remaining undocumented code, I will try to understand the tough bits by running them on sample variables from the command line and seeing what they do ;)

taleinat commented 9 years ago

For the record, fuzzysearch now allows setting specific limits on the number of allowed insertions, deletions and/or substitutions, in addition to the maximum Levenshtein distance. It also uses various optimizations in suitable scenarios, e.g. when only substitutions are allowed.
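For example, the per-edit-type limits can be combined with the overall distance cap (a quick sketch using the keyword parameters described above):

    from fuzzysearch import find_near_matches

    # Allow at most one substitution, no insertions or deletions,
    # and an overall Levenshtein distance of at most 1. With only
    # substitutions allowed, the optimized search path applies.
    matches = find_near_matches('PATTERN', '---PATCERN---',
                                max_substitutions=1,
                                max_insertions=0,
                                max_deletions=0,
                                max_l_dist=1)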

I'm still leaving this open since the documentation still needs to be improved.

taleinat commented 7 years ago

Closing as the documentation has been improved to the point where the original request has been addressed.