seatgeek / fuzzywuzzy

Fuzzy String Matching in Python
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
GNU General Public License v2.0

Documentation of the scoring algorithms in fuzzywuzzy.fuzz #137

Open pengyu opened 8 years ago

pengyu commented 8 years ago

I am trying to understand the scoring algorithms implemented in fuzzywuzzy.fuzz, but I haven't found any good documentation yet. Does anyone have suggestions on how to quickly understand the differences between the algorithms? Thanks. For reference, this is what help() currently shows:

    QRatio(s1, s2, force_ascii=True)
        # q is for quick

    UQRatio(s1, s2)

    UWRatio(s1, s2)
        Return a measure of the sequences' similarity between 0 and 100,
        using different algorithms. Same as WRatio but preserving unicode.

    WRatio(s1, s2, force_ascii=True)
        Return a measure of the sequences' similarity between 0 and 100,
        using different algorithms.

    partial_ratio(*args, **kwargs)
        "Return the ratio of the most similar substring
        as a number between 0 and 100.

    partial_token_set_ratio(s1, s2, force_ascii=True, full_process=True)

    partial_token_sort_ratio(s1, s2, force_ascii=True, full_process=True)
        Return the ratio of the most similar substring as a number between
        0 and 100 but sorting the token before comparing.

    ratio(*args, **kwargs)

    token_set_ratio(s1, s2, force_ascii=True, full_process=True)

    token_sort_ratio(s1, s2, force_ascii=True, full_process=True)
        Return a measure of the sequences' similarity between 0 and 100
        but sorting the token before comparing.
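
For readers comparing the scorers listed above, here is a rough sketch of the ideas behind four of them, written against the standard library only. It is not the library's code: the `sketch_*` names and the `_normalize` helper are made up for illustration, the partial variant uses a brute-force sliding window where fuzzywuzzy picks candidate windows from matching blocks, and scores may differ from the real functions (especially when python-Levenshtein is installed instead of the difflib fallback).

```python
import re
from difflib import SequenceMatcher


def _normalize(s):
    # Hypothetical stand-in for the library's preprocessing: lowercase,
    # turn non-alphanumeric runs into spaces, collapse whitespace.
    return " ".join(re.sub(r"[^a-z0-9]+", " ", s.lower()).split())


def sketch_ratio(s1, s2):
    # Similarity of the two whole strings, scaled to 0..100.
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def sketch_partial_ratio(s1, s2):
    # Best score of the shorter string against every window of the
    # same length taken from the longer string (brute force).
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0
    n = len(shorter)
    windows = (longer[i:i + n] for i in range(len(longer) - n + 1))
    return max(sketch_ratio(shorter, w) for w in windows)


def sketch_token_sort_ratio(s1, s2):
    # Word order is ignored: tokens are sorted before comparing.
    a = " ".join(sorted(_normalize(s1).split()))
    b = " ".join(sorted(_normalize(s2).split()))
    return sketch_ratio(a, b)


def sketch_token_set_ratio(s1, s2):
    # Word order and duplicates are ignored: compare the shared tokens
    # against "shared + leftover" for each side and keep the best score.
    t1, t2 = set(_normalize(s1).split()), set(_normalize(s2).split())
    common = " ".join(sorted(t1 & t2))
    combined1 = (common + " " + " ".join(sorted(t1 - t2))).strip()
    combined2 = (common + " " + " ".join(sorted(t2 - t1))).strip()
    return max(
        sketch_ratio(common, combined1),
        sketch_ratio(common, combined2),
        sketch_ratio(combined1, combined2),
    )


if __name__ == "__main__":
    a, b = "fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"
    print(sketch_ratio(a, b), sketch_token_sort_ratio(a, b))
    print(sketch_partial_ratio("bear", "fuzzy was a bear"))
    print(sketch_token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear"))
```

In this sketch, reordering words lowers the plain ratio but leaves the token-sort score at 100, a string that appears verbatim inside a longer one gets a partial score of 100, and repeated words do not affect the token-set score. WRatio and QRatio are not covered here.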

DavidCEllis commented 7 years ago

I've started doing some of this. So far I've written docstrings for QRatio, UQRatio, UWRatio and WRatio. I don't have docs for the other methods yet, so I'm not sure whether I should open a PR now and note that it shouldn't be merged yet, or wait until I've documented the rest.

The methods I have documented rely on other methods that I haven't had a chance to dig into.

josegonzalez commented 7 years ago

@DavidCEllis any documentation is good documentation, and we can at least partially complete this issue :)

josegonzalez commented 7 years ago

Note: I don't actually use this repo, I'm just the OSS steward at SeatGeek, which is why sometimes some PRs/issues don't get answered :(

We're probably using an ancient version of fuzzywuzzy internally to boot :P

DavidCEllis commented 7 years ago

I was using this at work for a list of technical papers. I ran into the issue of exact matches not being extracted when running extract with fuzz.ratio without specifying the processor, and saw that a few issues had already been raised that appeared to be related. I'll make a PR with my current docstrings (and one additional test).