rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License
2.61k stars 116 forks source link

Quick Question #343

Closed b03505036 closed 1 year ago

b03505036 commented 1 year ago

I re-write the Levenshtein_py Distance by pure Cython, I'm expecting it will be faster than the Levenshtein_py. But after I tested it, it is 9 times slower than RapidFuzz. I am not a python experienced python developer, would like to ask what makes this repo run so fast. Or how to improve my pure Cython version.

Thank youuuu!

from cpython cimport array
import numpy as np
cimport numpy as np
from libc.stdint cimport int8_t, uint8_t
from typing import Union

cdef uint8_t[::1] conv_sequence(s):
    if isinstance(s, (bytes, bytearray)):
        # Add np.copy() here
        return np.copy(np.frombuffer(s, dtype=np.uint8))
    elif isinstance(s, str):
        # And also here
        return np.copy(np.frombuffer(s.encode('utf-8'), dtype=np.uint8))
    else:
        raise ValueError("Unsupported sequence type")

cdef int _uniform_distance(uint8_t[::1] s1, uint8_t[::1] s2):
    cdef int len_s1 = len(s1)
    cdef int len_s2 = len(s2)

    if not len_s1:
        return len_s2

    cdef int VP = (1 << len_s1) - 1
    cdef int VN = 0
    cdef int currDist = len_s1
    cdef int mask = 1 << (len_s1 - 1)
    cdef int block[256]
    cdef int x = 1

    cdef uint8_t ch1
    for ch1 in s1:
        block[ch1] |= x
        x <<= 1

    cdef uint8_t ch2
    cdef int PM_j, X, D0, HP, HN
    for ch2 in s2:
        PM_j = block[ch2]
        X = PM_j
        D0 = (((X & VP) + VP) ^ VP) | X | VN
        HP = VN | ~(D0 | VP)
        HN = D0 & VP
        currDist += (HP & mask) != 0
        currDist -= (HN & mask) != 0
        HP = (HP << 1) | 1
        HN <<= 1
        VP = HN | ~(D0 | HP)
        VN = HP & D0
    return currDist

cpdef int distance(s1, s2, score_cutoff = None):
    cdef uint8_t[::1] seq1 = conv_sequence(s1)
    cdef uint8_t[::1] seq2 = conv_sequence(s2)
    cdef int dist = _uniform_distance(seq1, seq2)
    if score_cutoff is not None and dist > score_cutoff:
        return score_cutoff + 1
    return dist
maxbachmann commented 1 year ago

A couple of things: 1) it is not entirely clear to me why you would like to rewrite the python implementation in Cython. Rapidfuzz provides a fast C++ implementation (using Cython to generate the wrapper code) and a pure Python fallback implementation. This pure Python implementation is used in case the C++ version fails to compile on a system. So rewriting it in Cython would defeat the purpose 2) I assume you are benchmarking against rapidfuzz.distance.Levenshtein.distance and not rapidfuzz.distance.Levenshtein_py.distance, which is likely the C++ version. In addition the library performs a lot of optimizations to be even faster when using metrics like the Levenshtein distance with the process module (e.g. process.cdist) 3) your implementation appears like you used the Python implementation as base and tried to add type hints. Your type hints are only correct if the following conditions hold true:

At the very least these conditions should be checked, so it does not crash if they are called with invalid strings from Python. Both the C++ and Python implementation of RapidFuzz are able to handle arbitrary Unicode strings.

b03505036 commented 1 year ago

thank you for the reply. And where can I find the C++ implementation?

maxbachmann commented 1 year ago

The C++ implementation is integrated as a git submodule in extern/rapidfuzz-cpp to allow using it standalone in C++ applications. You can find it in https://github.com/maxbachmann/rapidfuzz-cpp.