RapidFuzz currently only works with strings. In contrast, https://github.com/roy-ht/editdistance calculates the edit distance between two arbitrary sequences, as long as each element is hashable. RapidFuzz should be able to calculate the edit distance between sequences of hashable objects as well.
In the common case of a string, the unnecessary hashing in editdistance is quite slow. In RapidFuzz this should be implemented in the following way:
directly use the underlying buffer when a string is used
when an array.array is used:
on CPython directly use the underlying buffer (similar to the way this is done in Cython)
on PyPy it probably has to allocate a new buffer of the corresponding size, but since the type of the elements is known it can directly use the values without any hashing (a normal Python integer might require hashing, since it could be larger than 64 bits)
for float/double the hash should be calculated, since Python only directly uses the value as hash for integral values
NumPy arrays could be handled in a similar way to array.array using the NumPy C API. This would add NumPy as both a compile-time and a runtime dependency. However, NumPy will be required for #51 anyway.
use the hashes of the elements when a different iterable is used (e.g. a list of words)
Generators could be handled in two ways (they could be left unsupported in the beginning):
1) reallocate memory while iterating (might require a lot of allocations)
2) when the generator provides a size hint, use it directly to allocate the correct amount of memory up front
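The dispatch described above could look roughly like the following pure-Python sketch. It is only an illustration of the idea, not the proposed C/C++ implementation: the names `to_int_sequence`, `hashes_from_iterable` and `edit_distance` are hypothetical, and a plain list stands in for the underlying buffer. The size hint is read with `operator.length_hint`, which falls back to 0 for plain generators.

```python
import array
import operator
from typing import Hashable, Iterable, List


def hashes_from_iterable(items: Iterable[Hashable]) -> List[int]:
    """Hash every element, preallocating when a size hint is available."""
    hint = operator.length_hint(items, 0)  # 0 when no hint is provided
    buf = [0] * hint
    n = 0
    for item in items:
        h = hash(item)
        if n < len(buf):
            buf[n] = h      # fits into the preallocated part
        else:
            buf.append(h)   # hint was missing or too small: grow
        n += 1
    del buf[n:]             # hint was too large: trim unused slots
    return buf


def to_int_sequence(seq: Iterable[Hashable]) -> List[int]:
    """Map an input to integers that can be compared directly (hypothetical helper)."""
    if isinstance(seq, str):
        # Strings: use the code points directly, no hashing.
        return [ord(ch) for ch in seq]
    if isinstance(seq, array.array):
        if seq.typecode in ("u", "w"):
            # Unicode arrays yield characters; treat them like strings.
            return [ord(ch) for ch in seq]
        if seq.typecode in ("f", "d"):
            # Floats: hash, since only integral values hash to themselves.
            return [hash(x) for x in seq]
        # Fixed-width integers: use the values as they are.
        return list(seq)
    # Any other iterable (list of words, generator, ...): hash the elements.
    return hashes_from_iterable(seq)


def edit_distance(a: Iterable[Hashable], b: Iterable[Hashable]) -> int:
    """Levenshtein distance on the converted sequences (two-row DP)."""
    s1, s2 = to_int_sequence(a), to_int_sequence(b)
    prev = list(range(len(s2) + 1))
    for i, x in enumerate(s1, 1):
        cur = [i]
        for j, y in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]
```

With this sketch, `edit_distance("kitten", "sitting")` and `edit_distance(["a", "b", "c"], ["a", "c"])` both go through the same DP, only the conversion step differs. Two caveats of the sketch: both inputs should be of the same kind, because raw values and hashes are not comparable with each other, and elements whose hashes collide would be treated as equal.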