The C implementation raises an exception on certain accented Unicode characters. This is with jellyfish 0.5.6 and Python 3.5.2 on Linux. The Python implementation, however, seems to handle the same input fine.
>>> import jellyfish
>>> jellyfish.damerau_levenshtein_distance('a', 'Č')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Encountered unsupported code point in string.
Although it may be a bit hacky, is there any interest in a PR for a dynamic Python fallback? If the input strings contain no unsupported Unicode characters, it would use the C implementation; otherwise it would gracefully fall back to Python instead of throwing an exception. Looking through the source, it seems pretty easy to implement, and it would grant a fair degree of robustness in exchange for slightly less predictable performance.
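To make the proposal concrete, here is a rough sketch of what I have in mind. It is not jellyfish code: `with_fallback`, `fake_c_distance`, and `pure_distance` are hypothetical names, the "C" function is only a stand-in that rejects code points above 255 (an assumption chosen for illustration, not the library's actual limit), and the pure-Python function is a plain Levenshtein distance rather than the real Damerau-Levenshtein implementation.

```python
import functools


def with_fallback(fast_impl, pure_impl):
    """Wrap fast_impl so that a ValueError triggers a retry with pure_impl."""
    @functools.wraps(fast_impl)
    def wrapper(*args, **kwargs):
        try:
            return fast_impl(*args, **kwargs)
        except ValueError:
            # The fast path rejected the input; retry with the slower version.
            return pure_impl(*args, **kwargs)
    return wrapper


def pure_distance(s1, s2):
    """Stand-in for the pure-Python implementation (plain Levenshtein)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (c1 != c2)))   # substitution
        prev = cur
    return prev[-1]


def fake_c_distance(s1, s2):
    """Stand-in for the C implementation; the 255 cutoff is an assumption."""
    if any(ord(ch) > 255 for ch in s1 + s2):
        raise ValueError("Encountered unsupported code point in string.")
    return pure_distance(s1, s2)


distance = with_fallback(fake_c_distance, pure_distance)
print(distance('a', 'Č'))
```

One caveat with catching `ValueError` this broadly: it would also mask a `ValueError` raised for any other reason, so a real patch might prefer to check the input up front or have the C code raise a more specific exception.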