jamesturk / jellyfish

🪼 a python library for doing approximate and phonetic matching of strings.
https://jamesturk.github.io/jellyfish/
MIT License

C implementation of `damerau_levenshtein_distance` breaks on Unicode #80

Closed rectangletangle closed 7 years ago

rectangletangle commented 7 years ago

The C implementation seems to break on certain accented Unicode characters. This is with jellyfish 0.5.6 and Python 3.5.2 on Linux. The Python implementation, however, seems to work fine.

>>> import jellyfish
>>> jellyfish.damerau_levenshtein_distance('a', 'Č')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Encountered unsupported code point in string.

Although it may be a bit hacky, is there any interest in a PR for a dynamic Python fallback? If the input strings contain no unsupported code points, it would use the C implementation; otherwise it would gracefully fall back to Python instead of throwing an exception. Looking through the source, this seems pretty easy to implement, and it would add a fair degree of robustness in exchange for slightly less predictable performance.
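A minimal sketch of the fallback idea, not jellyfish's actual code. The names `with_fallback`, `fast_distance`, and `slow_distance` are hypothetical. `fast_distance` only imitates the reported error, and its cutoff of code points above 255 is an assumption. `slow_distance` is a plain optimal string alignment (restricted Damerau-Levenshtein) routine standing in for the pure-Python implementation.

```python
from functools import wraps


def with_fallback(fast, slow, errors=(ValueError,)):
    """Return a callable that tries `fast` and retries with `slow` on `errors`."""
    @wraps(slow)
    def wrapper(*args, **kwargs):
        try:
            return fast(*args, **kwargs)
        except errors:
            # The fast path rejected the input; use the slower, more general one.
            return slow(*args, **kwargs)
    return wrapper


def slow_distance(s1, s2):
    """Stand-in pure-Python path: optimal string alignment distance."""
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of s1[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of s2[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fast_distance(s1, s2):
    """Stand-in for the C path. ASSUMPTION: it rejects code points above 255."""
    if any(ord(ch) > 255 for ch in s1 + s2):
        raise ValueError("Encountered unsupported code point in string.")
    return slow_distance(s1, s2)


damerau_levenshtein_distance = with_fallback(fast_distance, slow_distance)

print(damerau_levenshtein_distance('a', 'Č'))  # prints 1 instead of raising
```

Catching only `ValueError` keeps unrelated failures (for example a `TypeError` from non-string input) visible. The trade-off is the one mentioned above: inputs that trip the fast path pay for a failed attempt plus the slower call.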

J535D165 commented 7 years ago

This is a duplicate of #55.

jamesturk commented 7 years ago

Closed in favor of #55.