jamesturk / jellyfish

🪼 a python library for doing approximate and phonetic matching of strings.
https://jamesturk.github.io/jellyfish/
MIT License

"Encountered unsupported code point in string" for damerau_levenshtein_distance #84

Closed (almcleanuk closed this issue 7 years ago)

almcleanuk commented 7 years ago

I'm finding errors raised by damerau_levenshtein_distance for code points that don't cause problems for levenshtein_distance. The following:

from jellyfish import damerau_levenshtein_distance, levenshtein_distance

cases = [
    # Each pair appears twice: once with literal characters, once with escapes.
    ('NICHOLASŸ', 'NICHOLAS'),
    ('NICHOLAS\u0178', 'NICHOLAS'),
    ('ÀUĎREY', 'GERTRUDE'),
    ('\xc0U\u010eREY', 'GERTRUDE'),
]

for a, b in cases:
    # levenshtein_distance handles all four pairs without raising.
    try:
        levenshtein_distance(a, b)
    except ValueError as e:
        print("Problem calculating levenshtein_distance between %r and %r: %s" % (a, b, e))
    # damerau_levenshtein_distance raises ValueError for each pair.
    try:
        damerau_levenshtein_distance(a, b)
    except ValueError as e:
        print("Problem calculating damerau_levenshtein_distance between %r and %r: %s" % (a, b, e))

run in Python 3.6 produces

Problem calculating damerau_levenshtein_distance between 'NICHOLASŸ' and 'NICHOLAS': Encountered unsupported code point in string.
Problem calculating damerau_levenshtein_distance between 'NICHOLASŸ' and 'NICHOLAS': Encountered unsupported code point in string.
Problem calculating damerau_levenshtein_distance between 'ÀUĎREY' and 'GERTRUDE': Encountered unsupported code point in string.
Problem calculating damerau_levenshtein_distance between 'ÀUĎREY' and 'GERTRUDE': Encountered unsupported code point in string.

J535D165 commented 7 years ago

This is a duplicate of #55 and #80. The pure-Python implementation handles these inputs correctly.
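
For anyone who needs a stopgap, a Damerau-Levenshtein distance that accepts arbitrary code points can be sketched in plain Python roughly as follows. This is an illustration only, not jellyfish's own code: the function name dl_distance is hypothetical, and the approach (the standard unrestricted algorithm, with last-seen rows kept in a dict keyed by character) is an assumption about what a code-point-agnostic version might look like. Because the lookup is a dict rather than a table indexed by code point, characters such as 'Ÿ' or 'Ď' need no special handling.

```python
def dl_distance(s1, s2):
    """Unrestricted Damerau-Levenshtein distance (hypothetical helper)."""
    len1, len2 = len(s1), len(s2)
    max_dist = len1 + len2

    # Last row of s1 in which each character was seen; a dict accepts
    # any code point, so no character is "unsupported".
    last_row = {}

    # The matrix carries an extra sentinel row and column filled with
    # max_dist, so all indices are shifted by one.
    d = [[max_dist] * (len2 + 2) for _ in range(len1 + 2)]
    for i in range(len1 + 1):
        d[i + 1][1] = i
    for j in range(len2 + 1):
        d[1][j + 1] = j

    for i in range(1, len1 + 1):
        last_col = 0  # last column in this row where the characters matched
        for j in range(1, len2 + 1):
            i1 = last_row.get(s2[j - 1], 0)
            j1 = last_col
            cost = 1
            if s1[i - 1] == s2[j - 1]:
                cost = 0
                last_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,                              # substitution
                d[i + 1][j] + 1,                             # insertion
                d[i][j + 1] + 1,                             # deletion
                d[i1][j1] + (i - i1 - 1) + 1 + (j - j1 - 1), # transposition
            )
        last_row[s1[i - 1]] = i

    return d[len1 + 1][len2 + 1]


# One deletion separates the first pair from the original report.
print(dl_distance('NICHOLAS\u0178', 'NICHOLAS'))
```

The sketch uses O(len(s1) * len(s2)) memory and is much slower than a compiled implementation, so it is only suitable for small inputs or for checking results.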

almcleanuk commented 7 years ago

Thanks. Didn't spot that. I've left a comment on #55.