handling umlauts

jamesturk / jellyfish

🪼 a python library for doing approximate and phonetic matching of strings.

https://jamesturk.github.io/jellyfish/

MIT License


Closed geoHeil closed 1 year ago

geoHeil commented 6 years ago

jellyfish.match_rating_codex('ä') fails with:

ValueError: character U+ffffffff is not in range [U+0000; U+10ffff]

How should umlauts be handled so that jellyfish accepts them?

jamesturk commented 4 years ago

If you import from jellyfish._jellyfish, you'll get the Python version, which handles Unicode properly.

Still unsure what to do about the C versions.

maxbachmann commented 2 years ago

@jamesturk is this still an issue? As far as I can see, this is fixed:

>>> jellyfish._jellyfish.match_rating_codex('ä')
'Ä'
>>> jellyfish.cjellyfish.match_rating_codex('ä')
'Ä'

jamesturk commented 1 year ago

added a test to confirm this is fixed & avoid regression
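
For anyone stuck on an older release that still raises the ValueError quoted at the top, one workaround that is not proposed in this thread is to strip diacritics before calling the matching function. The sketch below uses only the standard library's unicodedata module; strip_diacritics is a hypothetical helper name, not part of jellyfish. Note that this is lossy: 'ä' becomes 'a', so the resulting codex will differ from the 'Ä' shown in the fixed versions above, and characters without a decomposition (such as 'ß') pass through unchanged.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Hypothetical helper: drop combining marks so that 'ä' becomes 'a'."""
    # NFKD splits a precomposed character into a base letter followed by
    # its combining marks (e.g. 'ä' -> 'a' + U+0308).
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns a non-zero class for combining marks,
    # so filtering on it keeps only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_diacritics("ä"))       # a
    print(strip_diacritics("Müller"))  # Muller
```

The cleaned string can then be passed to jellyfish.match_rating_codex as usual. On a current release this step should not be needed, since the thread reports that umlauts are handled directly.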