utils.default_process does not work the way the docs say.

rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics

https://rapidfuzz.github.io/RapidFuzz/

MIT License


utils.default_process does not work the way the docs say. #365

Closed qkxie closed 6 months ago

qkxie commented 6 months ago

The docs say that this function removes all non-alphanumeric characters. However, when I call it, it doesn't remove Chinese characters.

maxbachmann commented 6 months ago

I can't read Chinese characters, so without an example to copy and paste into Google, I can't really check whether the ones you tried are alphanumeric. My assumption is that you might be confusing "non-alphanumeric" with "non-ASCII".

qkxie commented 6 months ago

@maxbachmann Thanks for your reply. I think that "alphanumeric" means 0-9, a-z and A-Z. Any character outside those 62 is "non-alphanumeric". Therefore, any Chinese character, such as 哈, is surely non-alphanumeric.

maxbachmann commented 6 months ago

I use "alphanumeric" in much the same way it's defined for str.isalnum in Python:

>>> "哈".isalnum()
True

For specific languages you could have tighter definitions of what an alphanumeric character is. For English, for example, your definition using 0-9, a-z and A-Z would work.
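If the tighter, English-only definition is what you want, a small custom preprocessing function is one way to get it. The sketch below is only an illustration: `ascii_only_process` is a hypothetical helper, not part of RapidFuzz, and it collapses each run of removed characters into a single space, which may differ from what default_process does. A callable like this could presumably be passed wherever a `processor` argument is accepted.

```python
import re

# Anything outside 0-9, a-z, A-Z counts as "non-alphanumeric" here.
# This is the narrow ASCII definition, not Python's str.isalnum definition.
_NON_ASCII_ALNUM = re.compile(r"[^0-9a-zA-Z]+")


def ascii_only_process(text: str) -> str:
    """Hypothetical helper: keep only ASCII letters and digits.

    Each run of other characters becomes one space, then the result is
    stripped and lowercased.
    """
    return _NON_ASCII_ALNUM.sub(" ", text).strip().lower()


if __name__ == "__main__":
    print(ascii_only_process("Hello, 哈 World!"))
    # str.isalnum uses the broader Unicode definition:
    print("哈".isalnum())
```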

qkxie commented 6 months ago

Thank you. I have read Python's documentation and now understand what "alphanumeric" really means.

I hope this issue helps other people who have the same misunderstanding I did, so let me explain further.

In Python, a character c is alphanumeric if one of the following returns True: c.isalpha(), c.isdecimal(), c.isdigit(), or c.isnumeric().

str.isalpha() returns True if all characters in the string are alphabetic and there is at least one character, and False otherwise. Alphabetic characters are those defined in the Unicode character database as "Letter", i.e., those whose general category property is one of "Lm", "Lt", "Lu", "Ll", or "Lo".
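The category rule quoted above can be checked with the standard-library unicodedata module. This is a small sketch; `is_letter` and the sample characters are chosen here purely for illustration.

```python
import unicodedata

# General categories that the Python docs list as "Letter".
LETTER_CATEGORIES = {"Lm", "Lt", "Lu", "Ll", "Lo"}


def is_letter(ch: str) -> bool:
    """Return True if the single character ch has a Letter general category."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


if __name__ == "__main__":
    # 哈 is a CJK ideograph, "٣" is ARABIC-INDIC DIGIT THREE (U+0663).
    for ch in ["哈", "a", "Z", "5", "٣", "!"]:
        print(
            f"{ch!r}: category={unicodedata.category(ch)}, "
            f"isalpha={ch.isalpha()}, isalnum={ch.isalnum()}"
        )
```

For 哈 the category is "Lo" (Letter, other), which is why both isalpha() and isalnum() return True for it.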

maxbachmann commented 6 months ago

In general, I use the same definitions as Python both for lowercasing and for filtering non-alphanumeric characters. Right now the only exception is the lowercasing of U+0130. In CPython, lowercasing it produces two characters, while my implementation returns only a single character:

>>> ord("İ".lower()[0])
105
>>> ord("İ".lower()[1])
775

>>> from rapidfuzz.utils import default_process
>>> ord(default_process("İ"))
105
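The CPython side of this difference can be reproduced with the standard library alone. The sketch below only inspects what str.lower() returns for U+0130 and does not call RapidFuzz.

```python
import unicodedata

# U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE.
dotted_capital_i = "\u0130"
lowered = dotted_capital_i.lower()

if __name__ == "__main__":
    # CPython applies the full Unicode case mapping, so one code point
    # becomes two: a plain "i" followed by a combining dot above.
    print(len(lowered))
    for ch in lowered:
        print(f"U+{ord(ch):04X} ({ord(ch)}): {unicodedata.name(ch)}")
```

The code points 105 and 775 shown in the session above are U+0069 (LATIN SMALL LETTER I) and U+0307 (COMBINING DOT ABOVE).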