Strange case for token_set_ratio with Thai language

YosuaMichael commented 4 years ago

Hi,

First of all, thanks for the library! It is really helpful for various string-matching tasks.

I use the library for various languages in South East Asia and it mostly works well. However, I ran into a strange case in Thai:

fuzz.token_set_ratio('ป้ารัตน์ หน้าโรงเรียนมารีย์','ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์')

The call above somehow returns 100 (a perfect match), although the two strings are clearly very different.

Is it a bug? Or is there an explanation for why it behaves like that?

Thanks!

maxbachmann commented 4 years ago

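@YosuaMichael This is caused by full_process. https://github.com/seatgeek/fuzzywuzzy/blob/2188520502b86375cf2610b5100a56935417671f/fuzzywuzzy/string_processing.py#L21 It replaces every non-word character (anything matching \W) with whitespace. However, \W also matches the Thai combining marks (tone marks and the vowel signs written above the consonant), so those are replaced as well and the words get split apart: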

> import re
> regex = re.compile(r"(?ui)\W")
> regex.sub(" ", 'ป้ารัตน์ หน้าโรงเรียนมารีย์')
'ป าร ตน  หน าโรงเร ยนมาร ย '
> regex.sub(" ", 'ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์')
'ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย '

After the substitution, ย appears as a standalone word in both strings. Every token of the second string is then contained in the first, so token_set_ratio returns a 100% match.
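To see the effect without fuzzywuzzy installed, here is a minimal sketch using only the standard library. It uses the same pattern as above, prints the Unicode category of each character in the first string that \W matches, and then compares the resulting token sets. The second string is shortened to three repetitions for readability, which does not change its token set.

```python
import re
import unicodedata

non_word = re.compile(r"(?ui)\W")  # same pattern as quoted above

s1 = 'ป้ารัตน์ หน้าโรงเรียนมารีย์'
s2 = 'ย์' * 3  # shortened stand-in for the repeated string in the report

# Characters in s1 that \W matches, with their Unicode general category.
# Besides the space (Zs), these are combining marks (Mn).
replaced = {ch: unicodedata.category(ch) for ch in s1 if non_word.match(ch)}
for ch, cat in sorted(replaced.items()):
    print(f"U+{ord(ch):04X} {cat}")

# Token sets after the substitution, split on whitespace
tokens1 = set(non_word.sub(" ", s1).split())
tokens2 = set(non_word.sub(" ", s2).split())

# The only token of s2 also occurs in s1
print(tokens1 & tokens2)
```

Since the second token set is a subset of the first, one of the comparisons token_set_ratio makes is between two identical strings, which is where the 100 comes from.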

It works when you do not use full_process

> fuzz.token_set_ratio('ป้ารัตน์ หน้าโรงเรียนมารีย์','ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์ย์', full_process=False)
11
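Note that full_process=False also skips lowercasing and punctuation removal, which may matter for mixed Thai/Latin input. One possible workaround is to preprocess the strings yourself and then pass full_process=False. The sketch below is an assumption, not part of fuzzywuzzy: keep_marks_process is a hypothetical helper that keeps letters, digits, and combining marks (Unicode categories starting with M), replaces everything else with a space, lowercases, and collapses whitespace.

```python
import unicodedata

def keep_marks_process(s):
    """Hypothetical stand-in for full_process that preserves combining marks."""
    chars = []
    for ch in s:
        # Keep alphanumerics and combining marks (Mn, Mc, Me); blank out the rest
        if ch.isalnum() or unicodedata.category(ch).startswith('M'):
            chars.append(ch)
        else:
            chars.append(' ')
    # Lowercase and collapse runs of whitespace
    return ' '.join(''.join(chars).lower().split())

p1 = keep_marks_process('ป้ารัตน์ หน้าโรงเรียนมารีย์')
p2 = keep_marks_process('ย์' * 3)  # shortened stand-in for the repeated string

# With the marks preserved, the two strings no longer share a token
print(set(p1.split()) & set(p2.split()))

# Assuming fuzzywuzzy is installed, the processed strings could then be compared with:
# fuzz.token_set_ratio(p1, p2, full_process=False)
```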

YosuaMichael commented 4 years ago

@maxbachmann Ah thanks a lot!

I didn't know about the full_process=False parameter. I guess I will use it for non-Latin characters, and it should fix my problem.

seatgeek / fuzzywuzzy

Strange case for token_set_ratio with Thai language #270