kootenpv / textsearch

Find strings/words in text; convenience and C speed :fireworks:

Error with Foreign Characters in findall Method #8

Closed emremrah closed 11 months ago

emremrah commented 1 year ago

When using the TextSearch module in the provided code, an error occurs when attempting to find all occurrences of a word with Turkish characters. The code encounters an exception specifically with the Turkish characters "İ" and "ı".

Steps to Reproduce:

1. Run the following code snippet:

        import string

        from textsearch import TextSearch

        chars = string.ascii_letters + 'ıİöÖüÜşŞçÇğĞ'

        ts = TextSearch(case='insensitive', returns='match')
        ts.add(['test'])

        # Prepend each character to 'test'; a whole-word search should find nothing.
        for c in chars:
            try:
                r = ts.findall(c + 'test')
            except Exception as e:
                print('Error:', c, e)
                continue
            if r:
                print(c)

Output:

    ı
    Error: İ string index out of range
    ö
    Ö
    ü
    Ü
    ş
    Ş
    ç
    Ç
    ğ
    Ğ


**Expected Behavior:**

None of these `.findall` calls should return a match, since `test` is always preceded by a letter. However, `TextSearch` does return a match whenever a Turkish character is prepended to the word. It also raises an exception for the uppercase dotted İ.
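A possible explanation for the `string index out of range` error, sketched below in plain Python: the default `str.lower()` turns "İ" (U+0130) into two code points, so the lowercased text is longer than the original. This assumes the library lowercases the text for case-insensitive matching and then reuses the offsets on the original string; that is an assumption and has not been checked against the textsearch source.

```python
# Plain-Python illustration; it does not call textsearch.
upper_dotted_i = "İ"  # U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = upper_dotted_i.lower()

# lower() yields "i" followed by U+0307 (COMBINING DOT ABOVE): two code points.
print(len(upper_dotted_i), len(lowered))
print([hex(ord(ch)) for ch in lowered])

text = "İtest"
lowered_text = text.lower()

# Offsets found in the lowercased text are shifted by one relative to `text`.
start = lowered_text.find("test")
end = start + len("test")
print(start, end, len(text))

# Assumption: if such an offset is used to index the original text,
# it can point past the end of the string.
try:
    text[end - 1]
except IndexError as exc:
    print("IndexError:", exc)
```

A Turkish-aware lowercasing step that maps "İ" to a single "i" keeps both strings the same length, which would avoid the mismatch.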

**Additional Info**

Python 3.9, Ubuntu 22.04

emremrah commented 1 year ago

I found that the false matches occur because `left_bound_chars` (and `right_bound_chars`) contain only alphanumeric characters and no foreign characters, so a Turkish letter next to the word is treated as a word boundary. I had to add my language's special characters to these sets. However, the problem with the uppercase İ persists.
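To illustrate the boundary problem, here is a minimal sketch that does not use textsearch at all. The helper `is_whole_word` and both character sets are hypothetical; the assumption is that the default bound characters are ASCII letters and digits.

```python
import string

# Assumption: the default bound characters are ASCII alphanumerics only.
ASCII_BOUND_CHARS = set(string.ascii_letters + string.digits)
TURKISH_CHARS = set("ıİöÖüÜşŞçÇğĞ")


def is_whole_word(text, start, end, bound_chars):
    """Return True if text[start:end] has no 'word' character directly next to it."""
    left_ok = start == 0 or text[start - 1] not in bound_chars
    right_ok = end == len(text) or text[end] not in bound_chars
    return left_ok and right_ok


text = "ötest"
start = text.find("test")
end = start + len("test")

# With ASCII-only bounds, "ö" is not a word character, so "test" looks standalone.
print(is_whole_word(text, start, end, ASCII_BOUND_CHARS))

# Once the Turkish letters are added, "ö" blocks the match as expected.
print(is_whole_word(text, start, end, ASCII_BOUND_CHARS | TURKISH_CHARS))
```

The first call prints `True` (a false match) and the second prints `False`, which mirrors the effect of extending the bound character sets.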

kootenpv commented 1 year ago

Maybe we also need to add the uppercase characters for Turkish, then?

Given that, we could perhaps add all the Turkish characters by default, since there should be no cost to having more characters in the sets.

emremrah commented 1 year ago

Fixed the uppercase i (İ) using https://github.com/emre/unicode_tr. Now this works:

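    from textsearch import TextSearch
    from unicode_tr import unicode_tr

    # TURKISH_CHARS is a set of Turkish-specific letters, e.g. set('ıİöÖüÜşŞçÇğĞ')
    ts = TextSearch(case='insensitive', returns='match')
    ts.left_bound_chars = ts.left_bound_chars.union(TURKISH_CHARS)
    ts.right_bound_chars = ts.right_bound_chars.union(TURKISH_CHARS)

    # Wrap the text so that lowercasing follows Turkish rules (İ -> i, I -> ı)
    text = unicode_tr(text)

    ts.findall(text)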

Yes, you may not be able to add every Unicode character, but adding the special characters of some widely used languages would be a good idea.

Thanks!

kootenpv commented 1 year ago

I can run an experiment to see whether adding TURKISH_CHARS makes it slower - but if not - we might as well add them to the left/right bound chars by default

Unless these characters could have a different meaning in another language 🤔
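A rough way to check the cost, as a sketch only: the snippet below times plain set membership with and without the extra characters. It does not measure textsearch itself, and the names `base`, `extended` and `count_bound` are made up for this example.

```python
import string
import timeit

# Assumption: the default bound characters are ASCII letters and digits.
base = set(string.ascii_letters + string.digits)
extended = base | set("ıİöÖüÜşŞçÇğĞ")

# Sample text mixing ASCII and Turkish letters, repeated to get a measurable run time.
text = "ötest İstanbul ışık test " * 1000


def count_bound(chars):
    """Count the characters of `text` that belong to `chars`."""
    return sum(1 for ch in text if ch in chars)


for name, chars in [("base", base), ("extended", extended)]:
    seconds = timeit.timeit(lambda: count_bound(chars), number=20)
    print(f"{name:8s} {len(chars):3d} chars  {seconds:.4f}s")
```

Set lookups are hash-based, so the per-character cost should not depend on the size of the set; actual numbers will vary by machine.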