Closed emremrah closed 11 months ago
I found that the issue occurs because left_bound_chars (and right_bound_chars) contain only alphanumeric characters and no foreign characters. I needed to add my language's special characters to these sets. However, the issue with uppercase i (İ) persists.
Maybe we also need to add the uppercase characters for Turkish, then?
We could perhaps add all the Turkish characters by default, since there's no cost to having more characters.
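To make the boundary idea concrete, here is a minimal sketch of how a bound-character set can decide whether a hit counts as a whole word. It is not the library's actual implementation: find_bounded, ASCII_BOUND_CHARS and the TURKISH_CHARS definition are hypothetical names made up for illustration.

```python
import string

# Hypothetical stand-in for a default boundary set: ASCII letters and digits only.
ASCII_BOUND_CHARS = set(string.ascii_letters + string.digits)
# Assumed set of Turkish-specific letters, taken from the reproduction in this issue.
TURKISH_CHARS = set('ıİöÖüÜşŞçÇğĞ')


def find_bounded(text, word, bound_chars):
    """Return start indices where `word` occurs and is not adjacent to a char in `bound_chars`."""
    hits = []
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        left_ok = start == 0 or text[start - 1] not in bound_chars
        right_ok = end == len(text) or text[end] not in bound_chars
        if left_ok and right_ok:
            hits.append(start)
        start = text.find(word, start + 1)
    return hits


print(find_bounded('atest', 'test', ASCII_BOUND_CHARS))                  # []
print(find_bounded('ötest', 'test', ASCII_BOUND_CHARS))                  # [1]
print(find_bounded('ötest', 'test', ASCII_BOUND_CHARS | TURKISH_CHARS))  # []
```

With only ASCII letters in the set, 'ö' is treated like punctuation, so 'test' inside 'ötest' is reported as a standalone word. Extending the set with the Turkish letters suppresses that false match.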
Fixed the uppercase i (İ) using https://github.com/emre/unicode_tr. Now this works:
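from textsearch import TextSearch
from unicode_tr import unicode_tr

# Assumed definition; the original comment does not show how TURKISH_CHARS was built.
TURKISH_CHARS = set('ıİöÖüÜşŞçÇğĞ')

ts = TextSearch(case='insensitive', returns='match')
ts.left_bound_chars = ts.left_bound_chars.union(TURKISH_CHARS)
ts.right_bound_chars = ts.right_bound_chars.union(TURKISH_CHARS)
text = unicode_tr(text)  # Turkish-aware string wrapper from unicode_tr
ts.findall(text)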
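For anyone who would rather not add a dependency, a Turkish-aware lowercase can be sketched with the standard library alone. This is a minimal stand-in and not how unicode_tr is implemented; tr_lower and _TR_LOWER are hypothetical names, and the sketch only handles the dotted/dotless i pair.

```python
# Map the two problematic capitals explicitly before the generic lower().
_TR_LOWER = str.maketrans({'İ': 'i', 'I': 'ı'})


def tr_lower(text):
    """Lowercase `text` using Turkish rules for dotted and dotless i."""
    return text.translate(_TR_LOWER).lower()


print(tr_lower('İSTANBUL'))  # istanbul
print(tr_lower('ISPARTA'))   # ısparta
```

Note that this changes the meaning of plain 'I' for non-Turkish text, so it probably only makes sense when the input is known to be Turkish.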
Yes, you may not be able to add all the unicode characters but adding foreign characters of some popular languages would be a good idea.
Thanks!
I can run an experiment to see whether adding TURKISH_CHARS makes it slower.
If not, we might as well add them to the left/right bound chars by default.
Unless these characters could have a different meaning in another language 🤔
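A rough way to check the cost question in isolation is sketched below. It only times membership tests against a Python set, on the assumption that the bound chars are stored as sets (the .union calls earlier in the thread suggest so); it does not measure TextSearch itself, and count_bound and the sample text are made up for illustration.

```python
import string
import timeit

ascii_chars = set(string.ascii_letters + string.digits)
extended = ascii_chars | set('ıİöÖüÜşŞçÇğĞ')  # assumed TURKISH_CHARS

text = 'İstanbul test ötest ' * 1000


def count_bound(chars):
    """Count how many characters of `text` fall inside `chars`."""
    return sum(ch in chars for ch in text)


t_ascii = timeit.timeit(lambda: count_bound(ascii_chars), number=20)
t_ext = timeit.timeit(lambda: count_bound(extended), number=20)
print(f'ascii: {t_ascii:.4f}s  extended: {t_ext:.4f}s')
```

Set lookups are hash-based, so a dozen extra characters would not be expected to change the timing much, but the numbers from a real run against the library are what should decide it.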
When using the TextSearch module in the provided code, an error occurs when attempting to find all occurrences of a word with Turkish characters. The code encounters an exception specifically with the Turkish characters "İ" and "ı".
Steps to Reproduce:
import string
from textsearch import TextSearch

chars = string.ascii_letters + 'ıİöÖüÜşŞçÇğĞ'
ts = TextSearch(case='insensitive', returns='match')
ts.add(['test'])
for c in chars:
    try:
        r = ts.findall(c + 'test')
    except Exception as e:
        print('Error:', c, e)
        continue
    if r:
        print(c)
Output:
ı
Error: İ string index out of range
ö
Ö
ü
Ü
ş
Ş
ç
Ç
ğ
Ğ
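One plausible explanation for the "string index out of range" on İ, not confirmed anywhere in this thread, is that Python's default lowercasing turns İ into two code points. If offsets are computed on the lowercased text and then applied to the original, they can run past the end. The snippet below only demonstrates the standard-library behaviour:

```python
import unicodedata

lowered = 'İ'.lower()
print(len('İ'), len(lowered))  # 1 2
print([unicodedata.name(ch) for ch in lowered])
# ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

# The dotless ı uppercases to plain ASCII I, which lowercases back to dotted i.
print('ı'.upper(), 'ı'.upper().lower())  # I i
```

The second print also hints at why ı behaves oddly in case-insensitive matching: it does not survive an upper/lower round trip.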