chenkovsky opened this issue 6 years ago
@chenkovsky Can you give a more detailed example? Something that I can run.
Thanks
Sorry, I've forgotten the code. It appears only in case-insensitive mode. flashtext lowercases the string and assumes the length of the original string equals the length of the lowercased string, but that's not always true.
Ok.. that I can fix. Cool.. thanks..
I ran into this problem today as well. I'm working around it by preprocessing the text before passing it to replace_keywords; this is fine for me since case doesn't matter in my program.
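For anyone wanting the same workaround: Unicode lowercasing is idempotent, so a pre-lowered string keeps its length when flashtext lowercases it again internally. A minimal sketch (plain Python; `preprocess` is just an illustrative name):

```python
def preprocess(text):
    # lower() is idempotent, so after pre-lowering, flashtext's internal
    # lower() can no longer change the string's length and shift indexes.
    return text.lower()

safe = preprocess("İstanbul")
# the length is now stable under a second lower():
print(len(safe), len(safe.lower()))  # 9 9
```

The pre-lowered text is then passed to replace_keywords as usual; the trade-off is that the replaced output comes back all lowercase.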
A runnable example:
from flashtext import KeywordProcessor
KeywordProcessor().replace_keywords("İstanbul")
The lowercased version has one more codepoint than the original uppercase version.
Causes
File ".../python3.6/site-packages/flashtext/keyword.py", line 665, in replace_keywords
current_word += orig_sentence[idy]
IndexError: string index out of range
as orig_sentence is assumed to be the same length as the lower case version.
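The failure mode can be reproduced without flashtext at all; the snippet below mimics the indexing pattern from keyword.py (a minimal sketch, not flashtext code):

```python
s = "İstanbul"        # 'İ' is U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = s.lower()   # lowercasing expands 'İ' into two codepoints
print(len(s), len(lowered))  # 8 9

# Walking the indexes of the lowered string while indexing the original,
# as replace_keywords does, overruns the original at the last position:
try:
    for idy in range(len(lowered)):
        _ = s[idy]
except IndexError as err:
    print(err)  # string index out of range
```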
I ran into this problem recently, with exactly the same letter İ.
İ as lowercase results in two codepoints: i plus a combining dot above.
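This is easy to confirm with the stdlib unicodedata module:

```python
import unicodedata

# 'İ'.lower() produces two codepoints, not one:
for ch in "İ".lower():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0069 LATIN SMALL LETTER I
# U+0307 COMBINING DOT ABOVE
```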
I ran into the problem as well. Same character. Fun stuff.
Guys, I faced the same bug and fixed it on my own because I couldn't wait for Vikash to do it.
To Vikash: thank you very much for this excellent library. Feel free to use my fix if you find it suitable. Here is the code:
def replace_keywords(self, a_sentence):
    """Searches in the string for all keywords present in corpus.
    Keywords present are replaced by the clean name and a new string is returned.
    Args:
        sentence (str): Line of text where we will replace keywords
    Returns:
        new_sentence (str): Line of text with replaced keywords
    Examples:
        >>> from flashtext import KeywordProcessor
        >>> keyword_processor = KeywordProcessor()
        >>> keyword_processor.add_keyword('Big Apple', 'New York')
        >>> keyword_processor.add_keyword('Bay Area')
        >>> new_sentence = keyword_processor.replace_keywords('I love Big Apple and bay area.')
        >>> new_sentence
        'I love New York and Bay Area.'
    """
    if not a_sentence:
        # if the sentence is empty or None, just return it unchanged.
        return a_sentence
    new_sentence = []
    if not self.case_sensitive:
        sentence = a_sentence.lower()
        # by Ihor Bobak:
        # some letters can expand in size when lower() is called, therefore we preprocess
        # a_sentence to find those letters which lower() to 2 or more symbols.
        # So, imagine that X lowers to yz and the rest lower as-is: A->a, B->b, C->c.
        # Then for the string ABCXABC we want to get
        # ['A', 'B', 'C', 'X', '', 'A', 'B', 'C'], which corresponds to
        # ['a', 'b', 'c', 'y', 'z', 'a', 'b', 'c'], because the code below walks the indexes
        # of the lowered string and "glues" the original string by THE SAME indexes.
        orig_sentence = []
        for i in range(0, len(a_sentence)):
            char = a_sentence[i]
            len_char_lower = len(char.lower())
            for j in range(0, len_char_lower):  # in most cases this runs one iteration and adds the same char
                orig_sentence.append(char if j == 0 else '')  # but if X->yz, then for z it adds ''
    else:
        sentence = a_sentence
        orig_sentence = a_sentence
    current_word = ''
    current_dict = self.keyword_trie_dict
    current_white_space = ''
    sequence_end_pos = 0
    idx = 0
    sentence_len = len(sentence)
    while idx < sentence_len:
        char = sentence[idx]
        current_word += orig_sentence[idx]
        # when we reach whitespace
        if char not in self.non_word_boundaries:
            current_white_space = char
            # if end is present in current_dict
            if self._keyword in current_dict or char in current_dict:
                # update longest sequence found
                sequence_found = None
                longest_sequence_found = None
                is_longer_seq_found = False
                if self._keyword in current_dict:
                    sequence_found = current_dict[self._keyword]
                    longest_sequence_found = current_dict[self._keyword]
                    sequence_end_pos = idx
                # re-look for longest_sequence from this position
                if char in current_dict:
                    current_dict_continued = current_dict[char]
                    current_word_continued = current_word
                    idy = idx + 1
                    while idy < sentence_len:
                        inner_char = sentence[idy]
                        current_word_continued += orig_sentence[idy]
                        if inner_char not in self.non_word_boundaries and self._keyword in current_dict_continued:
                            # update longest sequence found
                            current_white_space = inner_char
                            longest_sequence_found = current_dict_continued[self._keyword]
                            sequence_end_pos = idy
                            is_longer_seq_found = True
                        if inner_char in current_dict_continued:
                            current_dict_continued = current_dict_continued[inner_char]
                        else:
                            break
                        idy += 1
                    else:
                        # end of sentence reached.
                        if self._keyword in current_dict_continued:
                            # update longest sequence found
                            current_white_space = ''
                            longest_sequence_found = current_dict_continued[self._keyword]
                            sequence_end_pos = idy
                            is_longer_seq_found = True
                    if is_longer_seq_found:
                        idx = sequence_end_pos
                        current_word = current_word_continued
                current_dict = self.keyword_trie_dict
                if longest_sequence_found:
                    new_sentence.append(longest_sequence_found)
                    new_sentence.append(current_white_space)
                    current_word = ''
                    current_white_space = ''
                else:
                    new_sentence.append(current_word)
                    current_word = ''
                    current_white_space = ''
            else:
                # we reset current_dict
                current_dict = self.keyword_trie_dict
                new_sentence.append(current_word)
                current_word = ''
                current_white_space = ''
        elif char in current_dict:
            # we can continue from this char
            current_dict = current_dict[char]
        else:
            # we reset current_dict
            current_dict = self.keyword_trie_dict
            # skip to end of word
            idy = idx + 1
            while idy < sentence_len:
                char = sentence[idy]
                current_word += orig_sentence[idy]
                if char not in self.non_word_boundaries:
                    break
                idy += 1
            idx = idy
            new_sentence.append(current_word)
            current_word = ''
            current_white_space = ''
        # if we are at the end of the sentence and have a sequence discovered
        if idx + 1 >= sentence_len:
            if self._keyword in current_dict:
                sequence_found = current_dict[self._keyword]
                new_sentence.append(sequence_found)
            else:
                new_sentence.append(current_word)
        idx += 1
    return "".join(new_sentence)
Just a small comment on my fix: I ran a unit test on 2000 documents of different sizes (from 10K to 10M), performing replacement with both my new code and the old code. There were hundreds of different replacements, and the results matched fully. I mention this to let everyone know the fix is likely correct.
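The core of the fix, aligning original characters to the indexes of the lowered string, can be demonstrated in isolation (align_original is a hypothetical helper name mirroring the orig_sentence preprocessing in the fix):

```python
def align_original(text):
    # For each original character, emit the character once, followed by one
    # empty string per extra codepoint its lower() expands into, so the
    # result has exactly the same indexes as text.lower().
    orig = []
    for ch in text:
        orig.append(ch)
        orig.extend([''] * (len(ch.lower()) - 1))
    return orig

lowered = "İstanbul".lower()
aligned = align_original("İstanbul")
print(len(lowered), len(aligned))  # 9 9
print("".join(aligned))            # İstanbul
```

Indexing `aligned[idy]` for any valid index of the lowered string can then never go out of range, and joining the pieces reproduces the original text.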
@ibobak didn't work for me (refer the above referenced issue and the text attached) :)
Also had a similar issue; a quick fix was to add a check "and idy < len(orig_sentence)" at lines 593, 615, and 665.
I just ran into this problem also. Would be great to get a fix for this @vi3k6i5 :)
check #82 for a fix
I can confirm the same error on the example given below:
t = 'Hayırlı cumalar olsun ✨ Jummah mubarek ✨The Homeland \nСэ Адыгэ Хэкужъырэ сипсэ нахьи нахь ш1у сэлъэгъух ❤️❤️\nİzninle paylaştım Aytek abi'
# --- Problem ---
print(len(t)) # 136
print(len(t.lower()))  # 137
# --- Fix ---
from unicodedata import normalize
from unidecode import unidecode
t_fixed = normalize('NFD', unidecode(t))
# --- Check ---
print(len(t_fixed)) # 136
print(len(t_fixed.lower()))  # 136
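Note that unidecode is lossy (it strips diacritics from the text). If you only need to detect whether a given text will trigger the bug, a non-destructive stdlib check is possible (a sketch; expanding_chars is a hypothetical name):

```python
def expanding_chars(text):
    # Characters whose lower() is more than one codepoint are exactly the
    # ones that break flashtext's index alignment in case-insensitive mode.
    return [ch for ch in text if len(ch.lower()) > 1]

print(expanding_chars("İzninle paylaştım"))  # ['İ']
print(expanding_chars("istanbul"))           # []
```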
If anyone is stuck on this problem, maybe you can try another library: https://github.com/nppoly/cyac. It is written in Cython, performs better, can save and load its data, and supports multiple processes.
I posted a fix (#82) a long time ago, but it was never merged, so I moved to pyahocorasick, which is faster too.
This will cause a "string index out of range" error in flashtext.