vi3k6i5 / flashtext

Extract Keywords from sentence or Replace keywords in sentences.
MIT License
5.58k stars · 598 forks

bug #44

Open chenkovsky opened 6 years ago

chenkovsky commented 6 years ago
len("İ") # 1
len("İ".lower()) # 2

This causes a `string index out of range` error in flashtext.
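For reference, the length change comes from Unicode case mapping: Turkish İ (U+0130) lowercases to a two-codepoint sequence in Python 3. A minimal standard-library sketch of the mismatch:

```python
import unicodedata

s = "İ"  # LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130)
low = s.lower()

print(len(s), len(low))  # 1 2
print([unicodedata.name(c) for c in low])
# ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']
```

Any code that indexes the original string by positions computed on the lowered string can walk past the end.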

vi3k6i5 commented 6 years ago

@chenkovsky Can you give a more detailed example. Something that I can run.

Thanks

chenkovsky commented 6 years ago

Sorry, I no longer have the code. It appears only in case-insensitive mode: flashtext lowercases the string and assumes the original string and its lowercase version have the same length, but that is not always true.

vi3k6i5 commented 6 years ago

Ok.. that I can fix. Cool.. thanks..


bobthekingofegypt commented 6 years ago

I ran into this problem today as well. I'm working around it by preprocessing the text before passing it to replace_keywords; this is fine for me because case doesn't matter in my program right now.

A runnable example:

from flashtext import KeywordProcessor
KeywordProcessor().replace_keywords("İstanbul")

The lowercased version has one more codepoint than the original.

This causes:

File ".../python3.6/site-packages/flashtext/keyword.py", line 665, in replace_keywords
    current_word += orig_sentence[idy]
IndexError: string index out of range

as orig_sentence is assumed to be the same length as the lower case version.
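For anyone who needs the same workaround: pre-lowercasing the text yourself before handing it to flashtext makes the library's internal lower() a no-op, so the "original" and lowered strings it walks have the same length. A hedged sketch of the idea (it assumes, as above, that an all-lowercase result is acceptable; `pre_lower` is a hypothetical helper, not part of flashtext):

```python
def pre_lower(text):
    """Lowercase once up front so a later lower() does not change the length.

    For the characters reported in this thread, lower() applied twice gives
    the same string as applying it once, so after this preprocessing step
    len(result) == len(result.lower()) holds and the index mismatch is gone.
    """
    return text.lower()

safe = pre_lower("İstanbul")
assert len(safe) == len(safe.lower())  # lengths now agree
```

The lowered text can then be passed to replace_keywords without triggering the IndexError.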

LukiSBB commented 6 years ago

I ran into this problem recently, with exactly the same letter İ.

İ lowercased results in two codepoints: i followed by a combining dot above (U+0307).

xokocodo commented 5 years ago

I ran into the problem as well. Same character. Fun stuff.

ibobak commented 5 years ago

Guys, I faced the same bug and fixed it on my own because I couldn't wait for Vikash to do it.

To Vikash: thank you very much for this excellent library. Feel free to use my fix if you find it suitable. Here is the code:

    def replace_keywords(self, a_sentence):
        """Searches in the string for all keywords present in corpus.
        Keywords present are replaced by the clean name and a new string is returned.

        Args:
            a_sentence (str): Line of text where we will replace keywords

        Returns:
            new_sentence (str): Line of text with replaced keywords

        Examples:
            >>> from flashtext import KeywordProcessor
            >>> keyword_processor = KeywordProcessor()
            >>> keyword_processor.add_keyword('Big Apple', 'New York')
            >>> keyword_processor.add_keyword('Bay Area')
            >>> new_sentence = keyword_processor.replace_keywords('I love Big Apple and bay area.')
            >>> new_sentence
            'I love New York and Bay Area.'

        """
        if not a_sentence:
            # if sentence is empty or none just return the same.
            return a_sentence
        new_sentence = []

        if not self.case_sensitive:
            sentence = a_sentence.lower()
            # by Ihor Bobak:
            # some letters can expand in size when lower() is called, therefore we preprocess
            # a_sentence to find those letters which lower() to 2 or more symbols.
            # So, imagine that X lowers to yz while the rest lower one-to-one: A->a, B->b, C->c.
            # Then for the string ABCXABC we want to get
            # ['A', 'B', 'C', 'X', '',  'A', 'B', 'C'] which corresponds to
            # ['a', 'b', 'c', 'y', 'z', 'a', 'b', 'c'], because the code below walks the indexes
            # of the lowered string and "glues" the original string by THE SAME indexes
            orig_sentence = []
            for i in range(0, len(a_sentence)):
                char = a_sentence[i]
                len_char_lower = len(char.lower())
                for j in range(0, len_char_lower):  # in most cases it will work just one iteration and will add the same char
                    orig_sentence.append(char if j == 0 else '')  # but if it happens that X->yz, then for z it will add ''
        else:
            sentence = a_sentence
            orig_sentence = a_sentence

        current_word = ''
        current_dict = self.keyword_trie_dict
        current_white_space = ''
        sequence_end_pos = 0
        idx = 0
        sentence_len = len(sentence)
        while idx < sentence_len:
            char = sentence[idx]
            current_word += orig_sentence[idx]
            # when we reach whitespace
            if char not in self.non_word_boundaries:
                current_white_space = char
                # if end is present in current_dict
                if self._keyword in current_dict or char in current_dict:
                    # update longest sequence found
                    sequence_found = None
                    longest_sequence_found = None
                    is_longer_seq_found = False
                    if self._keyword in current_dict:
                        sequence_found = current_dict[self._keyword]
                        longest_sequence_found = current_dict[self._keyword]
                        sequence_end_pos = idx

                    # re look for longest_sequence from this position
                    if char in current_dict:
                        current_dict_continued = current_dict[char]
                        current_word_continued = current_word
                        idy = idx + 1
                        while idy < sentence_len:
                            inner_char = sentence[idy]
                            current_word_continued += orig_sentence[idy]
                            if inner_char not in self.non_word_boundaries and self._keyword in current_dict_continued:
                                # update longest sequence found
                                current_white_space = inner_char
                                longest_sequence_found = current_dict_continued[self._keyword]
                                sequence_end_pos = idy
                                is_longer_seq_found = True
                            if inner_char in current_dict_continued:
                                current_dict_continued = current_dict_continued[inner_char]
                            else:
                                break
                            idy += 1
                        else:
                            # end of sentence reached.
                            if self._keyword in current_dict_continued:
                                # update longest sequence found
                                current_white_space = ''
                                longest_sequence_found = current_dict_continued[self._keyword]
                                sequence_end_pos = idy
                                is_longer_seq_found = True
                        if is_longer_seq_found:
                            idx = sequence_end_pos
                            current_word = current_word_continued
                    current_dict = self.keyword_trie_dict
                    if longest_sequence_found:
                        new_sentence.append(longest_sequence_found)
                        new_sentence.append(current_white_space)
                        current_word = ''
                        current_white_space = ''
                    else:
                        new_sentence.append(current_word)
                        current_word = ''
                        current_white_space = ''
                else:
                    # we reset current_dict
                    current_dict = self.keyword_trie_dict
                    new_sentence.append(current_word)
                    current_word = ''
                    current_white_space = ''
            elif char in current_dict:
                # we can continue from this char
                current_dict = current_dict[char]
            else:
                # we reset current_dict
                current_dict = self.keyword_trie_dict
                # skip to end of word
                idy = idx + 1
                while idy < sentence_len:
                    char = sentence[idy]
                    current_word += orig_sentence[idy]
                    if char not in self.non_word_boundaries:
                        break
                    idy += 1
                idx = idy
                new_sentence.append(current_word)
                current_word = ''
                current_white_space = ''
            # if we are end of sentence and have a sequence discovered
            if idx + 1 >= sentence_len:
                if self._keyword in current_dict:
                    sequence_found = current_dict[self._keyword]
                    new_sentence.append(sequence_found)
                else:
                    new_sentence.append(current_word)
            idx += 1
        return "".join(new_sentence)
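The core of the fix above is the padding of orig_sentence so its indexes line up with the lowered string. That invariant can be checked in isolation; here is a standalone sketch of the same preprocessing loop (extracted for illustration, not flashtext code):

```python
def align_original(a_sentence):
    """Pad the original string with '' so it indexes like its lowercase form.

    If a character lowercases to N codepoints, emit the character once
    followed by N-1 empty strings, mirroring the loop in the fix above.
    """
    orig = []
    for char in a_sentence:
        for j in range(len(char.lower())):
            orig.append(char if j == 0 else '')
    return orig

text = "ABİAB"                           # 'İ' lowers to two codepoints
orig = align_original(text)
assert len(orig) == len(text.lower())    # indexes now line up
assert ''.join(orig) == text             # no original character is lost
```

Because both invariants hold, indexing orig_sentence by any valid index of the lowered string can no longer raise IndexError.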
ibobak commented 5 years ago

Just a small note on my fix: I ran a unit test on 2000 documents of different sizes (from 10K to 10M), performing the replacements with both my new code and the old code. There were hundreds of different replacements, and the results matched in full. I mention this so everyone knows the fix is likely correct.

AdityaSoni19031997 commented 5 years ago

@ibobak it didn't work for me (see the issue referenced above and the attached text) :)

alhague commented 5 years ago

I also had a similar issue; a quick fix was to add a check for `and idy < len(orig_sentence)` at lines 593, 615, and 665 of keyword.py.

ned2 commented 5 years ago

I just ran into this problem also. Would be great to get a fix for this @vi3k6i5 :)

kkaiser commented 4 years ago

check #82 for a fix

xpatronum commented 3 years ago

I can confirm the same error on the example below:

t = 'Hayırlı cumalar olsun ✨ Jummah mubarek ✨The Homeland \nСэ Адыгэ Хэкужъырэ сипсэ нахьи нахь ш1у сэлъэгъух ❤️❤️\nİzninle paylaştım Aytek abi'
# --- Problem ---
print(len(t))          # 136
print(len(t.lower()))  # 137
# --- Fix ---
from unicodedata import normalize
from unidecode import unidecode
t_fixed = normalize('NFD', unidecode(t))
# --- Check ---
print(len(t_fixed))          # 136
print(len(t_fixed.lower()))  # 136
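Note that unidecode is a lossy transliteration (it rewrites non-Latin text such as the Cyrillic above into ASCII), so it may not suit every input. A cheaper, non-destructive guard is to detect whether a string is affected before passing it to flashtext (a hypothetical helper, not part of flashtext):

```python
def lower_preserves_length(text):
    """Return True if lowercasing keeps the codepoint count unchanged."""
    return len(text) == len(text.lower())

assert lower_preserves_length("hello world")
assert not lower_preserves_length("İzninle paylaştım")  # contains 'İ'
```

Strings that fail the check can then be routed to whatever workaround you prefer.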
chenkovsky commented 3 years ago

If anyone is stuck on this problem, maybe try another library: https://github.com/nppoly/cyac. That library is written in Cython, its performance is better, it can save and load its data, and it supports multiple processes.

kkaiser commented 3 years ago

I posted a fix (#82) a long time ago but it was never merged, so I moved to pyahocorasick, which is also faster.