vi3k6i5 / flashtext

Extract Keywords from sentence or Replace keywords in sentences.
MIT License

span_info on combined unicode character(s) #81

Open kkaiser opened 5 years ago

kkaiser commented 5 years ago

Hello,

I encountered an issue with span_info=True when it is used on a string containing combined characters. As a demonstration, consider the following example:

import re

from flashtext import KeywordProcessor
from unicodedata import normalize
from unidecode import unidecode

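# Only '_' counts as a word character, so 'afa' can match inside 'muhafaza'
s = KeywordProcessor()
s.set_non_word_boundaries('_')
k = 'afa'
s.add_keyword(k)

t = 'İlgili muhafaza'
t2 = unidecode(t)         # ASCII transliteration
t3 = normalize('NFD', t)  # decomposed form: 'I' followed by a combining dot above
r = s.extract_keywords(t, span_info=True)
r2 = s.extract_keywords(t2, span_info=True)
r3 = s.extract_keywords(t3, span_info=True)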

(
    t,                     # ('İlgili muhafaza',
    len(t),                # 15,
    r,                     # [('afa', 11, 14)],
    t[r[0][1]:r[0][2]],    # 'faz',
    re.search(k, t),       # <re.Match object; span=(10, 13), match='afa'>,
    t2,                    # 'Ilgili muhafaza',
    len(t2),               # 15,
    r2,                    # [('afa', 10, 13)],
    t2[r2[0][1]:r2[0][2]], # 'afa',
    t3,                    # 'İlgili muhafaza',
    len(t3),               # 16,
    r3,                    # [('afa', 11, 14)],
    t3[r3[0][1]:r3[0][2]], # 'afa')
)

The expected behaviour is that the span start and end match those returned by re, without having to normalise the string first. The issue is especially annoying when the returned start or end is greater than len(string).
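
A possible explanation, offered as an assumption rather than something confirmed in this thread: if flashtext lowercases the input before matching (which appears to be the case when case_sensitive is False), the spans would refer to the lowercased string. In Python, 'İ' (U+0130) lowercases to two code points, so every later offset shifts by one. The sketch below reproduces the shift with the standard library only. The helper to_original_span is hypothetical and not part of flashtext; it maps a span in text.lower() back to the original string.

```python
import re

t = '\u0130lgili muhafaza'  # the string from the example; U+0130 is 'İ'
lowered = t.lower()         # U+0130 lowercases to 'i' + U+0307 (two code points)

print(len(t), len(lowered))              # 15 16
print(re.search('afa', t).span())        # (10, 13)
print(re.search('afa', lowered).span())  # (11, 14)


def to_original_span(text, start, end):
    """Hypothetical helper: map a span in text.lower() back to text."""
    offsets = []
    for i, ch in enumerate(text):
        # One entry per lowercased code point, each pointing at its source index.
        offsets.extend([i] * len(ch.lower()))
    offsets.append(len(text))  # lets an end offset equal to len(text.lower()) resolve
    return offsets[start], offsets[end]


print(to_original_span(t, 11, 14))       # (10, 13)
```

As a workaround under the same assumption, passing a span such as r[0][1], r[0][2] through the helper would recover offsets that agree with re on the original string.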