span_info on combined unicode character(s)

Hello,

I encountered an issue with span_info=True when it is used on a string containing combined characters. As a demonstration, consider the following example:

import re

from flashtext import KeywordProcessor
from unicodedata import normalize
from unidecode import unidecode

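s = KeywordProcessor()          # case-insensitive by default
s.set_non_word_boundaries('_')  # letters no longer count as word characters, so 'afa' can match inside a word
k = 'afa'
s.add_keyword(k)

t = 'İlgili muhafaza'           # starts with U+0130 (capital I with dot above)
t2 = unidecode(t)               # ASCII transliteration
t3 = normalize('NFD', t)        # decomposed form: 'I' followed by U+0307
r = s.extract_keywords(t, span_info=True)
r2 = s.extract_keywords(t2, span_info=True)
r3 = s.extract_keywords(t3, span_info=True)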

(
    t,                     # ('İlgili muhafaza',
    len(t),                # 15,
    r,                     # [('afa', 11, 14)],
    t[r[0][1]:r[0][2]],    # 'faz',
    re.search(k, t),       # <re.Match object; span=(10, 13), match='afa'>,
    t2,                    # 'Ilgili muhafaza',
    len(t2),               # 15,
    r2,                    # [('afa', 10, 13)],
    t2[r2[0][1]:r2[0][2]], # 'afa',
    t3,                    # 'İlgili muhafaza',
    len(t3),               # 16,
    r3,                    # [('afa', 11, 14)],
    t3[r3[0][1]:r3[0][2]], # 'afa')
)

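The expected behaviour is that the span start and end match those returned by re, without having to normalise the string first. In the example above, flashtext reports (11, 14) for the original string, while re.search reports (10, 13), so slicing the original string with the flashtext span gives 'faz' instead of 'afa'. The issue is especially annoying when the returned start or end is greater than len(string).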
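
One possible explanation, stated here as an assumption and not verified against the flashtext source, is that the text is lowercased before matching when the processor is case-insensitive. In Python, lowercasing U+0130 yields two code points ('i' followed by U+0307), so every offset after it shifts by one. The sketch below shows the length change and a possible workaround. The helpers lowered_offset_map and remap_span are hypothetical names and not part of flashtext; they translate a span computed on text.lower() back to offsets in the original string.

```python
def lowered_offset_map(text):
    """For each index in text.lower(), give the index of its source character in text."""
    mapping = []
    for i, ch in enumerate(text):
        # Some characters lowercase to more than one code point, e.g. U+0130.
        mapping.extend([i] * len(ch.lower()))
    return mapping


def remap_span(text, start, end):
    """Translate a non-empty [start, end) span on text.lower() to a span on text."""
    mapping = lowered_offset_map(text)
    return mapping[start], mapping[end - 1] + 1


text = '\u0130lgili muhafaza'         # same string as t in the example above
print(len(text), len(text.lower()))   # 15 16
print(remap_span(text, 11, 14))       # (10, 13)
print(text[10:13])                    # afa
```

Applied to the span reported above, remap_span(t, 11, 14) gives (10, 13), which matches re.search. This only covers the shift introduced by lowercasing; it assumes the span was computed on text.lower() and is not a fix for the library itself.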

vi3k6i5 / flashtext, issue #81