Open kkaiser opened 5 years ago
Hello,
I encountered an issue with `span_info=True` when it is used on a string with combined characters. As a demonstration, consider the following example:

    import re
    from flashtext import KeywordProcessor
    from unicodedata import normalize
    from unidecode import unidecode

    s = KeywordProcessor()
    s.set_non_word_boundaries('_')
    k = 'afa'
    s.add_keyword(k)

    t = 'İlgili muhafaza'
    t2 = unidecode(t)          # ASCII transliteration
    t3 = normalize('NFD', t)   # decomposed form

    r = s.extract_keywords(t, span_info=True)
    r2 = s.extract_keywords(t2, span_info=True)
    r3 = s.extract_keywords(t3, span_info=True)

    (
        t,                        # 'İlgili muhafaza'
        len(t),                   # 15
        r,                        # [('afa', 11, 14)]
        t[r[0][1]:r[0][2]],       # 'faz'
        re.search(k, t),          # <re.Match object; span=(10, 13), match='afa'>
        t2,                       # 'Ilgili muhafaza'
        len(t2),                  # 15
        r2,                       # [('afa', 10, 13)]
        t2[r2[0][1]:r2[0][2]],    # 'afa'
        t3,                       # 'İlgili muhafaza'
        len(t3),                  # 16
        r3,                       # [('afa', 11, 14)]
        t3[r3[0][1]:r3[0][2]],    # 'afa'
    )

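
A possible explanation, offered as an assumption rather than a confirmed diagnosis: if the input is lowercased before matching (which is what the default case-insensitive mode would suggest), the spans would refer to the lowercased string rather than the original one. The standard-library sketch below does not call flashtext at all; it only shows that lowercasing this particular string changes its length, because `'İ'` (U+0130) lowercases to two code points.

```python
import re

# Same sample text as in the example; U+0130 is 'İ' (capital I with dot above).
text = "\u0130lgili muhafaza"
keyword = "afa"

# str.lower() applies full Unicode case mapping: U+0130 becomes 'i' + U+0307.
lowered = text.lower()

# Offsets of the keyword in the original and in the lowercased string.
span_original = re.search(keyword, text).span()
span_lowered = re.search(keyword, lowered).span()

print(len(text), len(lowered))      # 15 16
print(span_original, span_lowered)  # (10, 13) (11, 14)
```

The offsets in the lowercased string are shifted by one, which matches the span reported for `t` in the example. Whether this is really what happens inside `extract_keywords` would need to be checked against the library source.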
The expected behaviour is that the span start and end match those returned by `re`, without having to normalise the string first. The issue is especially annoying when the returned start or end is greater than `len(string)`.
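
Until spans are reported against the original string, one possible workaround is to translate them back. This is a minimal sketch under a loud assumption: that the reported offsets index into `text.lower()`. The helpers `lowered_to_original_offsets` and `remap_spans` are hypothetical names introduced here and are not part of flashtext.

```python
def lowered_to_original_offsets(text):
    """Map every index of text.lower() to the matching index of text.

    ASSUMPTION: the spans to be fixed were computed on text.lower().
    """
    mapping = []
    for index, char in enumerate(text):
        # A character that lowercases to n code points occupies n slots.
        mapping.extend([index] * len(char.lower()))
    # Allow an end offset that points one past the last character.
    mapping.append(len(text))
    return mapping


def remap_spans(text, spans):
    """Translate (keyword, start, end) tuples back onto the original text."""
    mapping = lowered_to_original_offsets(text)
    return [(kw, mapping[start], mapping[end]) for kw, start, end in spans]


text = "\u0130lgili muhafaza"     # 'İlgili muhafaza'
reported = [("afa", 11, 14)]      # span as reported in the example
fixed = remap_spans(text, reported)
print(fixed)                      # [('afa', 10, 13)]
print(text[fixed[0][1]:fixed[0][2]])  # afa
```

The remapped span agrees with the one `re.search` returns in the example. If the assumption about lowercasing does not hold, the mapping will not be correct and a different fix is needed.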