vi3k6i5 / flashtext

Extract Keywords from sentence or Replace keywords in sentences.
MIT License
5.57k stars 599 forks source link

span_info on combined unicode characters #82

Open kkaiser opened 5 years ago

kkaiser commented 5 years ago

This fixes issue: #81

Lowering a sentence with combined unicode chars changes the length of a sentence.

s = 'İ love Big Apple and Bay Area.'
len(s)  # 30
len(s.lower())  # 31

Lowering keywords and search sentence now works on a per char basis to return the correct span_info

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('Big Apple', 'New York')
keyword_processor.add_keyword('Bay Area')
keyword_processor.add_keyword('İ love')
s = 'İ love Big Apple and Bay Area.'
keywords_found = keyword_processor.extract_keywords(s, span_info=True)
keywords_found
# old: [('İ love', 0, 7), ('New York', 8, 17), ('Bay Area', 22, 30)]
# new: [('İ love', 0, 6), ('New York', 7, 16), ('Bay Area', 21, 29)]
for k in keywords_found:
    print(s[k[1]:k[2]])
# new: İ love
# old: İ love
# new: Big Apple
# old: ig Apple
# new: Bay Area
# old: ay Area.
coveralls commented 5 years ago

Coverage Status

Coverage increased (+0.02%) to 99.327% when pulling 40633c9e92bbba581c3a13c4ff03ddbae449d4ae on kkaiser:master into 50c45f1f4a394572381249681046f57e2bf5a591 on vi3k6i5:master.

coveralls commented 5 years ago

Coverage Status

Coverage increased (+0.02%) to 99.327% when pulling 40633c9e92bbba581c3a13c4ff03ddbae449d4ae on kkaiser:master into 50c45f1f4a394572381249681046f57e2bf5a591 on vi3k6i5:master.

vi3k6i5 commented 4 years ago

Moving the check inside the loop ads a little of execution overhead, please share a test case for the change. Thanks

vi3k6i5 commented 4 years ago

Can you please resolve the conflict.

kkaiser commented 4 years ago

Ready for review

laurenegerton commented 3 years ago

This is still happening with flashtext-2.7 Looks like the fix was never merged with master...

spencertollefson commented 3 years ago

@vi3k6i5 - Hi there! Is this ready for and can you complete merging this PR?