vi3k6i5 / flashtext

Extract Keywords from sentence or Replace keywords in sentences.
MIT License

Wrong matching result for word with accent marks #94

Open isaac47 opened 5 years ago

isaac47 commented 5 years ago

Thanks for your library, I really love it. But there is an error when using "add_keywords_from_dict" or "add_keywords_from_list": the matching result is wrong, as you can see below.

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()

keyword_processor.add_keywords_from_list(["hydro", "fran", "cam"])
keyword_processor.extract_keywords('Le groupe français va concevoir, construire et exploiter une centrale hydroélectrique au Cameroun')

Output: ['fran', 'hydro']

As you can see, it is as if the matcher cuts the word off at the accented character. This can be worked around by removing the accents first (using deaccent from gensim, for example) and then using "span_info" to recover the original word from the text at the end.
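
A minimal sketch of that workaround, under a few assumptions: accents are stripped with the standard library's unicodedata module instead of gensim's deaccent, the text is normalised to NFC so that each accented letter is a single character and offsets stay aligned, and the helper names strip_accents and extract_original_words are hypothetical (they are not part of flashtext).

```python
import unicodedata


def strip_accents(text):
    """Remove accents while keeping one output character per input character."""
    # NFC first, so each accented letter is a single code point.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        # Decompose the character and drop its combining marks.
        base = "".join(
            c for c in unicodedata.normalize("NFD", ch)
            if not unicodedata.combining(c)
        )
        # Keep the original character unless the base is exactly one
        # character, so string offsets stay aligned with the input.
        out.append(base if len(base) == 1 else ch)
    return "".join(out)


def extract_original_words(text, keywords):
    """Match keywords on accent-free text, return the words as written in text."""
    # Third-party dependency, imported here so strip_accents works without it.
    from flashtext import KeywordProcessor

    text = unicodedata.normalize("NFC", text)
    processor = KeywordProcessor()
    processor.add_keywords_from_list([strip_accents(k) for k in keywords])
    # span_info=True yields (keyword, start, end) tuples.
    matches = processor.extract_keywords(strip_accents(text), span_info=True)
    # The spans index the accent-free text, which has the same length as text.
    return [text[start:end] for _keyword, start, end in matches]
```

Calling extract_original_words(sentence, ["français"]) on the sentence above should then give back the word as it is spelled in the sentence, accent included, while short keywords such as "fran" are no longer expected to match inside the longer word.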

thakur-nandan commented 4 years ago

Hi @isaac47, please refer to issue #109, add "ç" and "é" as non-word boundaries, and run your script again. This should solve your issue.

Kind Regards, Nandan Thakur
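
For reference, a rough sketch of the non-word-boundary suggestion. Instead of hard-coding "ç" and "é", it collects every non-ASCII letter found in a sample text; that generalisation and the helper names non_ascii_letters and build_processor are assumptions made for illustration, not something flashtext provides. The add_non_word_boundary method is the flashtext call used to register each character.

```python
def non_ascii_letters(text):
    """Return the set of alphabetic characters in text outside the ASCII range."""
    return {ch for ch in text if ch.isalpha() and not ch.isascii()}


def build_processor(keywords, sample_text):
    """Build a KeywordProcessor that treats accented letters as word characters."""
    # Third-party dependency, imported here so the helper above works without it.
    from flashtext import KeywordProcessor

    processor = KeywordProcessor()
    for ch in non_ascii_letters(sample_text):
        # Register the letter and its lowercase form, since matching is
        # case-insensitive by default.
        processor.add_non_word_boundary(ch)
        processor.add_non_word_boundary(ch.lower())
    processor.add_keywords_from_list(list(keywords))
    return processor
```

With the accented letters registered this way, "fran" and "hydro" should no longer be reported for "français" and "hydroélectrique", because the character that follows each keyword is treated as part of the word.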