vi3k6i5 / flashtext

Extract Keywords from sentence or Replace keywords in sentences.
MIT License

Flashtext with German characters #97

Closed Salma-Bouzid closed 4 years ago

Salma-Bouzid commented 4 years ago

Given the following example:

from flashtext import KeywordProcessor

sentence = '#frühstück äck öck ßck ack'
keyword_processor = KeywordProcessor()
keyword_processor.add_keywords_from_list(['ck', 'sunny', 'kkd', 'üü'])
keyword_processor.extract_keywords(sentence, span_info=True)

Flashtext should only output exact matches based on separate words. Can someone please point me to a workaround for handling German characters?

iwpnd commented 4 years ago

Hi @Salma-Bouzid

What you're looking for are non-word boundaries: the characters that flashtext treats as part of a word, which is how it decides where a word starts and ends. Any character outside that set counts as a boundary. By default, flashtext uses the ASCII letters and digits from the string module plus the underscore:

import string
non_word_boundaries = set(string.digits + string.ascii_letters + '_')
print(non_word_boundaries)
>> {'z', 'J', 'm', 'l', '8', 'c', 'G', 'U', 's', 'D', 'y', 'A', 'I', 'K', 'E', 'L', '0', '2', '4', '3', 'w', 'O', 'p', 'f', 'v', '9', 'B', 't', 'H', 'h', 'Y', 'r', 'k', 'i', 'n', 'q', 'C', 'P', 'x', 'd', 'V', '7', 'e', 'F', 'M', 'a', 'b', 'j', 'o', '6', 'S', 'X', 'R', '_', 'T', 'W', 'Z', 'Q', 'N', 'g', 'u', '5', '1'}

or in flashtext:

keyword_processor = KeywordProcessor() 
print(keyword_processor.non_word_boundaries)
>> {'z', 'J', 'm', 'l', '8', 'c', 'G', 'U', 's', 'D', 'y', 'A', 'I', 'K', 'E', 'L', '0', '2', '4', '3', 'w', 'O', 'p', 'f', 'v', '9', 'B', 't', 'H', 'h', 'Y', 'r', 'k', 'i', 'n', 'q', 'C', 'P', 'x', 'd', 'V', '7', 'e', 'F', 'M', 'a', 'b', 'j', 'o', '6', 'S', 'X', 'R', '_', 'T', 'W', 'Z', 'Q', 'N', 'g', 'u', '5', '1'}
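
To see why the original example finds 'ck' inside the German words, it helps to check which characters of the sentence fall outside that default set. A minimal sketch using only the standard library (the name `treated_as_boundaries` is just for illustration):

```python
import string

# Same default set as shown above: ASCII digits, ASCII letters, underscore.
non_word_boundaries = set(string.digits + string.ascii_letters + '_')

sentence = '#frühstück äck öck ßck ack'

# Any character outside the set is treated as a word boundary.
treated_as_boundaries = sorted({ch for ch in sentence if ch not in non_word_boundaries})
print(treated_as_boundaries)
# → [' ', '#', 'ß', 'ä', 'ö', 'ü']
```

Because 'ü', 'ä', 'ö' and 'ß' end up in that list, the 'ck' that follows each of them looks like a separate word to the default settings.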

If you want to add German umlauts, or for that matter characters from any other alphabet, you simply update that set:

from flashtext import KeywordProcessor 
sentence = '#frühstück äck öck ßck ack'
keyword_processor = KeywordProcessor() 

# Treat German umlauts and ß as word characters rather than word boundaries.
umlaut = "ÄÜÖäüöß"
keyword_processor.non_word_boundaries.update(list(umlaut))

keyword_processor.add_keywords_from_list(['ck', 'sunny', 'kkd', 'üü'])
keyword_processor.extract_keywords(sentence, span_info=True)
>> []
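
If you don't want to type the extra characters by hand, one option is to derive them from your own keywords or text. The sketch below is not part of flashtext: `extra_word_chars` and `DEFAULT_WORD_CHARS` are hypothetical names, and it assumes that every character for which `str.isalnum()` is true should count as part of a word.

```python
import string

# The default set flashtext starts from, as shown earlier in this thread.
DEFAULT_WORD_CHARS = set(string.digits + string.ascii_letters + '_')

def extra_word_chars(texts):
    """Collect alphanumeric characters that are missing from the ASCII default set."""
    return {
        ch
        for text in texts
        for ch in text
        if ch.isalnum() and ch not in DEFAULT_WORD_CHARS
    }

sentence = '#frühstück äck öck ßck ack'
extra = extra_word_chars([sentence, 'üü'])
print(sorted(extra))
# → ['ß', 'ä', 'ö', 'ü']
```

The resulting set can then be passed to keyword_processor.non_word_boundaries.update(extra), in the same way as the umlaut string above. Uppercase variants such as 'Ä' only show up if they occur in the input texts.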

Salma-Bouzid commented 4 years ago

Thanks so much, Ben!