vi3k6i5 / flashtext

Extract Keywords from sentence or Replace keywords in sentences.
MIT License
5.57k stars 598 forks

Greek Language #110

Closed psalias2006 closed 4 years ago

psalias2006 commented 4 years ago

Hi, I think flashtext is not working as it is supposed to with the Greek language.

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()

keyword_processor.add_keywords_from_list(['πο'])
keyword_processor.extract_keywords('πομολο')
iwpnd commented 4 years ago

Flashtext does not know the Greek alphabet: by default it treats only ASCII letters, digits, and the underscore as word characters.

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.set_non_word_boundaries(set('πομολο')) # or the entire script if you want

print(keyword_processor.non_word_boundaries)
>> {'μ', 'π', 'ο', 'λ'}

keyword_processor.add_keywords_from_list(['πο'])
keyword_processor.extract_keywords('πο μολο')
>> ['πο']

If you don't put a space between πο and μολο, extract_keywords() will not yield a result, because μ is now a non-word-boundary character and πο is therefore read as the start of a longer word. If you want to understand how flashtext works, this will get you started.
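
To make the boundary rule concrete, here is a rough pure-Python model of it. This is only a sketch under my own assumptions: `extract_whole_tokens` is a hypothetical helper, not part of flashtext, and it uses a naive scan rather than flashtext's trie. It only illustrates that a keyword counts when the characters on both sides are outside the non-word-boundary set.

```python
def extract_whole_tokens(sentence, keywords, non_word_boundaries):
    """Toy model: return keywords that occur in `sentence` as whole tokens."""
    found = []
    i = 0
    while i < len(sentence):
        # A match may only begin at the start of the text
        # or right after a character that acts as a boundary.
        starts_token = i == 0 or sentence[i - 1] not in non_word_boundaries
        match = None
        if starts_token:
            # Try longer keywords first so the longest one wins.
            for keyword in sorted(keywords, key=len, reverse=True):
                end = i + len(keyword)
                ends_token = (
                    end == len(sentence)
                    or sentence[end] not in non_word_boundaries
                )
                if sentence.startswith(keyword, i) and ends_token:
                    match = keyword
                    break
        if match:
            found.append(match)
            i += len(match)
        else:
            i += 1
    return found


greek = set("πομολο")
print(extract_whole_tokens("πο μολο", ["πο"], greek))  # ['πο']: the space ends the token
print(extract_whole_tokens("πομολο", ["πο"], greek))   # []: "μ" continues the word
```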

psalias2006 commented 4 years ago

Thanks for the help! I'll give it a try

psalias2006 commented 4 years ago

Based on @iwpnd's post, I have this workaround for Greek words:

import string
from flashtext import KeywordProcessor

greek_simple = 'ΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω'
greek_accent = 'άέήίόύώΆΈΉΊΌΎΏ'
non_word_boundaries = set(string.digits + string.ascii_letters + greek_simple + greek_accent + '_')

keyword_processor = KeywordProcessor()
keyword_processor.set_non_word_boundaries(non_word_boundaries)
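
For anyone who would rather not type the alphabet by hand, the set could probably also be built from the Unicode database. This is just a sketch: `greek_letters` is a hypothetical helper name, and the two code point ranges (Greek and Coptic, Greek Extended) are my own choice of what counts as "the entire script". It also picks up letters with dialytika such as ϊ and ϋ, which the hand-typed strings above leave out.

```python
import string
import unicodedata


def greek_letters():
    """Collect alphabetic characters whose Unicode name starts with GREEK."""
    letters = set()
    # Greek and Coptic block, then Greek Extended (polytonic) block.
    for start, end in ((0x0370, 0x03FF), (0x1F00, 0x1FFF)):
        for codepoint in range(start, end + 1):
            ch = chr(codepoint)
            # name() falls back to "" for unassigned code points;
            # the GREEK prefix filters out the Coptic letters in the first block.
            if ch.isalpha() and unicodedata.name(ch, "").startswith("GREEK"):
                letters.add(ch)
    return letters


non_word_boundaries = set(string.digits + string.ascii_letters + "_") | greek_letters()
```

The resulting set can be passed to set_non_word_boundaries() the same way as above. Note that it only covers precomposed characters: text normalized to NFD carries accents as separate combining marks, which are not in this set.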

Thanks again!