vi3k6i5 / flashtext

Extract Keywords from sentence or Replace keywords in sentences.
MIT License

Not working with Chinese. #134

Open joshhu opened 2 years ago

joshhu commented 2 years ago

A lot of matches are missed when the text contains only Chinese characters rather than space-separated words. Modifying line 523 in keyword.py does not help at all.

Hyprnx commented 2 years ago

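I faced the same problem and more or less fixed it by adding my alphabet's characters (in your case, Chinese) to the character sets defined next to the self._white_space_chars variable. In the snippet below they end up in self.non_word_boundaries: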

self._keyword = '_keyword_'
self._white_space_chars = set(['.', '\t', '\n', '\a', ' ', ','])
vn_text = 'àáãạảăắằẳẵặâấầẩẫậèéẹẻẽêềếểễệđìíĩỉịòóõọỏôốồổỗộơớờởỡợùúũụủưứừửữựỳỵỷỹýÀÁÃẠẢĂẮẰẲẴẶÂẤẦẨẪẬÈÉẸẺẼÊỀẾỂỄỆĐÌÍĨỈỊÒÓÕỌỎÔỐỒỔỖỘƠỚỜỞỠỢÙÚŨỤỦƯỨỪỬỮỰỲỴỶỸÝ' # My Language alphabet characters
other_text = 'äöüßÄÖÜß' # German alphabet characters
try:
    # Python 2.x: string.letters exists
    self.non_word_boundaries = set(string.digits + string.letters + '_' + vn_text + other_text)
except AttributeError:
    # Python 3.x: string.letters was removed, so fall back to string.ascii_letters
    self.non_word_boundaries = set(string.digits + string.ascii_letters + '_' + vn_text + other_text)
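The idea in the snippet above can be tried outside the library. The following is a minimal, standalone sketch, not flashtext's own code: `build_non_word_boundaries` and `is_word_boundary` are hypothetical helper names, and the sketch assumes that a character counts as a word boundary exactly when it is absent from the `non_word_boundaries` set.

```python
import string


def build_non_word_boundaries(extra_chars=""):
    """Return the set of characters treated as part of a word.

    Hypothetical helper: mirrors the Python 3 branch of the snippet above,
    with the language-specific characters passed in as `extra_chars`.
    """
    return set(string.digits + string.ascii_letters + "_" + extra_chars)


def is_word_boundary(char, non_word_boundaries):
    """Assumed rule: a character ends a word if it is not in the set."""
    return char not in non_word_boundaries


german = "äöüßÄÖÜ"  # example extra characters; replace with your own alphabet

default_set = build_non_word_boundaries()
extended_set = build_non_word_boundaries(german)

print(is_word_boundary("ü", default_set))   # True: "ü" splits words by default
print(is_word_boundary("ü", extended_set))  # False: "ü" is now part of a word
```

If the installed flashtext version exposes `set_non_word_boundaries` on `KeywordProcessor` (worth checking against your copy of keyword.py), a set built this way could be passed there instead of editing the library source.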