Open joshhu opened 2 years ago
Faced the same problem. I more or less fixed it by adding my language's alphabet characters (in your case, Chinese) to the self.non_word_boundaries set, which is defined right after the self._white_space_chars variable:
self._keyword = '_keyword_'
self._white_space_chars = set(['.', '\t', '\n', '\a', ' ', ','])
vn_text = 'àáãạảăắằẳẵặâấầẩẫậèéẹẻẽêềếểễệđìíĩỉịòóõọỏôốồổỗộơớờởỡợùúũụủưứừửữựỳỵỷỹýÀÁÃẠẢĂẮẰẲẴẶÂẤẦẨẪẬÈÉẸẺẼÊỀẾỂỄỆĐÌÍĨỈỊÒÓÕỌỎÔỐỒỔỖỘƠỚỜỞỠỢÙÚŨỤỦƯỨỪỬỮỰỲỴỶỸÝ'  # My language's alphabet characters
other_text = 'äöüßÄÖÜß'  # German alphabet characters
try:
    # python 2.x
    self.non_word_boundaries = set(string.digits + string.letters + '_' + vn_text + other_text)
except AttributeError:
    # python 3.x
    self.non_word_boundaries = set(string.digits + string.ascii_letters + '_' + vn_text + other_text)
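To illustrate why extending that set matters, here is a minimal standalone sketch of the boundary check. It is not the library's actual implementation: find_keywords is a hypothetical helper written only for this example, and it assumes a match counts only when the characters on both sides of it are outside the non-word-boundary set.

```python
import string


def find_keywords(text, keywords, non_word_boundaries):
    """Hypothetical helper: return the keywords that occur in text as whole words.

    A hit is accepted only if the character before it and the character
    after it are NOT in non_word_boundaries (or the hit touches the
    start/end of the text).
    """
    found = []
    for kw in keywords:
        start = text.find(kw)
        while start != -1:
            end = start + len(kw)
            before_ok = start == 0 or text[start - 1] not in non_word_boundaries
            after_ok = end == len(text) or text[end] not in non_word_boundaries
            if before_ok and after_ok:
                found.append(kw)
                break  # one accepted hit per keyword is enough here
            start = text.find(kw, start + 1)
    return found


# Default-style set: ASCII letters, digits and underscore only.
ascii_only = set(string.digits + string.ascii_letters + '_')
# Same set plus one accented letter (U+1EC7) as a stand-in for vn_text.
extended = ascii_only | set('\u1ec7')

text = 'vi\u1ec7t nam'  # "viet nam" with the accented e (U+1EC7)

# With the ASCII-only set the accented letter looks like a boundary,
# so 'vi' is wrongly accepted as a whole word.
false_hit = find_keywords(text, ['vi'], ascii_only)
# With the extended set the accented letter is part of the word,
# so 'vi' is no longer reported.
no_hit = find_keywords(text, ['vi'], extended)
```

If I remember correctly, KeywordProcessor also has add_non_word_boundary() and set_non_word_boundaries() methods, so the same change should be possible without editing keyword.py directly. Treat that as something to verify against your installed version.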
A lot of matches are missed when the keywords are only Chinese characters, not words. Modifying line 523 in keyword.py does not work at all.