(Issue closed by HCIS2020 6 years ago)
Another problem: version 1.9 could NOT handle patterns such as `#HELLO#`.
The tokenizer should split it into `#`, `HELLO`, `#`, but it currently emits `#HELLO#` as a single token.
`Tokenizer.py` `texts_to_words()` should handle the following cases correctly:

1. `HELLO`
2. `HELLO WORLD`
3. `#HELLO#`
4. `HELLOWORLD`
5. `你好`
6. `你好,谢谢你`
7. `200万`
8. `#你好#`
9. `你好HELLO`
10. `HELLO#你好,谢谢你#WORLD OK`
I modified the code to handle all of the above correctly:
```python
def _is_wildchar(self, ch):
    # AIML wildcard characters (the garbled list in the original
    # post appears to have dropped the underscore)
    MATCH_CHARS = ['^', '#', '_', '*']
    return bool(ch in MATCH_CHARS)

def texts_to_words(self, texts):
    if not texts:
        return []
    words = []
    last_word = ''
    for ch in texts:
        if CjkTokenizer._is_chinese_char(ch):
            # Chinese characters become single-character tokens
            if len(last_word) > 0:
                words.append(last_word)
                last_word = ''
            words.append(ch)
        else:
            if self._is_wildchar(ch):
                # Wildcards also become single-character tokens
                if len(last_word) > 0:
                    words.append(last_word)
                    last_word = ''
                words.append(ch)
            else:
                if ch == self.split_chars:
                    # Split character ends the current word
                    if len(last_word) > 0:
                        words.append(last_word)
                        last_word = ''
                else:
                    last_word += ch
    if len(last_word) > 0:
        words.append(last_word)
    return words
```
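For illustration, here is a standalone sketch of the same splitting logic that can be run against the cases above. This is a hypothetical minimal re-implementation, not program-y's actual `CjkTokenizer` class: `_is_chinese_char` is simplified to the CJK Unified Ideographs block, and the split character is assumed to be a space.

```python
def _is_chinese_char(ch):
    # Simplified check: CJK Unified Ideographs block only
    # (the real tokenizer covers additional ranges)
    return '\u4e00' <= ch <= '\u9fff'

WILDCARDS = {'^', '#', '_', '*'}  # AIML wildcard characters
SPLIT_CHAR = ' '                  # assumed word separator

def texts_to_words(texts):
    words = []
    last_word = ''
    for ch in texts:
        if _is_chinese_char(ch) or ch in WILDCARDS:
            # Chinese characters and wildcards are emitted as
            # single-character tokens, flushing any pending word
            if last_word:
                words.append(last_word)
                last_word = ''
            words.append(ch)
        elif ch == SPLIT_CHAR:
            if last_word:
                words.append(last_word)
                last_word = ''
        else:
            last_word += ch
    if last_word:
        words.append(last_word)
    return words

print(texts_to_words('#HELLO#'))   # ['#', 'HELLO', '#']
print(texts_to_words('#你好#'))     # ['#', '你', '好', '#']
print(texts_to_words('你好HELLO'))  # ['你', '好', 'HELLO']
```

With this logic, `#HELLO#` is split into `#`, `HELLO`, `#` rather than kept as one token, which is the behavior the fix above aims for.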
You can check my fixing: https://github.com/tomliau33/program-y/commit/7abaadb900b3133c42a901e4b73c95529a188dce
I finally discarded the `last_word`-related code; I don't think handling `last_word` is necessary.
Just added this to the latest code base; tests are running, and v1.9.1 is likely to be released tonight.
v1.9.1 released with the above fixes included.
`'` should be `‘`