(Issue closed by HCIS2020 6 years ago)
Another problem: version 1.9 could NOT handle patterns such as `#HELLO#`.
The tokenizer should split it into `#`, `HELLO`, `#`, but it currently emits `#HELLO#` as a single token.
`Tokenizer.py` `texts_to_words()` should handle the following cases correctly:

1. `HELLO`
2. `HELLO WORLD`
3. `#HELLO#`
4. `HELLOWORLD`
5. `你好`
6. `你好,谢谢你`
7. `200万`
8. `#你好#`
9. `你好HELLO`
10. `HELLO#你好,谢谢你#WORLD OK`
I modified the code to handle all of the above correctly:
```python
def _is_wildchar(self, ch):
    # AIML wildcard characters (the garbled list in the original
    # post appears to have dropped the underscore)
    MATCH_CHARS = ['^', '#', '_', '*']
    return bool(ch in MATCH_CHARS)

def texts_to_words(self, texts):
    if not texts:
        return []
    words = []
    last_word = ''
    for ch in texts:
        if CjkTokenizer._is_chinese_char(ch):
            # Chinese characters become single-character tokens
            if len(last_word) > 0:
                words.append(last_word)
                last_word = ''
            words.append(ch)
        else:
            if self._is_wildchar(ch):
                # Wildcards also become single-character tokens
                if len(last_word) > 0:
                    words.append(last_word)
                    last_word = ''
                words.append(ch)
            else:
                if ch == self.split_chars:
                    # Split character ends the current word
                    if len(last_word) > 0:
                        words.append(last_word)
                        last_word = ''
                else:
                    last_word += ch
    if len(last_word) > 0:
        words.append(last_word)
    return words
```
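For illustration, here is a standalone sketch of the same splitting logic that can be run against the cases above. This is a hypothetical minimal re-implementation, not program-y's actual `CjkTokenizer` class: `_is_chinese_char` is simplified to the CJK Unified Ideographs block, and the split character is assumed to be a space.

```python
def _is_chinese_char(ch):
    # Simplified check: CJK Unified Ideographs block only
    # (the real tokenizer covers additional ranges)
    return '\u4e00' <= ch <= '\u9fff'

WILDCARDS = {'^', '#', '_', '*'}  # AIML wildcard characters
SPLIT_CHAR = ' '                  # assumed word separator

def texts_to_words(texts):
    words = []
    last_word = ''
    for ch in texts:
        if _is_chinese_char(ch) or ch in WILDCARDS:
            # Chinese characters and wildcards are emitted as
            # single-character tokens, flushing any pending word
            if last_word:
                words.append(last_word)
                last_word = ''
            words.append(ch)
        elif ch == SPLIT_CHAR:
            if last_word:
                words.append(last_word)
                last_word = ''
        else:
            last_word += ch
    if last_word:
        words.append(last_word)
    return words

print(texts_to_words('#HELLO#'))   # ['#', 'HELLO', '#']
print(texts_to_words('#你好#'))     # ['#', '你', '好', '#']
print(texts_to_words('你好HELLO'))  # ['你', '好', 'HELLO']
```

With this logic, `#HELLO#` is split into `#`, `HELLO`, `#` rather than kept as one token, which is the behavior the fix above aims for.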
You can check my fixing: https://github.com/tomliau33/program-y/commit/7abaadb900b3133c42a901e4b73c95529a188dce
I finally discarded the `last_word`-related code; I don't think handling `last_word` is necessary.
Just added this to the latest code base; tests are running, and v1.9.1 is likely to be released tonight.
v1.9.1 released with the above fixes included.
`'` should be `‘`