tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

There may be a problem in tokenizer.py #567

Open drockser opened 6 years ago

drockser commented 6 years ago

I ran into a problem when using tensor2tensor to train a translation model and decode some sentences. The error is 'IndexError: string index out of range'. I debugged the failing sentence and found that a subword '\' is generated at the end of the sentence. '\' is a special character, so it unescapes to an empty token of length 0. When tokenizer.py's decode function runs the line 'token_is_alnum = [t[0] in _ALPHANUMERIC_CHAR_SET for t in tokens]', t[0] raises the error for that zero-length token. But I don't know why '\' is created at the end of the sentence.
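
For reference, here is a minimal sketch of the failure. The '_ALPHANUMERIC_CHAR_SET' and 'decode' below are simplified stand-ins for the real definitions in tensor2tensor/data_generators/tokenizer.py, only to show how an empty token makes 't[0]' crash:

```python
# Simplified stand-ins for the real definitions in
# tensor2tensor/data_generators/tokenizer.py, only to reproduce the crash.
import string

_ALPHANUMERIC_CHAR_SET = set(string.ascii_letters + string.digits)

def decode(tokens):
  # The failing line: t[0] assumes every token has at least one character.
  token_is_alnum = [t[0] in _ALPHANUMERIC_CHAR_SET for t in tokens]
  return token_is_alnum

print(decode(["hello", "world"]))  # [True, True]
print(decode(["hello", ""]))       # IndexError: string index out of range
```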

stefan-it commented 6 years ago

@drockser Does this problem still exist in the latest version of tensor2tensor?

drockser commented 6 years ago

@stefan-it I'm not using the latest version of tensor2tensor, but I read the code of the latest tokenizer.py and the line 'token_is_alnum = [t[0] in _ALPHANUMERIC_CHAR_SET for t in tokens]' is still there, so I think the problem also exists in the latest version.
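
A possible workaround (my own suggestion, not something from the maintainers) would be to treat empty tokens as non-alphanumeric instead of indexing into them, e.g.:

```python
# Sketch of a defensive version of the failing line: bool(t) short-circuits
# before t[0], so a zero-length token no longer raises IndexError.
import string

_ALPHANUMERIC_CHAR_SET = set(string.ascii_letters + string.digits)

def token_is_alnum(tokens):
  return [bool(t) and t[0] in _ALPHANUMERIC_CHAR_SET for t in tokens]

print(token_is_alnum(["hello", "", "world"]))  # [True, False, True]
```

That only avoids the crash, though; it does not explain why the trailing '\' subword is generated in the first place.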