tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

There may be a problem in tokenizer.py #567

Open drockser opened 6 years ago

drockser commented 6 years ago

I ran into a problem when using tensor2tensor to train a translation model and decode some sentences. The error is 'IndexError: string index out of range'. I debugged the failing sentence and found that a subword '\' is generated at the end of the sentence. '\' is a special character, so it unescapes to an empty token of length 0. When tokenizer.py's decode function runs the line 'token_is_alnum = [t[0] in _ALPHANUMERIC_CHAR_SET for t in tokens]', t[0] raises the error for that zero-length token. But I don't know why '\' is created at the end of the sentence.
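
For reference, here is a minimal sketch of the failure. The '_ALPHANUMERIC_CHAR_SET' and 'decode' below are simplified stand-ins for the real definitions in tensor2tensor/data_generators/tokenizer.py, only to show how an empty token makes 't[0]' crash:

```python
# Simplified stand-ins for the real definitions in
# tensor2tensor/data_generators/tokenizer.py, only to reproduce the crash.
import string

_ALPHANUMERIC_CHAR_SET = set(string.ascii_letters + string.digits)

def decode(tokens):
  # The failing line: t[0] assumes every token has at least one character.
  token_is_alnum = [t[0] in _ALPHANUMERIC_CHAR_SET for t in tokens]
  return token_is_alnum

print(decode(["hello", "world"]))  # [True, True]
print(decode(["hello", ""]))       # IndexError: string index out of range
```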

stefan-it commented 6 years ago

@drockser Does this problem still exist in the latest version of tensor2tensor?

drockser commented 6 years ago

@stefan-it I'm not using the latest version of tensor2tensor, but I read the code of the latest tokenizer.py and the line 'token_is_alnum = [t[0] in _ALPHANUMERIC_CHAR_SET for t in tokens]' is still there, so I think the problem also exists in the latest version.
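
A possible workaround (my own suggestion, not something from the maintainers) would be to treat empty tokens as non-alphanumeric instead of indexing into them, e.g.:

```python
# Sketch of a defensive version of the failing line: bool(t) short-circuits
# before t[0], so a zero-length token no longer raises IndexError.
import string

_ALPHANUMERIC_CHAR_SET = set(string.ascii_letters + string.digits)

def token_is_alnum(tokens):
  return [bool(t) and t[0] in _ALPHANUMERIC_CHAR_SET for t in tokens]

print(token_is_alnum(["hello", "", "world"]))  # [True, False, True]
```

That only avoids the crash, though; it does not explain why the trailing '\' subword is generated in the first place.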