Closed SmileTM closed 4 years ago
Thanks for the contribution! Can you use double quotes instead of single quotes? Our code checker won't let me merge the PR until that is fixed.
In vocab_chinese.txt of the Chinese models released on December 30, 2019, tokenization.py raises the error mentioned above when token = "\u2028". Can you tell me how you processed "\u2028" when building the Chinese models? Was "\u2028" assigned id = 343, or was it just ignored? @0x0539
You can refer to my commit for the change to tokenization.py.
'\u2028' is treated like '\n'.
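The behavior can be reproduced without the model files. The following is a minimal sketch, assuming the vocab loader parses each line with `token.strip().split()[0]` (the expression quoted in this thread) and that the error in question is the resulting `IndexError`; `parse_vocab_line` is a hypothetical helper, not a function from tokenization.py.

```python
# Hypothetical helper mirroring the per-line parsing discussed in this thread.
def parse_vocab_line(line):
    return line.strip().split()[0]


# The vocab entry U+2028 (LINE SEPARATOR) followed by the line terminator.
line = "\u2028\n"

# U+2028 counts as whitespace in Python, so strip() removes it entirely.
is_space = "\u2028".isspace()
stripped = line.strip()

# str.splitlines() also treats U+2028 as a line boundary, like "\n".
pieces = "a\u2028b".splitlines()

# With nothing left after strip(), split() returns [] and [0] fails.
try:
    parse_vocab_line(line)
    error = None
except IndexError as exc:
    error = type(exc).__name__

print(is_space, repr(stripped), pieces, error)
# → True '' ['a', 'b'] IndexError
```

An ordinary entry such as `"hello\n"` parses to `"hello"` as expected; only entries made up entirely of whitespace characters hit this path.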
@SmileTM Thank you, but your answer doesn't resolve my confusion. Your commit was pushed on 2020.1.10, while Google's Chinese models were released on 2019.12.30. I wonder how they processed the vocab at that time. A different tokenization would mean a different model.
@SmileTM
So you think Google does:
    token = token.strip().split()[0] if token.strip() else ' '
and not:
    token = token.strip()
    if not token:
        continue
    token = token.split()[0]
?
Yes. If you change the vocab, the pretrained model may not work well, so we need to follow the vocab until the official maintainers reply to this question. But I think this is a bug, because ' ' is not in the vocab of BERT and XLNet.
If token = ' ', there will be an error.
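The two variants quoted in this thread differ in the ids given to every token after the blank entry. Here is a minimal sketch of that difference, assuming ids are assigned in line order as `len(vocab)`; the function names and the toy vocabulary are hypothetical, and nothing here confirms which variant was used for the released models.

```python
import collections


def load_vocab_placeholder(lines):
    """Variant 1: a blank entry becomes ' ' and keeps its own id."""
    vocab = collections.OrderedDict()
    for line in lines:
        token = line.strip().split()[0] if line.strip() else " "
        vocab[token] = len(vocab)
    return vocab


def load_vocab_skip(lines):
    """Variant 2: a blank entry is skipped, so later ids shift down by one."""
    vocab = collections.OrderedDict()
    for line in lines:
        token = line.strip()
        if not token:
            continue
        token = token.split()[0]
        vocab[token] = len(vocab)
    return vocab


# Toy vocabulary (an assumption for illustration): the third line is U+2028.
toy_lines = ["[PAD]\n", "a\n", "\u2028\n", "b\n", "c\n"]

with_placeholder = load_vocab_placeholder(toy_lines)
with_skip = load_vocab_skip(toy_lines)

print(with_placeholder["b"], with_skip["b"])
# → 3 2
```

If a checkpoint was trained with one variant and the vocab is later loaded with the other, every token after the blank line would map to a different id, which is the concern raised above about a different model on a different tokenization.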