google-research / albert

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Apache License 2.0

Update tokenization.py #118

Closed SmileTM closed 4 years ago

SmileTM commented 4 years ago

If token == ' ', tokenization.py will raise an error.
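A minimal standalone reproduction of the failure (an illustrative sketch of the strip-and-split vocab loading, not the exact load_vocab code):

```python
# Illustrative sketch: a vocab line containing only whitespace strips
# down to "", and "".split() returns [], so indexing [0] raises.
line = "\u2028\n"  # the U+2028 (LINE SEPARATOR) entry in vocab_chinese.txt
try:
    token = line.strip().split()[0]
except IndexError as err:
    print("IndexError:", err)  # list index out of range
```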

0x0539 commented 4 years ago

Thanks for the contribution! Can you use double-quote symbols instead of single quotes? Our code checker won't let me merge the PR until that is fixed.

majialin commented 4 years ago

In vocab_chinese.txt of the Chinese models released on December 30, 2019, when token = "\u2028", tokenization.py hits the error mentioned above. Can you tell me how you processed "\u2028" to get the Chinese models? Was "\u2028" processed as id = 343, or just ignored? @0x0539

SmileTM commented 4 years ago

> In vocab_chinese.txt of the Chinese models released on December 30, 2019, when token = "\u2028", tokenization.py hits the error mentioned above. Can you tell me how you processed "\u2028" to get the Chinese models? Was "\u2028" processed as id = 343, or just ignored? @0x0539

You can refer to my commit for the change to tokenization.py.

'\u2028' acts like '\n'.
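For reference, plain Python already treats U+2028 as whitespace and as a line boundary (an illustrative check, not part of the commit):

```python
# Illustrative check in plain Python (not part of the commit):
print("\u2028".isspace())       # True -- stripped just like "\n"
print(repr("\u2028 ".strip()))  # '' -- a whitespace-only line vanishes
print("a\u2028b".splitlines())  # ['a', 'b'] -- it is a line separator
```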

majialin commented 4 years ago

@SmileTM Thank you, but your answer doesn't resolve my confusion. Your commit was pushed on 2020-01-10, while Google's Chinese models were released on 2019-12-30. I wonder how they processed the vocab at that time; a different tokenization would produce a different model.

majialin commented 4 years ago

@SmileTM So you think Google did: token = token.strip().split()[0] if token.strip() else ' ' rather than:

token = token.strip()
if not token:
  continue
token = token.split()[0]

?
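A quick side-by-side sketch (hypothetical, run on a tiny in-memory vocab rather than the real vocab_chinese.txt) of how the two variants diverge on a whitespace-only line:

```python
# Hypothetical side-by-side sketch on a tiny in-memory vocab (not the
# real vocab_chinese.txt). Only the whitespace-only line is handled
# differently, but variant B shifts every id after it.
lines = ["foo\n", "\u2028\n", "bar\n"]

# Variant A: keep the whitespace-only line, mapping it to the token " ".
vocab_a = {}
for line in lines:
    token = line.strip().split()[0] if line.strip() else " "
    vocab_a[token] = len(vocab_a)

# Variant B: skip whitespace-only lines entirely.
vocab_b = {}
for line in lines:
    token = line.strip()
    if not token:
        continue
    vocab_b[token.split()[0]] = len(vocab_b)

print(vocab_a)  # {'foo': 0, ' ': 1, 'bar': 2}
print(vocab_b)  # {'foo': 0, 'bar': 1} -- "bar" gets a different id
```

Variant A would give "\u2028" its own id (the id = 343 case mentioned above), while variant B would shift every later token's id down by one.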

SmileTM commented 4 years ago

> @SmileTM So you think Google did: token = token.strip().split()[0] if token.strip() else ' ' rather than:
>
> token = token.strip()
> if not token:
>   continue
> token = token.split()[0]
>
> ?

Yes. If you change the vocab, the pretrained model may not work well, so we need to follow the vocab until the officials reply to this question. But I think this is a bug, because ' ' is not in the vocab of BERT or XLNet.