google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Why does Bert-chinese use do_lower_case=False? #1188

Open Fei-Wang opened 3 years ago

Fei-Wang commented 3 years ago

Some Chinese text contains English words, for example: "Apples是苹果的复数形式。". I have two questions about how to tokenize such text:

  1. Why is the Chinese BERT case-sensitive when I can't find even 'A' in vocab.txt?
  2. Since English words are rare in the Chinese vocab.txt, should I use the default WordPiece tokenizer, giving "['apple', '##s', '是', '苹', ...]", or split English into characters, giving "['a', 'p', 'p', 'l', 'e', 's', '是', '苹', ...]"?
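The two options above can be compared with a minimal sketch of BERT-style tokenization: CJK characters are split one per token, Latin runs are kept together and then split by greedy longest-match WordPiece. The vocabulary here is a tiny hypothetical stand-in, not the real vocab.txt:

```python
# Toy stand-in for vocab.txt (hypothetical; the real file has ~21k entries).
TOY_VOCAB = {"apple", "##s", "是", "苹", "果", "[UNK]"}

def wordpiece(token, vocab=TOY_VOCAB, max_chars=100):
    """Greedy longest-match-first subword split, in the style of
    WordpieceTokenizer in BERT's tokenization.py."""
    if len(token) > max_chars:
        return ["[UNK]"]
    pieces, start = [], 0
    while start < len(token):
        end, cur = len(token), None
        while start < end:
            piece = token[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]
        pieces.append(cur)
        start = end
    return pieces

def tokenize(text):
    """Split each CJK ideograph into its own token (as BERT's
    BasicTokenizer does); keep Latin runs together for WordPiece."""
    out, buf = [], ""
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
            if buf:
                out.extend(wordpiece(buf))
                buf = ""
            out.extend(wordpiece(ch))
        else:
            buf += ch
    if buf:
        out.extend(wordpiece(buf))
    return out

print(tokenize("apples是苹果"))  # ['apple', '##s', '是', '苹', '果']
```

This reproduces option 1 from the question; option 2 (per-character English) would amount to putting only single letters and `##`-letters in the vocabulary.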
mianzhiwj commented 3 years ago

To be precise, it is do_lower_case = True: the official Bert-chinese released by Google defaults to do_lower_case = True. So when using it, you should also lowercase your input; otherwise some English text tokenizes to [UNK]. For question 2, use the first tokenization. Many open-source libraries wrap the tokenizer for you, and their output is the first form.
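The effect described above can be shown with the same kind of toy WordPiece sketch: if the vocabulary contains only lowercase pieces (as the released Chinese checkpoint's vocab.txt effectively does for English), a cased token falls out as [UNK] unless the input is lowercased first. The vocabulary below is hypothetical:

```python
# Toy lowercase-only vocab (stand-in for vocab.txt, not the real file).
TOY_VOCAB = {"apple", "##s", "[UNK]"}

def wordpiece(token, vocab=TOY_VOCAB):
    """Greedy longest-match-first split; tokens with no match become [UNK]."""
    pieces, start = [], 0
    while start < len(token):
        for end in range(len(token), start, -1):
            piece = token[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            # no piece of any length matched at this position
            return ["[UNK]"]
    return pieces

print(wordpiece("Apples"))          # ['[UNK]'] — no cased pieces in the vocab
print(wordpiece("Apples".lower()))  # ['apple', '##s']
```

Lowercasing before tokenization is exactly what do_lower_case=True does in the BasicTokenizer step, which is why it should match the setting the checkpoint was trained with.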