google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

What does the ## prefix on Chinese characters in BERT's vocab.txt mean? Tokens with the ## prefix never seem to appear after tokenizing Chinese corpora #572

Closed zhuyuuyuhz closed 5 years ago

Kiteflyingee commented 4 years ago

@zhuyuuyuhz I have a similar question. Chinese has no subwords, so why does vocab.txt in BERT's official Chinese pre-trained model contain tokens of the form "##字"? The tokenizer can never produce a ## token for Chinese text. Has this been resolved?

monk678 commented 4 years ago

Same question; hoping someone can explain.

Crescentz commented 3 years ago

the same question

mianzhiwj commented 3 years ago

Same question. About half of the entries in the Chinese vocab.txt can never be produced by tokenization. The tokenizer adds whitespace around every CJK character, so CJK tokens with the '##' prefix, such as '##口', '##古', '##句', '##另', will never be used. Also, because do_lower_case is set to True, the vocabulary is missing some necessary tokens such as '##A' through '##Z' and full-width letters, so some phrases will not be tokenized correctly unless they are converted to lower case.
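A quick way to see this is a minimal sketch assuming the repo's tokenization.py is importable and the chinese_L-12_H-768_A-12 checkpoint has been downloaded (the path below is illustrative):

```python
import tokenization

tokenizer = tokenization.FullTokenizer(
    vocab_file="chinese_L-12_H-768_A-12/vocab.txt", do_lower_case=True)

# BasicTokenizer first inserts spaces around every CJK character, so each
# Chinese character becomes its own "word" and WordPiece never emits a
# '##' continuation piece for Chinese text.
print(tokenizer.tokenize("语言模型"))
# -> ['语', '言', '模', '型']  (no '##' pieces)

# Latin-script words, by contrast, can still be split into '##' pieces:
print(tokenizer.tokenize("tokenization"))
# -> e.g. ['token', '##ization'], depending on what is in the vocab
```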

pipilove commented 1 year ago

This is probably for whole-word-masking BERT models; see https://youzipi.blog.csdn.net/article/details/84951508#t3

YSU-Yk commented 1 year ago

Has this been resolved? I have a similar question. When reading the whole-word-masking pre-training code I saw that the non-initial tokens of a word do get a ## prefix, but only for the purpose of whole word masking; when converting to token ids they are still the plain single characters. So what is the point of the "##" + Chinese-character entries in the vocabulary?
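For anyone else confused by this, here is an illustrative sketch of that intermediate form (not the repo's actual create_pretraining_data.py code; the word segmentation is a hypothetical input):

```python
import random

def mark_whole_words(segmented_words):
    """Prefix non-initial characters of each word with '##' (intermediate form)."""
    marked = []
    for word in segmented_words:
        marked.append(word[0])
        marked.extend("##" + ch for ch in word[1:])
    return marked

def whole_word_mask(marked_tokens, mask_prob=0.15):
    """Group '##' continuations with the preceding token and mask whole groups together."""
    groups = []
    for tok in marked_tokens:
        if tok.startswith("##") and groups:
            groups[-1].append(tok)
        else:
            groups.append([tok])
    output = []
    for group in groups:
        masked = random.random() < mask_prob
        for tok in group:
            # Strip the '##' before emitting: the final training example still
            # uses plain single-character tokens / ids.
            output.append("[MASK]" if masked else tok.lstrip("#"))
    return output

marked = mark_whole_words([["语", "言"], ["模", "型"]])
print(marked)                    # ['语', '##言', '模', '##型']
print(whole_word_mask(marked))   # e.g. ['[MASK]', '[MASK]', '模', '型']
```

So the "##字" entries are only needed while deciding which positions to mask as a unit; the ids written to the training data still come from the un-prefixed single-character entries.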

liyuqing1 commented 11 months ago

Hello, has this been resolved? Whole word masking (wwm) is indeed only used for the pre-training task: the ## form is an intermediate representation, and the actual tokens are still single characters. The Bert-wwm authors seem to have had a similar question: https://github.com/ymcui/Chinese-BERT-wwm/issues/96