Closed by zhuyuuyuhz 5 years ago
Same question, hoping someone can answer this.
the same question
Same question. About half of the entries in the Chinese vocab.txt will never be used after tokenization. The tokenizer adds whitespace around every CJK character, so the CJK tokens carrying a '##' prefix, like '##口', '##古', '##句', '##另', can never be produced. In addition, when do_lower_case is set to True, the vocab is also missing some necessary tokens, such as '##A' ~ '##Z' and full-width letters, so some phrases will not be tokenized correctly unless they are converted to lower case first. (A small check is sketched below.)
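A minimal sketch of the behavior described above, assuming the Hugging Face `transformers` package and the public `bert-base-chinese` checkpoint (the `is_cjk` helper is my own): because CJK characters are surrounded by whitespace before WordPiece runs, '##'-prefixed CJK tokens never appear in the output, and you can count how many vocab entries are therefore unreachable.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

# CJK characters are always split into standalone tokens,
# never into '##' continuation pieces.
print(tokenizer.tokenize("全词掩码预训练"))
# -> ['全', '词', '掩', '码', '预', '训', '练']

# Count vocab entries of the form '##' + a single CJK character;
# the tokenizer can never emit these for Chinese input.
def is_cjk(ch):
    return "\u4e00" <= ch <= "\u9fff"

unused = [tok for tok in tokenizer.vocab
          if tok.startswith("##") and len(tok) == 3 and is_cjk(tok[2])]
print(len(unused), unused[:5])
```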
These are probably meant for the whole word masking (wwm) BERT models; see https://youzipi.blog.csdn.net/article/details/84951508#t3
Has this been resolved? I have a similar question. When reading the whole word masking pre-training code, I saw that the tokens inside a word do get a '##' prefix, but that is only for whole word masking; when converting to token ids they still correspond to the single characters. So what purpose do the '## + Chinese character' entries in the vocab serve?
Hi, has this been resolved? Whole word masking (wwm) is indeed only used in the pre-training task; the '##' form is an intermediate representation, and the tokens are in fact still single characters. The Chinese-BERT-wwm authors seem to have a similar question as well: https://github.com/ymcui/Chinese-BERT-wwm/issues/96 (a sketch of that intermediate '##' handling follows below).
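To illustrate what the comments above describe, here is only a sketch with hypothetical helper names (`group_whole_words`, `to_token_ids`), not the actual create_pretraining_data.py code: in Chinese whole word masking the '##' prefix is a temporary marker used to group the characters of a segmented word so they are masked together; it is stripped before the vocab lookup, so only the plain single-character entries are ever used and the '##' + CJK rows in vocab.txt stay untouched.

```python
def group_whole_words(segmented_words):
    """Tag non-initial characters of each segmented word with '##'."""
    tagged = []
    for word in segmented_words:
        for i, ch in enumerate(word):
            tagged.append(ch if i == 0 else "##" + ch)
    return tagged

def to_token_ids(tagged_tokens, vocab):
    """Strip the temporary '##' marker before the vocab lookup."""
    ids = []
    for tok in tagged_tokens:
        plain = tok[2:] if tok.startswith("##") else tok
        ids.append(vocab.get(plain, vocab["[UNK]"]))
    return ids

# Example: "掩码" is segmented as one word, so '码' is tagged '##码'
# for masking purposes, but its id is still that of the single character '码'.
vocab = {"[UNK]": 100, "全": 1, "词": 2, "掩": 3, "码": 4}
tagged = group_whole_words(["全词", "掩码"])
print(tagged)                        # ['全', '##词', '掩', '##码']
print(to_token_ids(tagged, vocab))   # [1, 2, 3, 4]
```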
@zhuyuuyuhz I have a similar question. Chinese has no subwords, so why does the vocab.txt in BERT's official Chinese pre-trained model include '##字'-style tokens? Tokenizing Chinese text with the tokenizer can never produce a '##' token. Has this been resolved?