songyingxin / Bert-TextClassification

Implementation of some baseline models on top of BERT for text classification

Handling of Chinese and English datasets #20

Open · wangguanhua opened this issue 4 years ago

wangguanhua commented 4 years ago

Hello, I see your datasets include both Chinese and English. But aren't the two tokenized differently? English uses WordPiece, while Chinese is split character by character, and I didn't see any handling of this in your code. Or have I misunderstood your code?

LittleSJL commented 3 years ago

"We use character-based tokenization for Chinese, and WordPiece tokenization for all other languages. Both models should work out-of-the-box without any code changes. We did update the implementation of BasicTokenizer in tokenization.py to support Chinese character tokenization, so please update if you forked it."

The above is the official explanation of the tokenizer from Google's BERT; you can look it up yourself for details. In short, BERT tokenizes Chinese and English alike through a single tokenizer, so you do not need separate handling for different languages.
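To see this behavior concretely, here is a minimal sketch. It is not from this repository (which ships Google's tokenization.py), and it assumes the HuggingFace transformers package and the public bert-base-multilingual-cased checkpoint: one BertTokenizer isolates every CJK character in its basic tokenization pass and then runs WordPiece on what remains, so Chinese and English text go through the same call.

```python
# Minimal sketch: one tokenizer handles both scripts. Assumes the
# HuggingFace "transformers" package; the checkpoint name is illustrative.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

# Chinese: the BasicTokenizer surrounds each CJK character with spaces,
# so every character becomes its own token, e.g. ['深', '度', '学', '习', ...]
print(tokenizer.tokenize("深度学习很有趣"))

# English: words are split into WordPiece subwords, e.g.
# ['Token', '##ization', ...] (exact splits depend on the vocabulary)
print(tokenizer.tokenize("Tokenization is straightforward"))
```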

wangguanhua commented 3 years ago

Thanks a lot!
