What is the pre-training dataset

iflytek / cino

CINO: Pre-trained Language Models for Chinese Minority (少数民族语言预训练模型)

http://cino.hfl-rc.com

Apache License 2.0


What is the pre-training dataset #29

Closed (ayaka14732 closed this issue 1 year ago)

ayaka14732 commented 1 year ago

The paper doesn't seem to mention what the pre-training dataset is.

ayaka14732 commented 1 year ago

The minority-language corpora are in-house data consisting of short monolingual sentences. The total corpus size is 28 GB. Statistics for the pre-training corpora are listed in Appendix A.