huawei-noah / Pretrained-Language-Model

Pretrained language model and its related optimization techniques developed by Huawei Noah's Ark Lab.

Reproducing TinyBERT requires the pre-training Wikipedia corpus; also, will a tinybert-cased model be released? #237

Open hppy139 opened 1 year ago

hppy139 commented 1 year ago

Hello,

The paper states: "For the general distillation, we set the maximum sequence length to 128 and use English Wikipedia (2,500M words) as the text corpus and perform the intermediate layer distillation for 3 epochs with the supervision from a pre-trained BERT BASE and keep other hyper-parameters the same as BERT pre-training (Devlin et al., 2019)." Could you provide a download link for this pre-training corpus?
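For context, this is roughly how I am preparing a corpus myself at the moment (a minimal sketch assuming the Hugging Face `datasets` and `nltk` packages and the usual one-sentence-per-line, blank-line-between-documents BERT format; I am not sure it matches the exact corpus or preprocessing used in the paper):

```python
# Sketch: dump English Wikipedia into one-sentence-per-line text with a
# blank line between documents, the format BERT-style pre-generation
# scripts commonly expect. The dataset name/config and the output format
# are my own assumptions, not necessarily what the authors used.
from datasets import load_dataset
from nltk.tokenize import sent_tokenize  # needs: nltk.download("punkt")

wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")

with open("wiki_corpus.txt", "w", encoding="utf-8") as f:
    for article in wiki:
        for sentence in sent_tokenize(article["text"]):
            sentence = sentence.strip()
            if sentence:
                f.write(sentence + "\n")
        f.write("\n")  # blank line separates documents
```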

In addition, during the pre-training stage, is it possible to leave the --do_lower_case flag of general_distill.py unset? I see that the vocab.txt of the released models is a lowercase vocabulary. Is there currently a trained, case-sensitive TinyBERT model (i.e. "tinybert-cased") available?
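For reference, a quick way to see why the flag matters (a sketch using the Hugging Face `transformers` tokenizer API, which may differ slightly from the tokenization code bundled in this repo): a cased vocabulary only preserves capitalization when lower-casing is turned off.

```python
# Sanity check: a cased vocab keeps capitalization only when
# do_lower_case=False; an uncased vocab lower-cases everything.
from transformers import BertTokenizer

cased = BertTokenizer.from_pretrained("bert-base-cased", do_lower_case=False)
uncased = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)

print(cased.tokenize("TinyBERT is Distilled"))    # capitalization preserved
print(uncased.tokenize("TinyBERT is Distilled"))  # everything lower-cased
```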

Thanks!