Hello,
The paper states: "For the general distillation, we set the maximum sequence length to 128 and use English Wikipedia (2,500M words) as the text corpus and perform the intermediate layer distillation for 3 epochs with the supervision from a pre-trained BERT BASE and keep other hyper-parameters the same as BERT pre-training (Devlin et al., 2019)." Could you provide a download link for this pre-training corpus?
Additionally, in the pre-training (general distillation) stage, is it acceptable to leave the --do_lower_case flag of general_distill.py unset? I noticed that the vocab.txt shipped with the released models is a lowercase vocabulary, so I would like to ask whether a trained case-sensitive TinyBERT model (i.e. a "tinybert-cased") is currently available.
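For reference, here is a minimal sketch of how I understand the flag to behave (assuming a HuggingFace-style BertTokenizer and an argparse-based CLI; the actual names inside general_distill.py may differ). My understanding is that omitting --do_lower_case only helps if the tokenizer is also pointed at a cased vocab.txt:

```python
# Minimal sketch (assumption): how a --do_lower_case flag typically gates
# tokenizer lowercasing in a BERT-style distillation script. The real
# general_distill.py may wire this up differently.
import argparse
from transformers import BertTokenizer  # assumed tokenizer implementation

parser = argparse.ArgumentParser()
parser.add_argument("--vocab_file", type=str, required=True)
parser.add_argument("--do_lower_case", action="store_true",
                    help="Lowercase input text; omit for a cased setup.")
args = parser.parse_args()

# If --do_lower_case is omitted, the text keeps its original casing, but this
# only makes sense when vocab_file comes from a cased BERT checkpoint
# (e.g. bert-base-cased), not from the released lowercase vocab.txt.
tokenizer = BertTokenizer(vocab_file=args.vocab_file,
                          do_lower_case=args.do_lower_case)
print(tokenizer.tokenize("TinyBERT keeps Case only with a cased vocab"))
```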
Thanks!