microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License
19.6k stars 2.5k forks

VLMO text only data, "wikibk.{index}.txt" #979

Closed: NieShenRuc closed 1 year ago

NieShenRuc commented 1 year ago

Model I am using: VLMO. I found that the text-only data is loaded from "wikibk.{index}.txt", where index = 0, 1, ..., 49. How can I get these .txt files?

wenhui0924 commented 1 year ago

Hi @NieShenRuc, you can follow BERT/UniLM/RoBERTa to process the text-only data (English Wikipedia and BookCorpus), and then split it into several smaller files. Alternatively, you can directly use our stage-2 model, which is pretrained on text-only data.
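The splitting step could look roughly like the sketch below. It is not the script used for VLMO. It assumes the corpus has already been preprocessed into one UTF-8 text file in which every line is a self-contained unit (for example one sentence or one document per line); the exact line format the VLMO loader expects is not stated in this thread. The function name `split_corpus` and its parameters are hypothetical; only the "wikibk.{index}.txt" naming and the 50 shards come from the question above.

```python
from pathlib import Path


def split_corpus(src, out_dir, num_shards=50, prefix="wikibk"):
    """Split a line-oriented text file into `num_shards` smaller files.

    Output files are named "{prefix}.{index}.txt" with index = 0..num_shards-1.
    Lines are assigned round-robin, so shard sizes differ by at most one line.
    Assumption: each input line is independent of its neighbours.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    paths = [out_dir / f"{prefix}.{i}.txt" for i in range(num_shards)]
    handles = [p.open("w", encoding="utf-8") for p in paths]
    try:
        with open(src, encoding="utf-8") as f:
            for n, line in enumerate(f):
                # Normalise the line ending so a final line without a
                # trailing newline does not merge with a later line.
                handles[n % num_shards].write(line.rstrip("\n") + "\n")
    finally:
        for h in handles:
            h.close()
    return paths
```

For example, `split_corpus("wiki_books.txt", "data/text_only")` (both paths are placeholders) would write `wikibk.0.txt` through `wikibk.49.txt` into `data/text_only`. If the preprocessed corpus stores documents as multi-line blocks, the lines of one document would need to be grouped before being assigned to a shard.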

NieShenRuc commented 1 year ago

Thanks for your reply!