microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

Pre-training Dataset #1064

Open det-tu opened 1 year ago

det-tu commented 1 year ago

Model I am using (UniLM, MiniLM, LayoutLM ...): VLMO/BEiTv3

Is there any chance you could share the pre-training datasets used in VLMO/BEiTv3 through Baidu Net Disk or Google Cloud? Many of the image URLs are inaccessible now. Thanks.

wenhui0924 commented 1 year ago

Hi, COCO and VG are easy to download. For SBU, CC3M and CC12M, you can refer to https://github.com/rom1504/img2dataset.
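
As a rough sketch of what an img2dataset download for CC3M could look like through its Python entry point (`img2dataset.download`): this is not the script the maintainers used. The file name `cc3m.tsv`, the output folder, the `url`/`caption` column names (the TSV needs a header row with those names), and the process, thread, and image-size values are all assumptions for illustration.

```python
def build_download_args(url_list, output_folder,
                        input_format="tsv",
                        url_col="url", caption_col="caption"):
    """Collect keyword arguments for img2dataset.download.

    All values here are illustrative assumptions, not settings
    confirmed by the VLMO/BEiT-3 authors.
    """
    return {
        "url_list": url_list,            # TSV with a header row (assumed)
        "input_format": input_format,
        "url_col": url_col,              # column holding the image URL
        "caption_col": caption_col,      # column holding the caption text
        "output_format": "files",        # one image + .txt + .json per sample
        "output_folder": output_folder,
        "processes_count": 16,           # tune to the machine
        "thread_count": 64,
        "image_size": 256,               # resize target, an assumption
    }


def run_download(args):
    """Run the download; img2dataset is imported lazily so the helper
    above can be inspected without the package installed."""
    from img2dataset import download
    download(**args)
```

To start a download, one would call `run_download(build_download_args("cc3m.tsv", "cc3m"))`. URLs that are no longer reachable are skipped, so the resulting set will be smaller than the original list.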

det-tu commented 1 year ago

Thanks~

det-tu commented 1 year ago

Could you share your download scripts for SBU, CC3M and CC12M? With https://github.com/rom1504/img2dataset, I cannot get my dataset into the format described in your README.
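
One possible way to bridge the gap, sketched under assumptions: if img2dataset was run with `output_format="files"`, each sample is stored as an image next to a `.txt` caption file, and those pairs can be collected into a single index. The JSON Lines layout below (`image_path`, `caption`) is a hypothetical intermediate format, not the format specified in the VLMO/BEiT-3 README; it would still need to be mapped onto whatever the README requires.

```python
import json
from pathlib import Path


def build_caption_index(image_root, index_path):
    """Pair each downloaded .jpg with its sibling .txt caption and write
    one JSON object per line. Field names are hypothetical.

    Returns the number of image-caption pairs written.
    """
    image_root = Path(image_root)
    count = 0
    with open(index_path, "w", encoding="utf-8") as out:
        for image_file in sorted(image_root.rglob("*.jpg")):
            caption_file = image_file.with_suffix(".txt")
            if not caption_file.exists():
                continue  # skip images without a caption file
            caption = caption_file.read_text(encoding="utf-8").strip()
            if not caption:
                continue  # skip empty captions
            record = {
                "image_path": image_file.relative_to(image_root).as_posix(),
                "caption": caption,
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For example, `build_caption_index("cc3m", "cc3m_index.jsonl")` would scan the `cc3m` folder and write the index next to it.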