microsoft / Oscar

Oscar and VinVL
MIT License

File 'coco_flickr30k_googlecc_gqa_sbu_oi.lineidx' is Not Found #185

Open lostnighter opened 2 years ago

lostnighter commented 2 years ago

Hi! This file is needed for pretraining on the large corpus, but it cannot be found. Could you share it?

Thanks!

jontooy commented 2 years ago

Hi lostnighter,

I had the same problem when using OSCAR to fine-tune on image captioning with a custom dataset. I used this function to generate the '.lineidx' file.

I guess that in your case you have a 'coco_flickr30k_googlecc_gqa_sbu_oi.tsv' file. If that is true, you should try the function above, with parameters:

`filein, idxout = 'coco_flickr30k_googlecc_gqa_sbu_oi.tsv', 'coco_flickr30k_googlecc_gqa_sbu_oi.lineidx'`

Let me know if it works!
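The linked function is not reproduced in this thread, so here is a minimal sketch of the idea. It assumes a '.lineidx' file simply lists the starting byte offset of each row of the '.tsv' file, one offset per line. The names `build_lineidx` and `read_row` are hypothetical and are not taken from the Oscar code base.

```python
import os


def build_lineidx(tsv_path, idx_path):
    """Write the starting byte offset of every line in tsv_path to idx_path."""
    tmp_path = idx_path + ".tmp"
    # Binary mode, so that offsets are true byte positions usable with seek().
    with open(tsv_path, "rb") as tsv, open(tmp_path, "w") as idx:
        offset = 0
        for line in tsv:
            idx.write(f"{offset}\n")
            offset += len(line)
    # Rename only after the index is fully written, so an interrupted run
    # never leaves a truncated index behind under the final name.
    os.replace(tmp_path, idx_path)


def read_row(tsv_path, idx_path, row):
    """Use the index to jump directly to one row and return its columns."""
    with open(idx_path) as idx:
        offsets = [int(value) for value in idx]
    with open(tsv_path, "rb") as tsv:
        tsv.seek(offsets[row])
        return tsv.readline().decode("utf-8").rstrip("\n").split("\t")
```

With the parameters above, the call would be `build_lineidx(filein, idxout)`. Reading a row back with `read_row` is a quick way to check that the generated offsets line up with the TSV rows.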

lostnighter commented 2 years ago

Hi jontooy, I downloaded this file via azcopy as follows: `path/to/azcopy copy https://biglmdiag.blob.core.windows.net/vinvl/pretrain_corpus/coco_flickr30k_googlecc_gqa_sbu_oi.lineidx ./ --recursive`

This URL is not given anywhere. I just tried it out.