microsoft / TAP

TAP: Text-Aware Pre-training for Text-VQA and Text-Caption, CVPR 2021 (Oral)
MIT License

Require OCR-CC information (image IDs) #5

Closed by prajwalgatti 2 years ago

prajwalgatti commented 2 years ago

Hello @zyang-ur and all,

Thanks for this work, it is quite interesting.

I'm trying to obtain the OCR-CC dataset, but due to my constraints I can't download the full 1.7 TB. However, I already have the CC dataset, so I could extract the subset of images that are in OCR-CC myself.

Could you please share the image IDs of CC that were used to construct OCR-CC?

Thanks in advance!

zyang-ur commented 2 years ago

Hi @prajwalgatti,

Thank you for your interest.

In this case, you could download only the index files with:

    path/to/azcopy copy https://tapvqacaption.blob.core.windows.net/data/data/imdb/cc /data --recursive

The "image_name" field of each entry in the index files is the ID of the corresponding CC image. Thank you.
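
For anyone who wants to turn the index files into a plain list of CC image IDs, here is a minimal Python sketch. It rests on assumptions that the thread does not confirm: that the index files are `.npy` files holding a pickled array of dict records, and that some records (such as a header entry) may lack "image_name". The function names `collect_image_names`, `load_index_file`, and `image_names_in_dir` are hypothetical helpers, not part of the TAP code base.

```python
from pathlib import Path
from typing import Iterable, Mapping, Set


def collect_image_names(records: Iterable[object]) -> Set[str]:
    """Return the distinct "image_name" values found in index records."""
    names: Set[str] = set()
    for record in records:
        # Skip entries that are not dict-like or lack the key
        # (for example, a metadata/header entry).
        if isinstance(record, Mapping) and "image_name" in record:
            names.add(str(record["image_name"]))
    return names


def load_index_file(path: Path) -> list:
    """Load one index file.

    ASSUMPTION: the file is a NumPy array of dicts saved with pickling.
    Only load pickled files from a source you trust.
    """
    import numpy as np  # imported lazily so the helper above needs no NumPy

    return list(np.load(path, allow_pickle=True))


def image_names_in_dir(index_dir: str) -> Set[str]:
    """Collect image names from every .npy file under index_dir."""
    names: Set[str] = set()
    for path in sorted(Path(index_dir).rglob("*.npy")):
        names |= collect_image_names(load_index_file(path))
    return names
```

Calling `image_names_in_dir` on the directory that the azcopy command wrote to would then give the set of IDs to match against a local copy of CC. If the index files turn out to use a different format, only `load_index_file` needs to change.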