microsoft / TAP

TAP: Text-Aware Pre-training for Text-VQA and Text-Caption, CVPR 2021 (Oral)
MIT License

Where are the text images in CC-OCR? #23

Open TongkunGuan opened 2 years ago

TongkunGuan commented 2 years ago

Hello! When I download from the OCR-CC Data (Huge, ~1.3T) link, I find that the CC-OCR dataset does not contain the text images. Where can I get these images?

zyang-ur commented 2 years ago

We uploaded the GCC index file at https://tapvqacaption.blob.core.windows.net/data/GoogleCC/Train_GCC-training.tsv

The first index in "ocr_feat/visu_featresx", before the "_", indicates the row number in the index file (both 0-indexed). E.g., "100000_1967358300" is row 100000 of the index file, the soccer match image.
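
For anyone scripting this lookup, here is a minimal sketch. It assumes the index file is a headerless, tab-separated file with a caption column and an image URL column (the usual CC3M layout, not confirmed in this thread). The helper names `row_index_from_name` and `lookup_rows` are made up for illustration and are not part of the TAP code.

```python
import csv
import os
from typing import Dict, Iterable, List


def row_index_from_name(feature_name: str) -> int:
    """Return the 0-indexed row number encoded before the first underscore.

    Example input: "100000_1967358300" (a directory prefix is tolerated).
    """
    base = os.path.basename(feature_name)
    prefix, sep, _rest = base.partition("_")
    if not sep or not prefix.isdigit():
        raise ValueError(f"unexpected feature name: {feature_name!r}")
    return int(prefix)


def lookup_rows(tsv_lines: Iterable[str], wanted: Iterable[int]) -> Dict[int, List[str]]:
    """Stream the index file once and collect the requested 0-indexed rows.

    ASSUMPTION: each line is tab-separated (caption, image URL), no header.
    """
    wanted_set = set(wanted)
    found: Dict[int, List[str]] = {}
    # QUOTE_NONE: captions may contain quote characters that are not CSV quoting.
    reader = csv.reader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for i, row in enumerate(reader):
        if i in wanted_set:
            found[i] = row
            if len(found) == len(wanted_set):
                break  # stop early once every requested row was seen
    return found


# Usage against a local copy of the index file (path is an assumption):
# with open("Train_GCC-training.tsv", newline="", encoding="utf-8") as f:
#     idx = row_index_from_name("100000_1967358300")
#     print(lookup_rows(f, [idx]).get(idx))
```

The file is streamed rather than loaded whole, so the lookup stays cheap even though the index has millions of rows.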

daeing commented 2 years ago

Is there another way to download the OCR-CC data, such as Google Drive? I cannot download the dataset reliably from my region. Many thanks.

zyang-ur commented 2 years ago

Unfortunately, the CC3M dataset does not allow sharing raw images due to copyright issues. If you have a copy of CC3M images, it should cover all images in OCR-CC. There are also various online tools for CC3M downloading, which might solve/alleviate the network issue.

daeing commented 1 year ago

Hello! When I download from the OCR-CC Data (Huge, ~1.3T) link, I find that the CC-OCR dataset does not contain the text images. Where can I get these images?

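Hey, could you share the dataset you downloaded? I followed the azcopy instructions he provided, but the download keeps failing. Could you share a Baidu Netdisk link? Thanks a lot!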