Open TongkunGuan opened 2 years ago
We uploaded the GCC index file at https://tapvqacaption.blob.core.windows.net/data/GoogleCC/Train_GCC-training.tsv
The first index in "ocr_feat/visu_feat_resx" before "_" indicates the row number in the index file (both 0-indexed). E.g., "100000_1967358300" refers to row 100000 of the index file, which is the soccer match image.
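The mapping described above can be sketched in Python as follows. This is only an illustration, not code from the repository: the helper names `row_index_from_name` and `lookup_row` are hypothetical, and it assumes the index file is a header-less, tab-separated file in which each line describes one image.

```python
import csv
from itertools import islice
from typing import List


def row_index_from_name(feature_name: str) -> int:
    """Return the 0-indexed row number encoded before the first "_"."""
    return int(feature_name.split("_", 1)[0])


def lookup_row(tsv_path: str, row_index: int) -> List[str]:
    """Return the fields of the given 0-indexed row of the index file.

    Assumption: the file has no header line and is tab-separated.
    """
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quote characters inside captions as plain text.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        row = next(islice(reader, row_index, None), None)
    if row is None:
        raise IndexError(f"row {row_index} is past the end of {tsv_path}")
    return row
```

For example, `lookup_row("Train_GCC-training.tsv", row_index_from_name("100000_1967358300"))` would return the fields stored on row 100000 of a locally downloaded copy of the index file.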
Is there another way to download the OCR-CC Data, such as Google Drive? I cannot download the dataset reliably from my region. Many thanks.
Unfortunately, the CC3M dataset does not allow sharing raw images due to copyright issues. If you have a copy of CC3M images, it should cover all images in OCR-CC. There are also various online tools for CC3M downloading, which might solve/alleviate the network issue.
Hello! When I try to download from the "OCR-CC Data (Huge, ~1.3T)" link, I find that the CC-OCR dataset does not contain the text images. Where can I get these images?
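Hey, could you share the copy of this dataset that you downloaded? I have been following the azcopy instructions they provided, but the download keeps failing. Could you share a Baidu Netdisk link? Many thanks.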