google-research-datasets / conceptual-captions

Conceptual Captions is a dataset containing (image-URL, caption) pairs designed for the training and evaluation of machine learned image captioning systems.

A lot of the links from the training/validation set do not exist or cannot be read. #5

Closed karansomaiah closed 5 years ago

karansomaiah commented 6 years ago

Hey! I recently started working on the competition. Thank you so much to the Google AI Research team for open sourcing data sets like this for us to use and learn from. While going through the data set (both train and validation), I found that some of the links do not exist or for some reason cannot be read. A lot of the URLs can be parsed but cannot be read using Image from the pillow package in Python. I will post my script and some of its output to give everyone a better idea of what I am seeing. Please feel free to correct me if I am doing something wrong. Hope this helps anyone else facing these errors too.

I am using the requests library to fetch the URLs and the pillow library to open the images they return. The code is very primitive since I am still in the exploration stage: it just tries to open each image and counts the ones that fail.

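from io import BytesIO

from PIL import Image
import requests

train_file = 'data/Train%2FGCC-training.tsv'  # train file: caption<TAB>url per line
with open(train_file, 'r', encoding='utf-8') as f:
    train_read = f.readlines()

sample_train = train_read[:10000]

# map each image URL (second column) to its caption (first column)
train_map = {}
for line in sample_train:
    fields = line.rstrip("\n").split("\t")
    train_map[fields[1]] = fields[0]
links = list(train_map)

not_read = 0  # keep a count of images that were not possible to read

# loop over the links and read whichever possible
for link in links:
    try:
        response = requests.get(link, timeout=10)  # don't hang on dead hosts
        response.raise_for_status()  # count 4xx/5xx responses as unreadable
        im = Image.open(BytesIO(response.content))
    except Exception:  # network error, HTTP error, or not a decodable image
        print(link)
        not_read += 1

print(not_read, "of", len(links), "links could not be read")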

Here are some of the links that did not work.

https://cdn.mantelligence.com/wp-content/uploads/2017/08/Questions-to-Ask-a-Girl-to-Get-to-Know-Her-What-do-you-want-most-out-of-life.jpg
http://duro6.com/weather/images/gallery3_lightning_rainbow_shot.jpg
http://image.dailyfreeman.com/storyimage/DF/20170505/NEWS/170509808/AR/0/AR-170509808.jpg&maxh=400&maxw=667
https://cdn.bravehunters.com/wp-content/uploads/2017/09/Guide-to-Living-in-a-Tent-800x416.jpg
http://www.saltandpinephoto.com/wp-content/uploads/2016/06/Bride-and-Groom-Walking-through-the-Forest.jpg
http://blog.visitmo.com/wp-content/uploads/2014/03/12506026093_092d091fc2_b.jpg
https://lynismael.com/wp-content/uploads/2014/07/Belwood-Lake-Conservation-wedding-sara-ayron-_0011(pp_w768_h534).jpg
http://www.eurasianet.org/sites/default/files/imagecache/galleria_fullscreen/060613_0.jpg
https://www.bailiwickexpress.com/files/cache/88ec9331c05013c55b49024a551341ac_f587432.jpg
https://i2-prod.mirror.co.uk/incoming/article1443634.ece/ALTERNATES/s615/%C2%A3%C2%A3%C2%A3%20%20Police%20car%20driving%20straight%20into%20a%20road%20of%20freshly%20layed%20cement
http://www.nerjarob.com/nature/wp-content/uploads/Cormorants-in-tree-sized.jpg
http://grantbaldwin.com/wp-content/uploads/2015/11/ScottVoelker.jpeg
https://drawinglics.com/view/186698/how-to-draw-flowers-and-leaves-in-a-vase-9-steps-with-pictures-image-titled-draw-flowers-and-leaves-in-a-vase-step-9bullet1.jpg

From a sample of 10000, I was able to get at least 51 links that did not work. Looking forward to hearing more from you guys. Thanks!
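For anyone who wants to know why a link fails rather than just that it fails, here is a minimal sketch that sorts URLs into rough buckets. It is not the script from this post: it uses only the standard library, and it swaps Pillow's decoder for a cheap check of the leading "magic" bytes, which is an assumption and will miss formats not listed. The names classify, summarize, fetch_with_urllib and IMAGE_SIGNATURES are all made up for illustration.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple
import urllib.error
import urllib.request

# Leading bytes of JPEG, PNG and GIF files. This is a cheap stand-in for
# Pillow's decoder (an assumption of this sketch, not what the post used).
IMAGE_SIGNATURES = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n", b"GIF87a", b"GIF89a")

# A fetcher takes a URL and returns (HTTP status, first bytes of the body).
Fetcher = Callable[[str], Tuple[int, bytes]]


def fetch_with_urllib(url: str, timeout: float = 10.0) -> Tuple[int, bytes]:
    """Fetch only the first few bytes of url with the standard library."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read(16)
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses: report the status code instead of raising.
        return err.code, b""


def classify(url: str, fetch: Fetcher = fetch_with_urllib) -> str:
    """Return 'ok', 'http_error', 'not_an_image' or 'unreachable' for url."""
    try:
        status, head = fetch(url)
    except Exception:
        # DNS failure, timeout, refused connection, malformed URL, ...
        return "unreachable"
    if status != 200:
        return "http_error"
    if not head.startswith(IMAGE_SIGNATURES):
        # e.g. an HTML "page not found" body served with status 200
        return "not_an_image"
    return "ok"


def summarize(urls: Iterable[str], fetch: Fetcher = fetch_with_urllib) -> Counter:
    """Count how many URLs fall into each bucket."""
    return Counter(classify(url, fetch) for url in urls)
```

Calling summarize(links) on the list of URLs gives a Counter of the four buckets. The fetch parameter is there so the network call can be replaced, for example with a version built on requests.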

sharma-piyush commented 5 years ago

Hi, the owner of an image might choose to remove it at any time, so we do expect to lose some train/dev images over time. But that should be a very small fraction (approx. 0.5% in your case). Given that we have over 3M images for training, this should not be a problem. However, the test set for Conceptual Captions (hosted on the competition server) is fixed and will not vary over time.
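The 0.5% figure follows from the numbers reported in the original post (51 unreadable links out of a 10,000-row sample). A small sketch of that arithmetic, with a rough projection onto 3M training images; the projection assumes the first 10,000 rows are representative of the whole file, which nothing in this thread establishes, and failure_rate is a name made up for illustration.

```python
def failure_rate(failed: int, sampled: int) -> float:
    """Fraction of sampled links that could not be read."""
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    return failed / sampled


rate = failure_rate(51, 10_000)        # numbers from the original post
projected = round(rate * 3_000_000)    # assumes the sample is representative

print(f"{rate:.2%} of sampled links failed")   # 0.51% of sampled links failed
print(f"about {projected} of 3M images if that rate held")
```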

karansomaiah commented 5 years ago

Hi! Thank you so much for your response. It's awesome that you have everything covered on your side. I look forward to some amazing insights and results from this dataset. Thanks once again for clearing this up.

Gyubin commented 4 years ago

Hi @sharma-piyush, I tried to download the whole CC dataset using the VL-BERT authors' script, but I could only get 630k images, which is about 20% of the total. Is there any way to download all 3.3M images for research purposes? Thanks for reading.

Best regards, Gyubin Son

gsrivas4 commented 4 years ago

I also tried downloading with the VL-BERT script, and I could only download 340k images. @Gyubin, were you able to download the majority of the images? If you have a link to the 630k images that you did manage to download, that would be great.