salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence
BSD 3-Clause "New" or "Revised" License

sbu caption dataset format #44

Open 1024er opened 1 year ago

1024er commented 1 year ago

sbu.json is organized in the format: [{'image': '4385058960_b0f291553e.jpg', 'caption': 'a wooden chair in the living room', 'url': 'http://static.flickr.com/2723/4385058960_b0f291553e.jpg'}, ...]

but the downloaded sbu_images.rar extracts to directories 0000/ 0001/ 0002/ 0003/ ... 0999/, where each directory contains 1000 images named sequentially: 000.jpg 001.jpg 002.jpg ... 999.jpg

Therefore, the image paths on disk do not correspond to the 'image' filenames in the JSON. @dxli94
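For concreteness: if the archive simply enumerates the annotation records in order (an unverified assumption on my part), the mapping from record index to extracted path would look like this sketch; the function name is hypothetical:

import os

def extracted_path(index, root='sbu_images'):
    # Map the index of a record in sbu.json to the extracted layout,
    # assuming images were archived in the same order as the records.
    subdir = '%04d' % (index // 1000)       # 0000/ ... 0999/
    filename = '%03d.jpg' % (index % 1000)  # 000.jpg ... 999.jpg
    return os.path.join(root, subdir, filename)

# e.g. record 123456 -> sbu_images/0123/456.jpg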

dxli94 commented 1 year ago

Hi, @1024er,

Thanks for raising this. It definitely needs fixing; I'll work on it this week.

Thanks.

1024er commented 1 year ago

> Hi, @1024er,
>
> Thanks for raising this. It definitely needs fixing; I'll work on it this week.
>
> Thanks.

Has it been fixed? Thank you ~

dxli94 commented 1 year ago

Hi @1024er ,

It turns out the SBU caption annotations do not match the image directory structure in the archive.

I have now updated the download script to fetch images directly from their URLs. Though I wouldn't be surprised if some URLs go dead over time, as they tend to.
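For reference, a rough downloader sketch along those lines (assuming the sbu.json format quoted above; the paths and the use of requests are illustrative, not the actual LAVIS script):

import json
import os

import requests

with open('sbu_captions/annotations/sbu.json') as f:
    anns = json.load(f)

os.makedirs('sbu_captions/images', exist_ok=True)

for ann in anns:
    out_path = os.path.join('sbu_captions/images', ann['image'])
    if os.path.exists(out_path):
        continue  # already fetched
    try:
        resp = requests.get(ann['url'], timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        continue  # dead or unreachable URL; skip it
    with open(out_path, 'wb') as out:
        out.write(resp.content)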

Let me know how it works.

Thanks.

xinbowu2 commented 1 year ago

Hi, I tried the new annotation file, but I still found that a lot of images were missing. I am wondering if there is a script to generate an annotation file based on the available images.

slyviacassell commented 1 year ago

> Hi, I tried the new annotation file, but I still found that a lot of images were missing. I am wondering if there is a script to generate an annotation file based on the available images.

I've encountered the same issue. Would you mind providing the processed images via Google Drive? @dxli94

slyviacassell commented 1 year ago

> > Hi, I tried the new annotation file, but I still found that a lot of images were missing. I am wondering if there is a script to generate an annotation file based on the available images.
>
> I've encountered the same issue. Would you mind providing the processed images via Google Drive? @dxli94

Here is a processing script that filters out the invalid records from the SBU captions annotation file:

import json
import os

import tqdm

def check_file_exists(filename, path):
    return os.path.exists(os.path.join(path, filename))

valid_records = []
nonvalid_records = []

# Split the annotations by whether the referenced image exists on disk.
with open('sbu_captions/annotations/sbu.json', 'r') as f:
    dset = json.load(f)

for ann in tqdm.tqdm(dset):
    if check_file_exists(ann['image'], 'sbu_captions/images'):
        valid_records.append(ann)
    else:
        nonvalid_records.append(ann)

print('not valid records', len(nonvalid_records), 'valid records', len(valid_records))

print('saving valid')
with open('sbu_captions/annotations/sbu_valid.json', 'w') as f:
    json.dump(valid_records, f)

print('saving nonvalid')
with open('sbu_captions/annotations/sbu_nonvalid.json', 'w') as f:
    json.dump(nonvalid_records, f)
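With sbu_valid.json written, pointing the SBU annotation path at it instead of sbu.json should let loading proceed using only the images that actually exist on disk.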