salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method
BSD 3-Clause "New" or "Revised" License

support other visual grounding datasets? #112

Open · PaulTHong opened this issue 1 year ago

PaulTHong commented 1 year ago

Hey, you conduct the Visual Grounding experiment on RefCOCO+. Have you tried other datasets such as RefCOCO or RefCOCOg? If I wanted to do this, how could I get the data? In your release, only the json files for RefCOCO+ are provided. Did you generate these json files yourselves, or were they downloaded from somewhere else? (I notice the data format of your ALBEF VG is not the same as TransVG's.) Thank you very much. Looking forward to your reply.

LiJunnan1992 commented 1 year ago

Our refcoco+ annotations are converted from the official annotations: https://github.com/lichengunc/refer. We only use image-text pairs during training. During inference, we use GradCAM to rank the proposals provided by https://github.com/lichengunc/MAttNet.
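
To illustrate the ranking step: the GradCAM heatmap for an image-text pair is used to score every proposal box, and the best-scoring box is taken as the grounding result. A minimal sketch of that idea (an illustration under assumptions, not ALBEF's released code; the function name and box format are hypothetical):

import numpy as np

def rank_proposals(gradcam, boxes):
    # gradcam: (H, W) non-negative heatmap for one image-text pair (hypothetical input)
    # boxes: list of (x0, y0, x1, y1) proposal boxes in pixel coordinates
    scores = []
    for (x0, y0, x1, y1) in boxes:
        region = gradcam[int(y0):int(y1), int(x0):int(x1)]
        scores.append(region.sum() / max(region.size, 1))  # mean activation inside the box
    return np.argsort(scores)[::-1]  # proposal indices, best first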

PaulTHong commented 1 year ago

Thank you for your reply! I still have some questions about the dataset to consult you on. I conduct VG experiments with the dataset provided following TransVG's guidance https://github.com/djiajunustc/TransVG/blob/main/docs/GETTING_STARTED.md, which is downloaded from https://drive.google.com/file/d/1fVwdDvXNbH8uuq_pHD_o5HI7yqeuz0yS/view. The downloaded data includes unc+_train.pth, corpus.pth, etc.; it seems similar to your converted refcoco+_train.json, etc.

Since you only provide the json files for RefCOCO+, and I now want to experiment with RefCOCO etc., do I just need to convert the files as you did for RefCOCO+? I can't open the data link in https://github.com/lichengunc/refer.

So my request is: could you please provide the conversion script, or the converted files for RefCOCO, so that I can try RefCOCO and RefCOCOg directly? Thank you very much! I hope I have stated my question clearly.

PaulTHong commented 1 year ago

Hello, could you respond to the above question? Thank you very much!

LiJunnan1992 commented 1 year ago

Here is a code snippet I used for data conversion:

# Assumes `refer` is an instance of the REFER API from
# https://github.com/lichengunc/refer, loaded with the desired dataset.
import os

split = 'train'
ref_ids = refer.getRefIds(split=split)

annotations = []

# model input resolution and patch grid
dim_w, dim_h = 384, 384
patch_size = 32
n_patch_w, n_patch_h = dim_w // patch_size, dim_h // patch_size

for ref_id in ref_ids:
    ref = refer.Refs[ref_id]
    image = refer.Imgs[ref['image_id']]

    width, height = image['width'], image['height']
    w_step = width / n_patch_w
    h_step = height / n_patch_h
    patch_area = height * width / (n_patch_w * n_patch_h)

    # binary segmentation mask of the referred object
    mask = refer.getMask(ref)['mask']

    # fraction of each patch cell covered by the mask (row-major order)
    patch = []
    for i in range(n_patch_h):
        for j in range(n_patch_w):
            y0 = max(0, round(i * h_step))
            y1 = min(height, round((i + 1) * h_step))
            x0 = max(0, round(j * w_step))
            x1 = min(width, round((j + 1) * w_step))
            submask = mask[int(y0):int(y1), int(x0):int(x1)]
            patch.append(submask.sum() / patch_area)

    text = [sentence['sent'] for sentence in ref['sentences']]
    imgPath = os.path.join('/export/share/datasets/vision/coco/images/train2014', image['file_name'])
    annotation = {'image': imgPath, 'text': text, 'patch': patch, 'type': 'ref', 'ref_id': ref['ref_id']}
    annotations.append(annotation)
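
To actually run the snippet above, `refer` needs to be instantiated from the refer repo and the result written to json. A minimal sketch, assuming the REFER constructor from https://github.com/lichengunc/refer and a placeholder data root and output filename:

import json
from refer import REFER

refer = REFER('/path/to/refer/data', dataset='refcoco', splitBy='unc')

# ... run the conversion loop above to build `annotations` ...

with open('refcoco_train.json', 'w') as f:
    json.dump(annotations, f)
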
PaulTHong commented 1 year ago

Got it. Thank you very much! I will give it a try!

PaulTHong commented 1 year ago

Hey, I have a small problem to bother you with. I find four files in your data/refcoco+ subfolder: cocos.json, dets.json, instances.json, refs(unc).p. In the files provided by the TransVG codebase, I find two similar files for refcoco and refcoco+: instances.json and refs(unc).p, but cocos.json and dets.json are missing. Where do the latter two files come from? Or do refcoco and refcoco+ share the same cocos.json and dets.json? I let refcoco share the same files and finished the data conversion; training is OK, but at one step of eval it raises a "ref_id key error". Sorry to bother you again. Thank you very much.