thunlp / PEVL

Source code for EMNLP 2022 paper “PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models”
MIT License

Reproducing the Phrase Grounding task #16

Open sweetdream33 opened 1 year ago

sweetdream33 commented 1 year ago

Hi, thanks for sharing the code of your interesting work.

  1. I want to reproduce the phrase grounding task. When I ran the following command on the Flickr dataset, I encountered the following error. The Flickr JSON file does not have keys such as tokens_positive or not_crop_bbox_list. How can I resolve this issue?

python -m torch.distributed.launch --nproc_per_node=8 --master_port=12451 --use_env run_grounding_train.py --train 1 --pretrain 0 --test_dataset flickr --config ./configs/visual_grounding.yaml --output_dir ./output/phrase_grounding --checkpoint grounding.pth --eval_step 500

(Two screenshots of the error were attached: image-2023-10-4_14-13-56, image-2023-10-4_14-9-29)

  1. In flicker.json:

    {"file_name": "flickr30k_images/flickr30k_images/1000092795.jpg", "text_type": "caption", "height": 500, "width": 333, "pseudo_caption": "Two young guys with shaggy hair look at their hands @@ [pos_242] [pos_188] [pos_302] [pos_229] while hanging out in the yard .", "normal_caption": "Two young guys with shaggy hair look at their hands while hanging out in the yard .", "bbox": [158.0, 184.0, 40.0, 41.0], "bbox_list": [[158.0, 184.0, 40.0, 41.0]]},


    What is the meaning of '@@ [pos_242] [pos_188] [pos_302] [pos_229]'? If I want to fine-tune on my custom dataset, I need to create a JSON file that follows the same input format, right?

  2. In Refcoco.json, what do not_crop_bbox_list, positive token, and negative token mean? If I want to fine-tune on my custom dataset, I need to create a JSON file that follows the same input format, right?
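
For the flicker.json entry quoted above, here is a quick check I tried on the position tokens. It is only a sketch based on that single sample: I am assuming that bbox is [x, y, w, h] in pixels, that coordinates are quantized into 512 bins, and that the bottom-right corner is taken as (x + w - 1, y + h - 1). None of this is confirmed by the repository; the helper names bbox_to_pos_tokens and parse_pos_tokens are hypothetical. Under those assumptions the four tokens in pseudo_caption match the box corners (x1, y1, x2, y2).

```python
import math
import re

NUM_BINS = 512  # assumption: inferred from the sample entry, not confirmed by the repo


def bbox_to_pos_tokens(bbox, width, height, num_bins=NUM_BINS):
    """Map an [x, y, w, h] pixel box to four [pos_k] tokens (x1, y1, x2, y2)."""
    x, y, w, h = bbox
    # Assumption: the bottom-right corner is inclusive, hence the "- 1".
    x2, y2 = x + w - 1, y + h - 1
    coords = [(x, width), (y, height), (x2, width), (y2, height)]
    # Normalize each coordinate by the image size, then floor into a bin index.
    return [f"[pos_{math.floor(c / dim * num_bins)}]" for c, dim in coords]


def parse_pos_tokens(caption):
    """Return the integer bin indices of all [pos_k] tokens in a caption."""
    return [int(n) for n in re.findall(r"\[pos_(\d+)\]", caption)]


# Values copied from the flicker.json entry in this issue.
sample = {
    "height": 500,
    "width": 333,
    "pseudo_caption": (
        "Two young guys with shaggy hair look at their hands "
        "@@ [pos_242] [pos_188] [pos_302] [pos_229] "
        "while hanging out in the yard ."
    ),
    "bbox": [158.0, 184.0, 40.0, 41.0],
}

tokens = bbox_to_pos_tokens(sample["bbox"], sample["width"], sample["height"])
print(tokens)  # ['[pos_242]', '[pos_188]', '[pos_302]', '[pos_229]']
print(parse_pos_tokens(sample["pseudo_caption"]))  # [242, 188, 302, 229]
```

If this reading is right, a custom dataset entry would need the same fields, with the tokens generated from each box. I would still appreciate confirmation of the exact quantization and of what '@@' marks.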

Thank you so much!