thunlp / CPT

Colorful Prompt Tuning for Pre-trained Vision-Language Models

Zero-shot inference my own images #2

Open insundaycathy opened 2 years ago

insundaycathy commented 2 years ago

Thank you for this amazing work. I was wondering, is there a way to use CPT on my own data? There are a few points in the .json file and code that I don't understand. (1) Does CPT require labeled object bounding boxes for prediction, or can it work on unlabeled images? (2) What does the 'rle' field in the segs.json file mean?

waxnkw commented 2 years ago

Thanks for your interest! You can use CPT on your own data. (1) Yes. In fact, for visual grounding we do not use the labeled object boxes but the detected object proposals from MAttNet. We use these only for a fair comparison with previous methods, because many of them use them for evaluation. You can instead use the object bounding boxes generated by VinVL's detector, which is what our prompt_feat code does. (2) RLE (run-length encoding) is a widely adopted format for storing segmentation masks. You can refer to https://github.com/cocodataset/cocoapi/issues/184 for more information.
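
For reference, a minimal sketch of reading such an RLE mask with pycocotools; the segs.json layout and key names used here are assumptions, so adapt them to the actual file:

```python
import json
from pycocotools import mask as mask_utils

# Load one entry from a segs.json-style file (hypothetical path and keys).
with open("segs.json") as f:
    segs = json.load(f)

rle = segs[0]["rle"]                    # e.g. {"size": [H, W], "counts": "..."}
rle["counts"] = rle["counts"].encode()  # pycocotools expects bytes for compressed RLE
binary_mask = mask_utils.decode(rle)    # uint8 array of shape (H, W), values 0/1
print(binary_mask.shape, binary_mask.sum())
```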

insundaycathy commented 2 years ago

Thank you so much for the prompt reply, it was very helpful.

insundaycathy commented 2 years ago

Quick question: I've now created a split/mydata.json file with the img_name, img_id, caption, height, and width, and I've also processed an object.json file with the code provided in prompt_feat, which includes the detected boxes for each image. Which code should I use to run CPT visual grounding inference on my data? Thanks
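
For reference, an entry in such a split file might look like the sketch below; the exact schema the loaders expect is an assumption on my part:

```python
import json

# One entry per image/query, using the field names mentioned above.
entries = [
    {
        "img_name": "my_image_0001.jpg",
        "img_id": 1,
        "caption": "the man in the red shirt on the left",
        "height": 480,
        "width": 640,
    },
]

with open("split/mydata.json", "w") as f:
    json.dump(entries, f, indent=2)
```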

waxnkw commented 2 years ago

(1) For the feature-extraction part, please add a config file under prompt_feat/data/refcoco/yamls/ in a similar format to refcoco_val.yaml. Then go to prompt_feat/cmds/refcoco/cpt and modify the DATA_DIR attribute to point to your config file. Finally, go to the prompt_feat dir and run bash cmds/refcoco/cpt/your_script.sh. Btw, please make sure the object bounding box format is converted to [x0, y0, w, h]; the detector might output a [x0, y0, x1, y1] format, as sketched below.
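
For example, the conversion can be done with a small helper like this (a sketch, assuming plain Python lists of boxes):

```python
def xyxy_to_xywh(box):
    """Convert a [x0, y0, x1, y1] box to [x0, y0, w, h]."""
    x0, y0, x1, y1 = box
    return [x0, y0, x1 - x0, y1 - y0]

boxes = [[10, 20, 110, 220], [50, 60, 150, 160]]  # detector output, [x0, y0, x1, y1]
print([xyxy_to_xywh(b) for b in boxes])           # [[10, 20, 100, 200], [50, 60, 100, 100]]
```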

(2) For the Oscar inference, please go to Oscar/cmds/refcoco/zsl/refcoco.sh and modify --test_dir to point to the features extracted in the last step. Because I have not adapted oscar/zeroshot/refcoco_cpt.py for inference on custom images, there might be some errors, but I think it will not be too difficult to debug.

If you run into any problems, feel free to post them in this issue.

waxnkw commented 2 years ago

A segmentation mask is one of the choices. Using bounding boxes is also fine, which corresponds to CPT-Blk in Table 1.

VinVL's code does not provide a segmentation function. If you want to generate segmentation masks, you can use other tools such as https://github.com/facebookresearch/maskrcnn-benchmark.
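
If you do generate masks yourself, they can be converted to the RLE format mentioned earlier with pycocotools, e.g. the sketch below; the exact segs.json schema is an assumption:

```python
import numpy as np
from pycocotools import mask as mask_utils

# A binary mask, e.g. the output of a segmentation model, as an (H, W) uint8 array.
binary_mask = np.zeros((480, 640), dtype=np.uint8)
binary_mask[100:200, 150:300] = 1

# pycocotools expects a Fortran-ordered array; encode() returns {"size", "counts"}.
rle = mask_utils.encode(np.asfortranarray(binary_mask))
rle["counts"] = rle["counts"].decode("ascii")  # make the RLE JSON-serializable
print(rle["size"], len(rle["counts"]))
```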