insundaycathy opened this issue 2 years ago

Thank you for this amazing work. I was wondering, is there a way to use CPT on my own data? There are a few points in the .json files and code that I don't understand. (1) Does CPT require labeled object bounding boxes for prediction, or can it work on unlabeled images? (2) What does the 'rle' field in the segs.json file mean?

Thanks for your interest! You can use CPT on your own data. (1) Yes, it can work on unlabeled images. In fact, for visual grounding it is not the labeled object boxes that are used, but the object proposals detected by MAttNet. We use these only for a fair comparison with previous methods, because many of them use the same proposals for evaluation. You can instead use the object bounding boxes generated by VinVL's detector, which is what our prompt_feat code does. (2) RLE (run-length encoding) is a widely adopted format for storing segmentation masks. You can refer to https://github.com/cocodataset/cocoapi/issues/184 for more information.
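For reference, here is a minimal sketch of round-tripping a mask through RLE with pycocotools (the library behind the linked issue); the mask shape and contents below are made up for illustration:

```python
import numpy as np
from pycocotools import mask as mask_utils

# Toy binary mask (values made up); pycocotools expects uint8 in Fortran order.
binary_mask = np.zeros((480, 640), dtype=np.uint8)
binary_mask[100:200, 150:300] = 1

# Encode to RLE: a dict with 'size' ([height, width]) and 'counts' (bytes).
rle = mask_utils.encode(np.asfortranarray(binary_mask))

# Decode back and check the round trip.
decoded = mask_utils.decode(rle)
assert (decoded == binary_mask).all()

# For JSON storage (as in segs.json), 'counts' is usually kept as a utf-8 string.
rle["counts"] = rle["counts"].decode("utf-8")
```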
Thank you so much for the prompt reply, it was very helpful.
Quick question. I've now created a split/mydata.json file with the img_name, img_id, caption, height and width. I've also produced an object.json file with the code provided in prompt_feat, which includes the detected boxes for each image. Which code should I use to run CPT visual grounding inference on my data? Thanks!
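For reference, here is roughly how I build split/mydata.json — a sketch with the field names from above and made-up values; the exact schema may differ from what the repo expects:

```python
import json

# One entry per image; all values here are illustrative. Check the layout
# against the repo's own split files (e.g. the refcoco splits).
entries = [
    {
        "img_name": "my_image_001.jpg",
        "img_id": 1,
        "caption": "the dog on the left",
        "height": 480,
        "width": 640,
    }
]

with open("split/mydata.json", "w") as f:
    json.dump(entries, f)
```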
(1) For the feature extraction part, please add a config file under prompt_feat/data/refcoco/yamls/ in a similar format to refcoco_val.yaml. Then, go to prompt_feat/cmds/refcoco/cpt and modify the DATA_DIR attribute to point to your config file. Finally, go to the prompt_feat directory and run bash cmds/refcoco/cpt/your_script.sh.
Btw, please make sure the object bounding boxes are converted to [x0, y0, w, h] format; the detector might output [x0, y0, x1, y1] instead. A quick sketch of the conversion follows.
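For example (xyxy_to_xywh is a hypothetical helper name, not part of the repo):

```python
def xyxy_to_xywh(box):
    # Convert a [x0, y0, x1, y1] corner-format box (common detector output)
    # to the [x0, y0, w, h] format expected here.
    x0, y0, x1, y1 = box
    return [x0, y0, x1 - x0, y1 - y0]

# e.g. [150, 100, 300, 200] -> [150, 100, 150, 100]
assert xyxy_to_xywh([150, 100, 300, 200]) == [150, 100, 150, 100]
```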
(2) For the Oscar inference, please go to Oscar/cmds/refcoco/zsl/refcoco.sh and modify the --test_dir argument to point to the features extracted in the last step. Since I have not adapted oscar/zeroshot/refcoco_cpt.py for inference on custom images, there might be some errors, but I think they will not be too difficult to debug.
If you run into any problems, feel free to post them in this issue.
A segmentation mask is one of the choices. Using bounding boxes is also OK; that corresponds to CPT-Blk in Table 1.
VinVL's code does not provide a segmentation function. If you want to generate segmentation masks, you should seek help from other tools like https://github.com/facebookresearch/maskrcnn-benchmark.
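If it helps, here is a minimal sketch using torchvision's pretrained Mask R-CNN as one possible mask generator. Note this is a generic stand-in, not the maskrcnn-benchmark pipeline linked above, and the image path is illustrative:

```python
import torch
import torchvision

# Generic pretrained instance-segmentation model; a stand-in for whatever
# detector/segmenter you end up using.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Load an image as a float CHW tensor in [0, 1] (path is illustrative).
image = torchvision.io.read_image("my_image_001.jpg").float() / 255.0

with torch.no_grad():
    outputs = model([image])

# Threshold the soft masks (N x 1 x H x W) into binary masks, keeping only
# confident detections; 0.7 and 0.5 are common but arbitrary cutoffs.
keep = outputs[0]["scores"] > 0.7
binary_masks = (outputs[0]["masks"][keep, 0] > 0.5).to(torch.uint8)
```

The resulting binary masks can then be RLE-encoded with pycocotools as in the earlier snippet.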