linhuixiao / CLIP-VG

[TMM 2023] Self-paced Curriculum Adapting of CLIP for Visual Grounding.
https://github.com/linhuixiao/CLIP-VG
Apache License 2.0

Testing the model #11

Open pyzone49 opened 6 months ago

pyzone49 commented 6 months ago

Is there a way to test the model directly on a single image, using a CPU and a trained model? Please let me know.

linhuixiao commented 6 months ago

@pyzone49 Hi, running inference on a single image with one GPU or a CPU is actually not difficult. You just need to simplify the eval.py code in my project repository. Specifically, extract the image reading and tensorizing part of the build_dataset function. Then feed the pre-processed image to the pretrained model with gradient updates disabled, following the no-grad mode used in engine.py; the model then outputs the normalized coordinates of the grounding box. After that, de-normalize these coordinates to obtain a bounding box (x, y, w, h) at the original image size, which can be displayed with a visualization script based on matplotlib.
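To make the "read and tensorize" step more concrete, here is a minimal, dependency-free sketch of the geometry it appears to involve: resizing the longest image side to a fixed size and padding the rest to a square. This is an assumption inferred from the de-normalization function later in this thread, not a copy of the repository code. The name `letterbox_geometry`, the default `imsize=224`, and the centered padding are all hypothetical choices for illustration.

```python
def letterbox_geometry(width, height, imsize=224):
    """Return (scale, resized (w, h), padding (left, top)) for a square pad.

    Assumption: the longest side is resized to `imsize` and the shorter
    side is padded symmetrically. `imsize=224` is an illustrative default.
    """
    max_side = max(width, height)
    scale = imsize / max_side
    # Resize while keeping the aspect ratio.
    new_w = round(width * imsize / max_side)
    new_h = round(height * imsize / max_side)
    # Split the leftover space evenly (centered padding, assumed).
    pad_left = (imsize - new_w) // 2
    pad_top = (imsize - new_h) // 2
    return scale, (new_w, new_h), (pad_left, pad_top)


scale, resized, padding = letterbox_geometry(640, 480)
print(resized, padding)  # (224, 168) (0, 28)
```

If the real pipeline pads differently (for example, only at the bottom and right), the offsets subtracted during de-normalization would have to change accordingly.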

I have also implemented this script in another work of mine; however, the code is not available yet because that work has not been accepted. Please watch for it if you are interested.

Besides, I have found some similar works for you. You can refer to these demos (the implementation is similar):

GLIP: https://colab.research.google.com/drive/12x7v-_miN7-SRiziK3Cx4ffJzstBJNqb
MDETR: https://colab.research.google.com/drive/11xz5IhwqAqHj9-XAIP17yVIuJsLqeYYJ
RelTR: https://colab.research.google.com/drive/1-U642OoCyb8OSM8nx9lme49dmWa_aUcU#scrollTo=TeWdzd5LeOGQ

pyzone49 commented 6 months ago

Thanks. The bbox goes through a lot of processing before being passed to the model. Could you add a function to de-normalize the image, bbox, and text? @linhuixiao

linhuixiao commented 5 months ago

@pyzone49 Yes, it is possible to add a de-normalization function for this. I have roughly simplified a basic demo function as follows, which you can refer to. The full script for single-sample visualization will be released in my other work (HiVG, https://github.com/linhuixiao/HiVG). Please watch that repository if you are interested.

import torch


def convert_pred_output_to_xyxy_box(image_size, pred_bbox):
    h, w = image_size
    max_size = float(max(h, w))
    # The predicted box is currently in normalized (xc, yc, w, h) format.
    pred_bbox = pred_bbox[0].to("cpu")
    # Scale back to pixels on the max_size x max_size square.
    pred_bbox = pred_bbox * max_size
    # Shift the center from the square back to the original image frame.
    pred_bbox[0] = pred_bbox[0] - (max_size - w) / 2
    pred_bbox[1] = pred_bbox[1] - (max_size - h) / 2
    # Convert center format to corner format (x1, y1, x2, y2) and return it.
    return xywh2xyxy(pred_bbox)

and

def xywh2xyxy(x):
    x_c, y_c, w, h = x.unbind(-1)
    b = [(x_c - 0.5 * w), (y_c - 0.5 * h),
         (x_c + 0.5 * w), (y_c + 0.5 * h)]
    return torch.stack(b, dim=-1)
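
As a sanity check on the arithmetic, the same steps can be traced with plain floats and no torch dependency. This is only an illustrative sketch: the name `denormalize_box` and the sample values are made up, and it assumes, as the function above implies, that the model output is an (xc, yc, w, h) box normalized to a square whose side is the longer image side, with the image centered in it.

```python
def denormalize_box(image_size, box):
    """Map a normalized (xc, yc, w, h) box to pixel (x1, y1, x2, y2).

    `image_size` is (height, width) of the original image.
    Assumption: the image sits centered in a max(h, w) square.
    """
    h, w = image_size
    max_size = float(max(h, w))
    # Scale every component back to pixels on the square.
    xc, yc, bw, bh = (v * max_size for v in box)
    # Subtract the offset introduced by centering on each axis.
    xc -= (max_size - w) / 2
    yc -= (max_size - h) / 2
    # Center format -> corner format.
    return (xc - bw / 2, yc - bh / 2, xc + bw / 2, yc + bh / 2)


# A 640x480 (width x height) image with a box centered in the square.
print(denormalize_box((480, 640), (0.5, 0.5, 0.25, 0.25)))
# (240.0, 160.0, 400.0, 320.0)
```

For the 640x480 example, the square has side 640, so only the y coordinate is shifted (by 80 pixels). When drawing the result with matplotlib, note that `matplotlib.patches.Rectangle` takes the top-left corner plus a width and height, that is `(x1, y1)`, `x2 - x1`, `y2 - y1`.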