linhuixiao / CLIP-VG

[TMM 2023] Self-paced Curriculum Adapting of CLIP for Visual Grounding.
Apache License 2.0

Testing the model #11

Open pyzone49 opened 6 months ago

pyzone49 commented 6 months ago

Is there a way to test the model directly on a single image, using a CPU and a trained model? Please let me know.

linhuixiao commented 6 months ago

@pyzone49 Hi, running inference on one image on a single GPU / CPU is actually not difficult. You just need to simplify the code in my project repository. Specifically, extract the image-reading and tensorizing part of the build_dataset function. Then feed the pre-processed image to the pretrained model under torch.no_grad() (that is, without gradient updates), and it will output the normalized grounding box coordinates. After that, de-normalize these coordinates to obtain the bounding box (x, y, w, h) at the original image size, which can be displayed with a visualization script based on matplotlib.

I have also implemented this script in my other work; however, the code is not yet available because that work has not been accepted. Please watch for it if you are interested.

Besides, I have found some similar works for you; maybe you can refer to these demos (the implementation is similar).
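The steps described above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the repository's code: `StandInGrounder` is a hypothetical placeholder for the trained model, `pad_to_square` and the input size of 224 are assumptions, the padding is assumed to be centered, and the mean/std normalization, padding mask, and text tokenization of the real pipeline are omitted.

```python
import torch
import torch.nn.functional as F


class StandInGrounder(torch.nn.Module):
    """Hypothetical placeholder for the trained model (not the real CLIP-VG class)."""

    def forward(self, image, text):
        batch = image.shape[0]
        # Always predicts a centered box, as normalized (x_c, y_c, w, h).
        return torch.tensor([[0.5, 0.5, 0.25, 0.25]]).repeat(batch, 1)


def pad_to_square(image, imsize=224):
    """Resize the long side to `imsize`, then zero-pad (centered) to a square."""
    _, h, w = image.shape
    scale = imsize / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    resized = F.interpolate(
        image[None], size=(new_h, new_w), mode="bilinear", align_corners=False
    )[0]
    canvas = torch.zeros(3, imsize, imsize)
    top, left = (imsize - new_h) // 2, (imsize - new_w) // 2
    canvas[:, top:top + new_h, left:left + new_w] = resized
    return canvas


image = torch.rand(3, 480, 640)          # stand-in for a loaded RGB image tensor
model = StandInGrounder().eval()         # replace with the real trained model
with torch.no_grad():                    # inference only, no gradient tracking
    pred = model(pad_to_square(image)[None], ["the dog on the left"])
# pred holds one normalized (x_c, y_c, w, h) box per image in the batch.
```

With the real model, the returned box would still need to be de-normalized to the original image size before it is drawn.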


pyzone49 commented 6 months ago

Thanks. The bbox goes through a lot of processing before it is passed to the model. Could you add a function to de-normalize the image, bbox, and text? @linhuixiao

linhuixiao commented 5 months ago

@pyzone49 Yes, it is possible to add a de-normalization function to handle this. I have roughly simplified a basic demo function as follows, which you can refer to. The full script for single-sample visualization will be released with my other work (HiVG); please watch for it if you are interested.

import torch


def convert_pred_output_to_xyxy_box(image_size, pred_bbox):
    # image_size is the (height, width) of the original image.
    h, w = image_size
    max_size = float(max(h, w))
    # The predicted box is normalized (x_c, y_c, w, h); take the first sample in the batch.
    pred_bbox = pred_bbox[0].to("cpu")
    # Scale to pixel units of the max_size x max_size padded square.
    pred_bbox = pred_bbox * max_size
    # Shift the center by half the padding on each axis to get original-image coordinates.
    pred_bbox[0] = pred_bbox[0] - (max_size - w) / 2
    pred_bbox[1] = pred_bbox[1] - (max_size - h) / 2
    return xywh2xyxy(pred_bbox)


def xywh2xyxy(x):
    # Convert (x_c, y_c, w, h) to corner format (x_min, y_min, x_max, y_max).
    x_c, y_c, w, h = x.unbind(-1)
    b = [(x_c - 0.5 * w), (y_c - 0.5 * h),
         (x_c + 0.5 * w), (y_c + 0.5 * h)]
    return torch.stack(b, dim=-1)
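To check the arithmetic by hand, here is a small worked example of the same conversion in plain Python, without torch. The helper name `denormalize_box` and the sample numbers are illustrative assumptions, not part of the repository.

```python
def denormalize_box(image_size, box):
    """Map a normalized (x_c, y_c, w, h) box to pixel (x_min, y_min, x_max, y_max).

    Mirrors the steps of the demo function above using plain floats.
    """
    h, w = image_size
    max_size = float(max(h, w))
    # Scale from [0, 1] to pixels of the max_size x max_size square.
    x_c, y_c, bw, bh = (v * max_size for v in box)
    # Shift the center by half the padding on each axis.
    x_c -= (max_size - w) / 2
    y_c -= (max_size - h) / 2
    # Convert center format to corner format.
    return (x_c - 0.5 * bw, y_c - 0.5 * bh, x_c + 0.5 * bw, y_c + 0.5 * bh)


# A 480 x 640 (height x width) image with a centered box covering a quarter of each side.
x_min, y_min, x_max, y_max = denormalize_box((480, 640), (0.5, 0.5, 0.25, 0.25))
print(x_min, y_min, x_max, y_max)  # 240.0 160.0 400.0 320.0
```

Here max_size is 640, so the box scales to a center of (320, 320) with size 160 x 160. The vertical padding of (640 - 480) / 2 = 80 moves the center to (320, 240), giving corners (240, 160) and (400, 320). For matplotlib's Rectangle, the top-left corner plus width and height would be (240, 160, 160, 160).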