lisenjie757 opened 4 months ago
The provided models are optimized for a specific number of inputs. I believe the method should work when adding more exemplars, but I am unsure if it yields better results. If you test this, let me know your findings. I will see if in the future I have the time to add a zero-shot demo.
I would also appreciate a clarification of the difference between the models `base_0_shot.pth` and `DAVE_0_shot.pth`, and between `base_3_shot.pth` and `DAVE_3_shot.pth`.
Moreover, I'm trying to implement a zero-shot demo, but I have some doubts:

1. From the zero-shot test script it looks like the `num_objects` parameter should be 3 (ref); in fact, zero-shot models (i.e. `base_0_shot.pth` and `DAVE_0_shot.pth`) have `objectness` shape equal to (3, 9, 256), where the first dimension is `self.num_objects` (ref). Could you explain why?
2. Running your demo with `--num_objects 3 --model_name DAVE_0_shot --zero_shot` and changing the model path to `os.path.join(args.model_path, 'DAVE_0_shot.pth')`, it works (i.e. no errors), but the model still seems to be taking exemplars into account, since it produces different results when provided with different exemplars, both with and without `--use_query_pos_emb`. Is this expected?

So, could you please provide some hints or code references on how to implement a zero-shot demo?
Thank you!
> I would also appreciate a clarification of the difference between the models `base_0_shot.pth` and `DAVE_0_shot.pth`, `base_3_shot.pth` and `DAVE_3_shot.pth`.
`base_3_shot.pth` and `base_0_shot.pth` are LOCA weights, i.e. only the density-map prediction part, without the weights for bounding-box prediction.
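In practice this means the `base_*` checkpoints cover only a subset of the full model's parameters. A toy illustration (these module names are made up for the sketch, not the real DAVE code) of loading such a partial checkpoint with `strict=False`:

```python
import torch
import torch.nn as nn

# Toy model standing in for the full network: a density branch that the
# "base" (LOCA) checkpoint covers, plus a box branch that it does not.
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.density_head = nn.Linear(4, 1)  # covered by the partial checkpoint
        self.box_head = nn.Linear(4, 4)      # only in the full checkpoint

model = Toy()
partial_ckpt = {"density_head.weight": torch.zeros(1, 4),
                "density_head.bias": torch.zeros(1)}
# strict=False tolerates the missing box_head.* keys instead of raising.
result = model.load_state_dict(partial_ckpt, strict=False)
print(sorted(result.missing_keys))  # ['box_head.bias', 'box_head.weight']
```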
> From the zero-shot test script it looks like the `num_objects` parameter should be 3 (ref); in fact, zero-shot models (i.e. `base_0_shot.pth` and `DAVE_0_shot.pth`) have `objectness` shape equal to (3, 9, 256), where the first dimension is `self.num_objects` (ref). Could you explain why?
This is also part of the LOCA method. In few-shot it pools each exemplar into a 3×3 prototype; when flattened, that gives 9 (the second dimension). This is kept unchanged in zero-shot, which just uses trainable parameters instead of ROI-pooling the exemplars from the image.
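The shape the question asks about can be reproduced with a minimal sketch (variable names here are illustrative, not the actual DAVE source):

```python
import torch
import torch.nn as nn

# LOCA-style prototypes, sketched: in few-shot, each of the 3 exemplars is
# ROI-pooled into a 3x3xC patch and flattened; in zero-shot, a tensor of the
# same shape is simply learned as a free parameter, no exemplars required.
num_objects, pool_size, emb_dim = 3, 3, 256

# Few-shot: ROI-pooled exemplar features (random stand-ins here).
fewshot_prototypes = torch.randn(num_objects, pool_size * pool_size, emb_dim)

# Zero-shot: same layout, but trainable parameters.
objectness = nn.Parameter(torch.randn(num_objects, pool_size * pool_size, emb_dim))
print(objectness.shape)  # torch.Size([3, 9, 256])
```

So the first dimension being 3 is a leftover of the few-shot design (3 exemplars), which explains why `num_objects` stays at 3 even in zero-shot.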
> Running your demo with `--num_objects 3 --model_name DAVE_0_shot --zero_shot` and changing the model path to `os.path.join(args.model_path, 'DAVE_0_shot.pth')`, it works (i.e. no errors), but the model still seems to be taking exemplars into account, since it produces different results when provided with different exemplars, both with and without `--use_query_pos_emb`. Is this expected?
Did you run demo.py with all parameters unchanged except for the model name and the addition of --zero_shot? If not, please try running it this way and let us know if you encounter any issues. I will check this soon and post a demo for the zero-shot setup as soon as possible.
I just saw what the issue probably is: in a few-shot setup, the image is resized based on the exemplar size. Since `demo.py` was created for few-shot counting, it includes this resizing on line 74. Try adapting it: modify the `resize` function in `demo.py` to return at line 24, before it resizes the image based on the bounding boxes.
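The shape of the suggested change can be sketched as follows (the function name and signature are assumptions for illustration, not the actual `demo.py` code):

```python
# Sketch of an exemplar-independent resize: scale the longer side to a fixed
# target and keep the aspect ratio, skipping the bbox-based rescaling that
# the few-shot demo applies afterwards.
def resize_zero_shot(img_w: int, img_h: int, target: int = 512):
    scale = target / max(img_w, img_h)
    return int(round(img_w * scale)), int(round(img_h * scale))
```

For example, `resize_zero_shot(1024, 512)` yields `(512, 256)`: the output no longer depends on any exemplar boxes.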
Additionally, note that DAVE in zero-shot mode performs two passes: in the first pass it estimates the size of the objects, based on which it resizes the image and performs a second pass, which improves the results (see `main.py`).
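The two-pass idea reduces to a small control-flow skeleton (names here are placeholders; the real logic lives in `main.py`):

```python
# Skeleton of the two-pass zero-shot inference described above.
def two_pass_count(model, image, resize_from_boxes):
    # Pass 1: rough prediction, mainly to estimate the typical object size.
    density, boxes = model(image)
    # Rescale the image so objects match the scale the model works best at.
    rescaled = resize_from_boxes(image, boxes)
    # Pass 2: refined prediction on the rescaled image.
    density, boxes = model(rescaled)
    return density, boxes
```

Here `model` is any callable returning a density map and detected boxes, and `resize_from_boxes` rescales the image using the pass-1 box sizes.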
Hi @jerpelhan, thank you for your reply.
> Did you run `demo.py` with all parameters unchanged except for the model name and the addition of `--zero_shot`?
I tried running `demo.py` as you suggested. It doesn't raise any error, but it still requires the exemplars, and when they are provided they appear to be taken into account even though the `--zero_shot` parameter is used, since the results change with different exemplars. It may be due to the different resizing that is applied.
> Modify the function `resize` in `demo.py` to return in line 24, before it resizes the image based on the bounding boxes.
I finally found out what was making the code fail without the exemplars, even after removing the `resize` function's dependency on the bounding boxes: in `COTR.forward()`, the line `self.num_objects = bboxes.shape[1]` caused a `RuntimeError: shape '[1, 0, 3, 3, -1]' is invalid for input of size 6912` when no bbox was provided. Commenting out this line seems to make the code work.
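Instead of deleting the line outright, a guard could keep few-shot behavior intact while falling back to the default in zero-shot. A minimal sketch, assuming `bboxes` is empty (zero entries along dimension 1) or absent in zero-shot mode (the helper name is hypothetical, not DAVE code):

```python
# Sketch of a guarded version of `self.num_objects = bboxes.shape[1]`:
# only take the count from the exemplar boxes when some were provided,
# otherwise keep the default of 3 trainable zero-shot prototypes.
def set_num_objects(module, bboxes, default=3):
    if bboxes is not None and bboxes.shape[1] > 0:
        module.num_objects = bboxes.shape[1]
    else:
        module.num_objects = default
```

With this, an empty `bboxes` tensor of shape `(1, 0, 4)` would leave `num_objects` at 3 rather than setting it to 0, which is what produced the invalid `[1, 0, 3, 3, -1]` reshape.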
(image from the Video Object Counting Dataset)
> Additionally, note that DAVE in zero-shot performs two passes. In the first pass, it estimates the size of objects, based on which it resizes the image and performs a second pass, which improves the results (see `main.py`).
I'll try to add this two-pass approach to the zero-shot demo as well, thank you very much.
Also, the demo only provides 3-shot inference. How can I run zero-shot inference on my own image? Could you provide a `demo_zero`? Thank you!