hkchengrex / Tracking-Anything-with-DEVA

[ICCV 2023] Tracking Anything with Decoupled Video Segmentation
https://hkchengrex.com/Tracking-Anything-with-DEVA/

Using DEVA with a different mask model causes issues with mask encoding #65

Closed: vineetparikh closed this issue 8 months ago

vineetparikh commented 8 months ago

Hi there,

I'm using a different image model called "h23" on a new dataset of videos. I've built an extension that uses h23 as the detection/segmentation model in place of the GroundingDINO+SAM pipeline used by the existing demos, and I'm now testing on a new video where not every frame is guaranteed to have a mask.

When evaluating on the video, the model initially extracts tracks fine but then runs into the size-mismatch/concat error below. What exactly is going on? For context, the frames are 456x256:

Traceback (most recent call last):
  File "/home/vap43/Tracking-Anything-with-DEVA/demo/demo_with_h23.py", line 87, in <module>
    process_frame(deva, h23_model, im_path, result_saver, ti, image_np=frame)
  File "/home/vap43/.conda/envs/amino_hamer/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/vap43/Tracking-Anything-with-DEVA/deva/ext/with_h23_segmentor.py", line 73, in process_frame_with_h23
    _, mask, new_segments_info = deva.vote_in_temporary_buffer(
  File "/home/vap43/Tracking-Anything-with-DEVA/deva/inference/inference_core.py", line 122, in vote_in_temporary_buffer
    projected_ti, projected_mask, projected_info = find_consensus_auto_association(
  File "/home/vap43/Tracking-Anything-with-DEVA/deva/inference/consensus_automatic.py", line 165, in find_consensus_auto_association
    projected_mask = spatial_alignment(ti, image, mask, keyframe_ti, keyframe_image,
  File "/home/vap43/Tracking-Anything-with-DEVA/deva/inference/consensus_associated.py", line 40, in spatial_alignment
    value, sensory = network.encode_mask(src_image,
  File "/home/vap43/Tracking-Anything-with-DEVA/deva/model/network.py", line 54, in encode_mask
    g16, h16 = self.mask_encoder(image,
  File "/home/vap43/.conda/envs/amino_hamer/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/vap43/Tracking-Anything-with-DEVA/deva/model/big_modules.py", line 84, in forward
    g = self.distributor(image, g)
  File "/home/vap43/.conda/envs/amino_hamer/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/vap43/Tracking-Anything-with-DEVA/deva/model/group_modules.py", line 120, in forward
    g = torch.cat([x, g], 2)
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 480 but got size 256 for tensor number 1 in the list.
/home/vap43/Tracking-Anything-with-DEVA/deva/inference/image_feature_store.py:48: UserWarning: Leaking dict_keys([111, 110]) in the image feature store
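
For reference, a minimal shape check ahead of process_frame would surface the mismatch; frame_for_deva and h23_mask are placeholder names for whatever my extension passes around, not DEVA API:

    # Placeholder names: frame_for_deva / h23_mask stand for whatever the
    # extension hands to process_frame. The two spatial sizes must agree.
    print('frame:', frame_for_deva.shape[-2:])  # e.g. torch.Size([480, 855])
    print('mask :', h23_mask.shape[-2:])        # torch.Size([256, 456]) -> mismatch
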
hkchengrex commented 8 months ago

It seems to me that the detection model is producing masks of different dimensions. Can you compare that against existing models (e.g., w/ grounded-sam)?
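For context, the torch.cat that fails concatenates the frame with the per-object masks, and the demos resize the shorter side of every frame to the --size argument (480 by default), so a mask left at the video's native 256-pixel height cannot line up. A sketch of a nearest-neighbour resize that brings a detector mask to the frame's resolution while preserving integer object ids (align_mask is a hypothetical helper, not part of DEVA):

    import torch
    import torch.nn.functional as F

    def align_mask(mask: torch.Tensor, frame_hw) -> torch.Tensor:
        # Hypothetical helper: resize an (H, W) integer id mask to frame_hw.
        # Nearest-neighbour interpolation keeps ids intact (no blended labels).
        if tuple(mask.shape[-2:]) == tuple(frame_hw):
            return mask
        out = F.interpolate(mask[None, None].float(), size=tuple(frame_hw),
                            mode='nearest')
        return out[0, 0].to(mask.dtype)
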

hkchengrex commented 8 months ago

Feel free to re-open if there are follow-up questions.

vineetparikh commented 8 months ago

Hi, sorry for not following up! I was able to fix this by resizing one dimension to 480, which somehow worked :shrug:
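
That is consistent with DEVA's default --size of 480: frames are resized so the shorter side is 480 (a 456x256 video becomes 855x480 internally), and the detection masks have to follow the same resize. A sketch of that transform, mirroring the demos' behaviour (resize_shorter_side is a hypothetical helper, not part of DEVA):

    import torch.nn.functional as F

    def resize_shorter_side(mask, size=480):
        # Hypothetical helper: scale an (H, W) integer id mask so its shorter
        # side equals `size`, mirroring DEVA's default frame resizing.
        h, w = mask.shape[-2:]
        scale = size / min(h, w)
        new_hw = (round(h * scale), round(w * scale))
        out = F.interpolate(mask[None, None].float(), size=new_hw, mode='nearest')
        return out[0, 0].to(mask.dtype)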