Closed vineetparikh closed 8 months ago
It seems to me that the detection model is producing masks of different dimensions. Can you compare its mask shapes against those from existing models (e.g., with grounded-sam)?
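One quick way to check this is to log the mask shape for every frame and see which frames disagree with the rest. The snippet below is only a sketch: `summarize_mask_shapes` is a hypothetical helper (not part of any tracker or of grounded-sam), and it assumes you can collect one `(height, width)` tuple per frame, with `None` for frames where no mask was produced.

```python
from collections import Counter


def summarize_mask_shapes(shapes):
    """Return the most common mask shape and the indices of frames that differ.

    `shapes` holds one entry per frame: a (height, width) tuple,
    or None when the frame has no mask (assumed convention).
    """
    counts = Counter(s for s in shapes if s is not None)
    if not counts:
        # No frame produced a mask, so there is nothing to compare.
        return None, []
    expected, _ = counts.most_common(1)[0]
    mismatched = [
        i for i, s in enumerate(shapes) if s is not None and s != expected
    ]
    return expected, mismatched


# Made-up shapes for illustration: frame 2 has no mask, frame 3 differs.
shapes = [(256, 456), (256, 456), None, (480, 854), (256, 456)]
expected, mismatched = summarize_mask_shapes(shapes)
print(expected, mismatched)
```

If the list of mismatched frames is non-empty, those are the frames likely to break a concatenation along the time axis.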
Feel free to re-open if there are follow-up questions.
Hi, sorry for not following up! I was able to fix this by resizing one dimension to 480, which somehow worked :shrug:
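For anyone landing here later: the exact resize call isn't recorded in this thread, so the following is just an illustration of the general idea, which is to bring every mask to one agreed `(height, width)` before concatenating. `resize_mask_nearest` is a hypothetical pure-Python helper written for this example; in practice you would likely use your framework's own resize with nearest-neighbor interpolation so that binary masks stay binary.

```python
def resize_mask_nearest(mask, out_h, out_w):
    """Nearest-neighbor resize of a 2D mask given as a list of rows."""
    in_h, in_w = len(mask), len(mask[0])
    # Map each output row/column back to a source row/column index.
    rows = [min(in_h - 1, (y * in_h) // out_h) for y in range(out_h)]
    cols = [min(in_w - 1, (x * in_w) // out_w) for x in range(out_w)]
    return [[mask[r][c] for c in cols] for r in rows]


# Tiny made-up mask: 2 rows x 3 columns, upscaled to 4 x 6.
small = [
    [0, 1, 1],
    [1, 0, 0],
]
resized = resize_mask_nearest(small, 4, 6)
for row in resized:
    print(row)
```

Nearest-neighbor is used here because interpolating a 0/1 mask with bilinear or bicubic filters would introduce fractional values that then need re-thresholding.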
Hi there,
I'm using a different image model called "h23" on a new dataset of videos. I've created an extension that uses h23 as the detection/segmentation model instead of the GroundingDINO+SAM model used in other work, and am testing on a new video now (where not all frames are guaranteed to have a mask).
When evaluating on the video, I'm finding that the model can initially extract tracks but then runs into this mismatch/concat error. What exactly is going on? For context, the images are 456x256.