hkchengrex / Tracking-Anything-with-DEVA

[ICCV 2023] Tracking Anything with Decoupled Video Segmentation
https://hkchengrex.com/Tracking-Anything-with-DEVA/

Wrong segmentation appears after several objects come and go from the scene #64

Closed be-hd closed 6 months ago

be-hd commented 7 months ago

The (Gradio) demo works fine with short videos or videos with only a few objects. With longer videos (10-20 seconds), where old objects disappear and new objects enter the scene, the segmentation starts to go wrong: for example, it segments a building or pavement as trucks/cars, or produces completely nonsensical segments. The per-frame segmentation is correct (SAM is working as expected), so the bad segments seem to come from matching objects between the current frame and previous frames. This happens in both online and semionline modes.

hkchengrex commented 7 months ago

Can you upload the source video for me? The propagation module usually has a harder time handling scenes with a moving camera but I want to test this example and understand it better.

be-hd commented 7 months ago

Please see the attached video (from the MOT challenge). I shortened it due to the size limit, but you can see the issue near the end. Thanks. https://github.com/hkchengrex/Tracking-Anything-with-DEVA/assets/162417921/e1b27803-9354-4fc0-bff5-531c782a3e0f

hkchengrex commented 6 months ago

Thank you! I spotted a major bug that was preventing unmatched segments from being deleted: a boolean flag was set to True when it shouldn't have been. The fix reduces the noise in the output (there is still some, but the accumulation is much less severe). Recall is still low, but I am getting similar outputs from GroundingDINO using the same threshold and prompt, so there isn't much that we can do.

Thanks again.

https://github.com/hkchengrex/Tracking-Anything-with-DEVA/assets/7107196/5519804b-cae3-4cf7-b891-14b23230fae8

be-hd commented 6 months ago

Thanks for fixing this. It does look much better now, but I still see phantom segmentations after a while. I tried different parameters (online/semionline, reducing max_long_term_elements, ...), but none of them eliminates these. It looks like, instead of denoising the per-frame segmentation, the propagation brings in the phantom segments. Would it help if, during temporal propagation, the similarity were calculated based on the detected segments only, and not on the whole image?

hkchengrex commented 6 months ago

The most important hyperparameters for this problem are probably just the threshold and the "delete segment if undetected for [X]" one. Increasing the threshold and decreasing the deletion interval should help decrease the false positive rate.
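
As a rough illustration of the "delete segment if undetected for [X]" rule mentioned above, here is a minimal sketch. It is not DEVA's actual code: the class names, the `max_missed` parameter, and the assumption that the counter resets on every successful match are all hypothetical. It only shows why a smaller deletion interval removes unmatched (possibly phantom) segments sooner.

```python
from dataclasses import dataclass, field


@dataclass
class TrackedSegment:
    """One tracked object. Hypothetical structure, not DEVA's API."""
    segment_id: int
    missed_count: int = 0  # consecutive detection passes without a match


@dataclass
class SegmentMemory:
    # Assumed meaning of "delete segment if undetected for [X]".
    max_missed: int
    segments: dict = field(default_factory=dict)

    def add(self, segment_id):
        self.segments[segment_id] = TrackedSegment(segment_id)

    def update(self, matched_ids):
        """Call once per detection pass with the IDs that matched a detection.

        Returns the list of segment IDs deleted during this pass.
        """
        deleted = []
        for seg_id, seg in list(self.segments.items()):
            if seg_id in matched_ids:
                seg.missed_count = 0  # seen again, so reset the counter
            else:
                seg.missed_count += 1
                if seg.missed_count >= self.max_missed:
                    del self.segments[seg_id]
                    deleted.append(seg_id)
        return deleted


memory = SegmentMemory(max_missed=2)
memory.add(1)
memory.add(2)
first_pass = memory.update({1})   # segment 2 missed once, kept for now
second_pass = memory.update({1})  # segment 2 missed twice, deleted
print(first_pass, second_pass, sorted(memory.segments))
```

In this sketch, a flag that wrongly skipped the `else` branch would let unmatched segments accumulate indefinitely, which is one plausible way a bug of the kind described earlier in the thread could show up.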

Ultimately, the merging works on top of the input segmentations and might struggle to balance false positives against false negatives. Indeed, the temporal propagation module sometimes introduces noise (in this example, sometimes expanding error regions of just a few pixels into a larger area), but that is the price to be paid for a higher true positive rate. If I run the baseline image detection model on this video with the same threshold, I see quite low recall, which the temporal model can help with.
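
The trade-off between false positives and false negatives can be made concrete with a small, self-contained sketch. The scores and labels below are toy values invented purely for illustration (they are not measurements from this video or from GroundingDINO), and `precision_recall` is a hypothetical helper, not part of DEVA.

```python
def precision_recall(scores, is_true_object, threshold):
    """Precision and recall of the detections kept at `threshold`."""
    kept = [truth for score, truth in zip(scores, is_true_object)
            if score >= threshold]
    true_positives = sum(kept)
    total_true = sum(is_true_object)
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / total_true if total_true else 1.0
    return precision, recall


# Toy detection confidences and whether each one is a real object.
scores = [0.9, 0.8, 0.6, 0.5, 0.4, 0.3]
is_true_object = [True, True, False, True, False, True]

low = precision_recall(scores, is_true_object, threshold=0.35)
high = precision_recall(scores, is_true_object, threshold=0.7)
print("low threshold:", low)
print("high threshold:", high)
```

With these toy numbers, raising the threshold removes the false positives but also drops real objects, which is the same tension the detection threshold creates in the pipeline.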

I'm not sure what you meant by "the similarity is calculated based on detected segments only, and not on the whole image".

hkchengrex commented 6 months ago

Feel free to re-open if there are follow-up questions.