hkchengrex / Tracking-Anything-with-DEVA

[ICCV 2023] Tracking Anything with Decoupled Video Segmentation
https://hkchengrex.com/Tracking-Anything-with-DEVA/

Just want to confirm! #81

Closed JawadTawhidi closed 4 months ago

JawadTawhidi commented 5 months ago

Hi, I am citing DEVA in my paper. Could you please confirm or correct my explanation of DEVA's approach as used for DAVIS-2017?

I want to explain it like this:

For DAVIS-2017 (multi-object), EntitySeg is used as the image segmentation model and a semi-online protocol is followed. The semi-online protocol combines in-clip consensus with temporal propagation every 5 frames with a clip size of n=3. Specifically, the process starts by performing the initial in-clip consensus on 3 frames at the beginning of the video. The segmentation mask with the highest confidence, generated by image segmentation model, is chosen. This mask is then propagated for the 5 initial frames. Next, another 3 frames are selected, and in-clip consensus is performed again. The result of this in-clip consensus is merged with the temporal propagation from the previous frames, and the final result is propagated for the next 5 frames. These steps are repeated throughout the video.

hkchengrex commented 5 months ago

The segmentation mask with the highest confidence, generated by image segmentation model, is chosen.

This is not true. We merge them by voting.

JawadTawhidi commented 5 months ago

Hi again, thank you so much.

I have updated it. I would be very grateful if you could confirm or correct the following.

For DAVIS-2016 (single-object), the image saliency model DIS is incorporated as the image model and an offline setting is employed. In its offline setting, the initial in-clip consensus is performed by selecting 10 uniformly spaced frames in the video and choosing the frame with the highest confidence given by the image model as a key frame for aligning the other frames. Then forward and backward propagation is performed from the key frame without using additional image segmentations.

For DAVIS-2017 (multi-object), EntitySeg is used as the image segmentation model, and a semi-online protocol is followed. The semi-online protocol combines in-clip consensus with temporal propagation every 5 frames with a clip size of n=3. Specifically, the process starts by performing the initial in-clip consensus on 3 frames at the beginning of the video. The segmentation masks generated by the image segmentation model are merged by voting, and the one with the highest support is chosen. This mask is then propagated for the 5 initial frames. Next, another 3 frames are selected, and the in-clip consensus is performed again. The result of this in-clip consensus is merged with the temporal propagation from the previous frames, and the final result is propagated for the next 5 frames. These steps are repeated throughout the video.
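For what it's worth, the scheduling described above (not the segmentation itself) can be sketched in a few lines of toy Python. Every function here (`image_model`, `vote`, `merge`, `propagate`) is a placeholder stub I made up for illustration, not DEVA's actual API, and the merge step is heavily simplified:

```python
def image_model(frame):
    # stub standing in for the image segmentation model (e.g. EntitySeg);
    # returns a fake "mask" as a small integer
    return frame % 4

def vote(proposals):
    # in-clip consensus, simplified to a majority vote over proposals
    return max(set(proposals), key=proposals.count)

def merge(propagated, consensus):
    # stub: DEVA merges propagated segments with the consensus by voting
    # at the segment level; this toy simply keeps the consensus
    return consensus

def propagate(mask, frame):
    # stub: temporal propagation just carries the previous mask forward
    return mask

def semi_online(frames, clip_size=3, consensus_every=5):
    outputs = []
    mask = None
    for t, frame in enumerate(frames):
        if t % consensus_every == 0:
            # every 5th frame: in-clip consensus over the next clip_size frames
            clip = frames[t:t + clip_size]
            consensus = vote([image_model(f) for f in clip])
            # the first clip has no past result to merge with
            mask = consensus if mask is None else merge(propagate(mask, frame), consensus)
        else:
            # remaining frames: pure temporal propagation
            mask = propagate(mask, frame)
        outputs.append(mask)
    return outputs
```

The point of the sketch is only the alternation: consensus every `consensus_every` frames, propagation in between, with the consensus merged into the propagated result after the first clip.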

hkchengrex commented 5 months ago

DIS does not provide confidence as far as I know. You can look at the paper's appendix for all the evaluation details. It looks like you are guessing about some of the details so reading that part might help.

JawadTawhidi commented 5 months ago

If possible, please edit my text. I understood the point that DIS does not provide a confidence score, but I cannot explain it. Please edit my text; it would be a very big help.

hkchengrex commented 5 months ago

I'm afraid that I cannot edit or proofread your work.