hkchengrex / Cutie

[CVPR 2024 Highlight] Putting the Object Back Into Video Object Segmentation
https://hkchengrex.com/Cutie/
MIT License
732 stars 71 forks

'a mask prediction' in Sec. 3.2.2 of Paper #88

Closed Huster-Hq closed 4 months ago

Huster-Hq commented 4 months ago

Is the mask prediction single channel, i.e., H×W×1?

image

hkchengrex commented 4 months ago

Yes.

Huster-Hq commented 4 months ago

I have a question about the details of the object memory.

  1. The object memory is computed with N pooling masks $W$. However, these pooling masks do not have a constraint label, unlike the mask $M_l$, which is projected from the pixel features and constrained by the GT mask. I don't understand what information these pooling masks contain, or why half of them can be foreground predictions and the other half background predictions. Have you visualized these masks directly?

Huster-Hq commented 4 months ago

Isn't $W$ generated from the memory feature $F$ through an MLP?

image

Huster-Hq commented 4 months ago

> What do you mean by "constraint label"? W is directly constructed from M_l in the screenshot that you provided. There are no additional transformations. Those masks are just the masks in Figure 4 (and their inverse).

Figure 4 shows $M_l$ rather than the pooling masks $W$.

hkchengrex commented 4 months ago

Oh, right. Sorry -- it slipped my mind. We have visualized them before at some point. IIRC those masks are rather diffuse and don't have very recognizable patterns. They are learned end-to-end.
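
For readers following the thread, here is a minimal sketch of what mask pooling with learned pooling masks could look like. It is not taken from the Cutie codebase. The function names, the toy shapes, and the exact gating rule (first half of the masks restricted to foreground pixels, second half to background pixels) are assumptions based only on the description in this issue, and random numbers stand in for the MLP output. Plain Python is used to keep it dependency-free.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_pooling_masks(logits, object_mask):
    """Turn raw scores into N soft pooling masks (hypothetical gating rule).

    logits:      N lists of P raw scores, a stand-in for a learned MLP output
    object_mask: P values in {0, 1}, 1 marks a foreground pixel
    The first N // 2 masks are zeroed on background pixels,
    the remaining masks are zeroed on foreground pixels.
    """
    n = len(logits)
    masks = []
    for q, row in enumerate(logits):
        foreground_half = q < n // 2
        masks.append([
            sigmoid(s) * (m if foreground_half else 1.0 - m)
            for s, m in zip(row, object_mask)
        ])
    return masks


def mask_pool(features, pooling_masks, eps=1e-8):
    """Weighted average of per-pixel features under each pooling mask.

    features:      P pixels, each a list of C floats
    pooling_masks: N masks, each a list of P weights in [0, 1]
    returns:       N pooled vectors of length C
    """
    num_channels = len(features[0])
    pooled = []
    for mask in pooling_masks:
        total = sum(mask) + eps  # eps avoids division by zero for empty masks
        pooled.append([
            sum(w * pixel[c] for w, pixel in zip(mask, features)) / total
            for c in range(num_channels)
        ])
    return pooled


# Toy example: P = 6 pixels, C = 4 channels, N = 4 pooling masks.
random.seed(0)
P, C, N = 6, 4, 4
features = [[random.gauss(0, 1) for _ in range(C)] for _ in range(P)]
object_mask = [1, 1, 1, 0, 0, 0]
logits = [[random.gauss(0, 1) for _ in range(P)] for _ in range(N)]

masks = make_pooling_masks(logits, object_mask)
object_memory = mask_pool(features, masks)
print(len(object_memory), len(object_memory[0]))  # 4 4
```

In this sketch nothing supervises the individual masks; in a trained model the scores would only receive gradients through the final segmentation loss, which is consistent with the remark above that the masks are learned end-to-end and look diffuse.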