hkchengrex / Cutie

[CVPR 2024 Highlight] Putting the Object Back Into Video Object Segmentation
https://hkchengrex.com/Cutie/
MIT License
579 stars, 60 forks

What are the C and P dimensions? #64

Closed by tonydavis629 2 months ago

tonydavis629 commented 2 months ago

Rex, in your paper you refer to the C (or C^k) dimension, but I can't find where C is defined. Is it the embedding dimension?

Also, the code refers to a value P, as in B x CK x [HW/P] - Query keys. I'm assuming HW is image height and width, but what is P?

I'm working on strategies to reduce Cutie's memory requirements for high resolution images, but the dimensionality of the similarity/affinity matrix is really severe, so I'm looking for any opportunities to reduce this.

hkchengrex commented 2 months ago

Hi.

In code, C in isolation denotes some channel size; the exact meaning is context-dependent. In the paper, C is a shared channel size for most of the operations, except for the key tensor (whose channel size is C^k). See https://github.com/hkchengrex/Cutie/blob/2ac7ac21d048e7ff8b2b033a084e0e4ea7b1216c/cutie/config/model/base.yaml#L4-L8 where C^k is 64, and the other entries, all set to 256, jointly correspond to C. We experimented with different values before (which is why the config allows them to be set independently), but found that it is easier to tie them to a single value.

P is a value inherited from XMem. It denotes the number of prototypes (Section 3.3 of XMem). Semantically, [HW/P] denotes the total number of query elements, which is either HW or P depending on the context. During memory reading, it is the number of pixels HW; during memory potentiation, it is the number of prototypes P.
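To make the B x CK x [HW/P] notation concrete, here is a minimal sketch in plain Python. It is not taken from the Cutie code: the helper `query_key_shape` and all sizes other than CK = 64 are hypothetical and only illustrate that the last dimension is HW in one case and P in the other.

```python
def query_key_shape(batch, key_channels, height, width, num_prototypes=None):
    """Return the shape B x CK x [HW/P] of the query keys.

    Hypothetical helper for illustration only.
    With num_prototypes=None the queries are all pixels (memory reading),
    so the last dimension is H * W. Otherwise the queries are the P
    prototypes (memory potentiation), so the last dimension is P.
    """
    num_queries = height * width if num_prototypes is None else num_prototypes
    return (batch, key_channels, num_queries)


# CK = 64 is the value mentioned in this thread; B, H, W and P are made up.
reading_shape = query_key_shape(batch=1, key_channels=64, height=30, width=54)
potentiation_shape = query_key_shape(1, 64, 30, 54, num_prototypes=128)

print(reading_shape)       # (1, 64, 1620)
print(potentiation_shape)  # (1, 64, 128)
```

In both cases the tensor has the same rank; only the number of query elements in the last dimension changes.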

tonydavis629 commented 2 months ago

Ah, so [HW/P] means HW or P, not HW divided by P. I see, thank you.
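On the memory concern raised in the original question, a rough back-of-the-envelope estimate shows why the affinity matrix grows so quickly with resolution. The sketch below is not from the Cutie code and rests on several assumptions: features are computed at stride 16, every memory frame contributes all of its HW elements, the matrix is stored densely in float32, and the helper name `affinity_bytes` is hypothetical. Actual usage depends on the real implementation and configuration.

```python
def affinity_bytes(image_h, image_w, num_memory_frames,
                   stride=16, bytes_per_element=4, batch=1):
    """Estimate the size in bytes of a dense affinity matrix.

    Assumptions (not confirmed by the thread): feature stride of 16,
    float32 storage, and num_memory_frames * HW memory elements.
    """
    feat_h = image_h // stride
    feat_w = image_w // stride
    num_queries = feat_h * feat_w                 # HW during memory reading
    num_memory = num_memory_frames * num_queries  # memory elements
    return batch * num_memory * num_queries * bytes_per_element


small = affinity_bytes(1024, 1024, num_memory_frames=5)
large = affinity_bytes(2048, 2048, num_memory_frames=5)

print(small / 2**20)   # 320.0 (MiB)
print(large / 2**20)   # 5120.0 (MiB)
print(large // small)  # 16
```

Under these assumptions, doubling the image side length multiplies the matrix size by 16, because both the number of memory elements and the number of query elements scale with HW.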