Foreground image mask F = M_n ⊗ I, background image mask B = (1 − M_n) ⊗ I_n

Good job! Is I the image itself or the ViT output feature of the image? Also, where can I get the supplementary material?

Hello, thank you for bringing this up.

The updated paper (http://arxiv.org/abs/2403.11107) now clarifies that we multiply the predicted mask M_n from stage 1 with the ViT patch embeddings, after resizing M_n to the height and width of the patch embedding tensor. To obtain the foreground and background average embeddings, we do not mask the image directly with the predicted cross-attention map, which is the approach followed in existing works.
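For readers who want a concrete picture of the masking step, here is a minimal NumPy sketch. It is not taken from the repository: the function names, the nearest-neighbour resize, and the normalisation by the mask sum are all assumptions made for illustration, and the actual code may use a different interpolation or pooling scheme.

```python
import numpy as np


def resize_mask_nearest(mask, out_h, out_w):
    """Nearest-neighbour resize of a 2-D mask to (out_h, out_w).

    Assumption: the thread only says the mask is resized to the patch grid;
    the interpolation mode used here is illustrative.
    """
    in_h, in_w = mask.shape
    rows = (np.arange(out_h) * in_h) // out_h
    cols = (np.arange(out_w) * in_w) // out_w
    return mask[rows[:, None], cols[None, :]]


def masked_average_embeddings(patch_emb, mask, eps=1e-8):
    """Return (foreground, background) average embeddings.

    patch_emb: array of shape (H, W, D), ViT patch embeddings on the patch grid.
    mask:      array of shape (h, w) with values in [0, 1], the predicted M_n.
    Assumption: averages are normalised by the sum of the mask weights.
    """
    H, W, _ = patch_emb.shape
    m = resize_mask_nearest(mask.astype(np.float64), H, W)[..., None]  # (H, W, 1)
    fg = (m * patch_emb).sum(axis=(0, 1)) / (m.sum() + eps)
    bg = ((1.0 - m) * patch_emb).sum(axis=(0, 1)) / ((1.0 - m).sum() + eps)
    return fg, bg


# Toy example: a 2x2 patch grid with 2-D embeddings and a 4x4 mask
# whose left half is foreground.
emb = np.zeros((2, 2, 2))
emb[:, 0] = [1.0, 0.0]  # left-column patches
emb[:, 1] = [0.0, 1.0]  # right-column patches
mask = np.zeros((4, 4))
mask[:, :2] = 1.0
fg, bg = masked_average_embeddings(emb, mask)
```

The point of the sketch is that the mask weights the patch embeddings on the patch grid rather than the pixels of the input image, so the ViT sees the full, unmasked image.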

Also, we added the supplementary material at the end of the main paper. You can find it at the updated paper link above. Thanks!

sourachakra / SCoSPARC
