Open zhending111 opened 5 months ago
Good job! Is I the image or the ViT output feature of the image? Where can I get the supplementary material?
Hello, thank you for bringing this up.
The updated paper (http://arxiv.org/abs/2403.11107) now reflects that we multiply the predicted mask M_n from stage 1 with the ViT patch embeddings, after resizing M_n to the height and width of the patch embedding tensor. We do not mask the image directly with the predicted cross-attention map to obtain the foreground and background average embeddings (the approach followed in existing works).
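For readers who want to see the idea concretely, here is a minimal NumPy sketch of masking patch embeddings rather than pixels. It is not taken from the paper's code: the function names, the nearest-neighbour resize, the `(H, W, D)` layout, and the normalization by the mask sum are all assumptions made for illustration.

```python
import numpy as np


def resize_mask_nearest(mask, out_h, out_w):
    """Nearest-neighbour resize of a 2D mask to (out_h, out_w).

    Assumption: the actual interpolation mode may differ (e.g. bilinear).
    """
    in_h, in_w = mask.shape
    rows = np.arange(out_h) * in_h // out_h  # source row for each output row
    cols = np.arange(out_w) * in_w // out_w  # source col for each output col
    return mask[rows[:, None], cols[None, :]]


def masked_average_embeddings(patch_emb, mask, eps=1e-6):
    """Hypothetical helper: foreground/background average embeddings.

    patch_emb: (H, W, D) grid of ViT patch embeddings.
    mask:      (h, w) predicted mask with values in [0, 1], any resolution.
    Returns two vectors of shape (D,).
    """
    H, W, _ = patch_emb.shape
    # Resize the mask to the patch grid, then add a channel axis to broadcast.
    m = resize_mask_nearest(mask.astype(patch_emb.dtype), H, W)[..., None]
    # Weighted averages over the spatial grid (normalization is an assumption).
    fg = (patch_emb * m).sum(axis=(0, 1)) / (m.sum() + eps)
    bg = (patch_emb * (1.0 - m)).sum(axis=(0, 1)) / ((1.0 - m).sum() + eps)
    return fg, bg
```

Calling `masked_average_embeddings(patch_emb, M_n)` with a stage-1 mask at image resolution would return one foreground and one background vector, each with the embedding dimension `D`.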
Also, we added the supplementary material at the end of the main paper; please see the updated paper linked above. Thanks!