ibalazevic / ibalazevic.github.io


Questions about "Towards In-context Scene Understanding" #1

Closed · pb585 closed this issue 1 year ago

pb585 commented 1 year ago

I have read your paper "Towards In-context Scene Understanding" and found it very effective. Thank you for your contribution. However, I am somewhat confused about how the networks in the paper are trained with self-supervised methods. My confusion mainly concerns Section 3.2.

Question 1: During training, does the memory bank store all of the more than 1 million images in the training set?

Question 2: In Appendix B you mention that the length of the memory is 153,600. Does this mean that the features of 153,600 images need to be saved? If so, how are the correct samples selected from the huge training set?

Question 3: Is the memory constructed only from the output of the target network? If so, do both the online network's output and the target network's output compute similarity with the features in the memory, or does only the online network do this calculation?

ibalazevic commented 1 year ago

Hi, thank you for your questions.

  1. The memory bank size at training time is different to that at test time: at training time, it contains 153,600 mean-pooled image embeddings from previous batches, so not the whole training set.
  2. We don't select the features, we just mean pool the spatial maps (see Eq. 2) of images from previous batches. When the size of the memory bank exceeds 153,600, we discard the oldest memories (a FIFO buffer, sketched below).
  3. Both the online and the target network include the contextual pretraining step, even though the memory is constructed purely from the target-network embeddings.
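For concreteness, here is a minimal sketch of the training-time memory bank as described above: a FIFO buffer of mean-pooled target-network embeddings capped at 153,600 entries. The class and method names (`MemoryBank`, `push`, `as_tensor`) and the embedding dimension are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the training-time memory bank: a FIFO buffer of
# mean-pooled image embeddings from previous batches, capped at 153,600.
# Names and the embedding dimension are illustrative, not the paper's code.
from collections import deque

import torch


class MemoryBank:
    def __init__(self, capacity: int = 153_600, dim: int = 768):
        self.capacity = capacity
        self.dim = dim
        self.buffer = deque(maxlen=capacity)  # full buffer drops the oldest entry

    def push(self, spatial_maps: torch.Tensor) -> None:
        """spatial_maps: [batch, num_patches, dim] target-network features."""
        # Mean pool over the spatial dimension (cf. Eq. 2), one embedding per image.
        pooled = spatial_maps.mean(dim=1)  # [batch, dim]
        for emb in pooled.detach():
            self.buffer.append(emb)        # FIFO insertion

    def as_tensor(self) -> torch.Tensor:
        """Return the current memory as a [len, dim] tensor for retrieval."""
        if not self.buffer:
            return torch.empty(0, self.dim)
        return torch.stack(list(self.buffer))
```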
pb585 commented 1 year ago

Thank you for your reply. I am very excited to receive a response about "Towards In-context Scene Understanding". Thank you again. Can I understand the memory in the training stage as a first-in, first-out queue? The queue stores the feature embeddings of 153,600 images, constructed using Eq. 2, and as training continues the features in the queue are updated. If so, I am a little confused: the current batch of training data computes similarity with the features in the memory and fuses them into new feature representations. How can we ensure that the categories of the current batch of images are present in the memory? Is it because 153,600 images are large enough a number? Or is my understanding wrong, and it is not necessary for the categories of the current batch to also appear in the memory? If the categories of the current batch are not present in the memory, does the step in Section 3.2 of the paper still make sense? Finally, your paper is excellent. I am a fan of yours.

ibalazevic commented 1 year ago

I am glad you liked the paper! Yes, it's a FIFO queue. You are right, we cannot be completely sure that the current categories will appear, but we are fairly confident given that the memory bank is 150x larger than the number of categories in ImageNet and 7x larger than the number of categories in ImageNet-21k. This mostly affects Hummingbird++, though, which uses labels; for the self-supervised Hummingbird, even features from other categories may be semantically near.
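To make the "calculate similarity and fuse" step discussed in this thread concrete, here is a hedged sketch in which each current-batch feature attends over the memory bank and is replaced by a similarity-weighted combination of memory entries. The function name, the cosine normalisation, and the temperature value are illustrative assumptions rather than the paper's exact formulation; the precise cross-attention details are in Section 3.2 of the paper.

```python
# Illustrative sketch (not the authors' code) of fusing current-batch
# features with the memory bank via similarity-weighted attention.
import torch
import torch.nn.functional as F


def fuse_with_memory(patch_features: torch.Tensor,
                     memory: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """patch_features: [batch, num_patches, dim]; memory: [mem_size, dim]."""
    # Cosine similarity between every patch feature and every memory entry.
    q = F.normalize(patch_features, dim=-1)               # [B, P, D]
    k = F.normalize(memory, dim=-1)                       # [M, D]
    sims = torch.einsum("bpd,md->bpm", q, k) / temperature
    weights = sims.softmax(dim=-1)                         # attention over memory
    # Fused representation: similarity-weighted average of memory embeddings.
    return torch.einsum("bpm,md->bpd", weights, memory)
```

Per the reply above, the memory itself is built only from target-network embeddings, while both networks pass through this contextual pretraining step.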