mhamilton723 / STEGO

Unsupervised Semantic Segmentation by Distilling Feature Correspondences
MIT License

About random images correspondence #48

Closed tanveer6715 closed 1 year ago

tanveer6715 commented 2 years ago

Hi,

I am confused about the random image correspondence. The paper states: "STEGO uses three instantiations of the correspondence loss of Equation 4 to train a segmentation head to distill feature relationships between an image and itself, its K-Nearest Neighbors (KNNs), and random other images. The self and KNN correspondence losses primarily provide positive, attractive, signal and random image pairs tend to provide negative, repulsive, signal." What does a random image pair mean, given that any dataset always contains images sharing the same features, pixels, or classes? Yet the random image shown in the STEGO architecture figure looks very different. If anyone understands this mechanism, please share your thoughts.

npielawski commented 2 years ago

Let's say you have 27 classes and your dataset is relatively balanced, that is p(C) = 1/27. If you pick two random images, they will likely not have the same class: the probability that they match is 27/27² = 1/27 ≈ 3.7%. That means that more often than not, the images will really not match.

In the case of STEGO, there will still be some matching, but not as much as with the positive examples, so the neural network learns to discriminate between positive and negative features.
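The probability argument above can be checked with a quick simulation (a minimal sketch; the function name and the balanced-classes assumption are illustrative, not from the STEGO code):

```python
import random

def match_probability(n_classes, trials=100_000, seed=0):
    """Estimate the chance that two randomly drawn images share a class,
    assuming a balanced dataset where each image shows exactly one class."""
    rng = random.Random(seed)
    hits = sum(
        rng.randrange(n_classes) == rng.randrange(n_classes)
        for _ in range(trials)
    )
    return hits / trials

analytic = 1 / 27  # 27 matching pairs out of 27**2 equally likely pairs
estimate = match_probability(27)
```

With 27 balanced classes both numbers come out near 0.037, so roughly 96% of random pairs act as negatives.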

tanveer6715 commented 2 years ago

> Let's say you have 27 classes and your dataset is relatively balanced, that is p(C) = 1/27. If you pick two random images, they will likely not have the same class: the probability that they match is 27/27² = 1/27 ≈ 3.7%. That means that more often than not, the images will really not match.
>
> In the case of STEGO, there will still be some matching, but not as much as with the positive examples, so the neural network learns to discriminate between positive and negative features.

Thank you for the clarification. But if a dataset contains only 2 classes and every training image contains both of them, how will the model pick random images from the dataset during training?

npielawski commented 2 years ago

The class analogy from my previous comment is not entirely accurate for STEGO. STEGO uses a backbone to project the images into a latent space, so imagine that all the images in your dataset are projected into a 2D feature space (in practice it is more like 100-1000 dimensions). Images that are semantically close will cluster in this space. Note that this happens before training STEGO or anything else, since the ViT backbone is already trained; it is all done in the precompute_knn.py file.

Now with that, we pick a random image: that's our anchor. We need positive examples, and those will be some of its k=7 nearest neighbours, i.e. the closest images in the feature space (so semantically close). The chance of randomly picking a negative image that is close to the anchor is k/N, with N the number of examples in the dataset, and usually N >> k, especially with augmentations.
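The neighbour precomputation can be sketched along these lines (a hypothetical sketch with made-up names; the real precompute_knn.py works on the ViT backbone's features and caches the neighbour indices, but the idea is the same):

```python
import numpy as np

def knn_pairs(features, k=7, seed=0):
    """Given backbone features of shape (N, D), return for each image the
    indices of its k nearest neighbours (positive pairs) and one random
    other image (negative pair). Illustrative sketch, not STEGO's code."""
    rng = np.random.default_rng(seed)
    # Cosine similarity between all pairs of L2-normalised feature vectors
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)          # an image is not its own neighbour
    knns = np.argsort(-sim, axis=1)[:, :k]  # top-k most similar images
    negatives = rng.integers(0, len(features), size=len(features))
    return knns, negatives

# With N images, a random negative lands among an anchor's k neighbours
# with probability only k/N (e.g. 7/1000 = 0.7%).
feats = np.random.default_rng(1).normal(size=(100, 64))
knns, negs = knn_pairs(feats, k=7)
```

Because the negatives are drawn uniformly, almost all of them fall outside the anchor's neighbourhood, which is what gives the random-image loss its repulsive character.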

mhamilton723 commented 1 year ago

Thank you for the eloquent description here @npielawski! You are spot on