lukemelas / deep-spectral-segmentation

[CVPR 2022] Deep Spectral Methods: A Surprisingly Strong Baseline for Unsupervised Semantic Segmentation and Localization

Details regarding baselines (Saliency-DINO-ViT-B and MaskContrast-DINO-ViT-B) #5

Closed · zadaianchuk closed this issue 2 years ago

zadaianchuk commented 2 years ago

Hi @lukemelas, fascinating work, thank you for your contribution!

While looking at the semantic segmentation results, I came up with several questions regarding the baselines used.

> Additionally, we give results for directly clustering DINO-pretrained features masked with Deep-USPS saliency maps

Can you explain how you obtained the features for clustering?

  1. How did you train Deep-USPS? Did you use a BasNet pretrained on Deep-USPS predictions (similar to MaskContrast)?
  2. Did you average the DINO features corresponding to the resized mask, or did you obtain [CLS] features from a crop corresponding to the mask?

> we also train a version of MaskContrast based on a DINO-pretrained model

  1. Were you training Deep-USPS yourselves, or using the BasNet model provided by MaskContrast (pretrained with Deep-USPS supervision)?
  2. Do you have an intuition for why the DINO-pretrained MaskContrast model is worse than the original MaskContrast one (31.2 vs. 35)?
lukemelas commented 2 years ago

Hi, thank you for your interest in our paper!

The two simple clustering baselines are included mainly to demonstrate that this problem is actually quite challenging, and is not trivially solved by clustering features. One of the baselines extracts features across the entire training set and then clusters these (with K-means), while the other baseline clusters features within each image, then clusters them again across the dataset.
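
For concreteness, here is a minimal sketch of what these two baselines could look like. The function names, the number of per-image segments, and the use of scikit-learn's K-means are illustrative assumptions, not the exact code from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_dataset_features(features: np.ndarray, n_classes: int = 21) -> np.ndarray:
    """Baseline 1: cluster feature vectors from the whole training set with K-means.

    features: (N, D) array, one row per feature vector. Returns a pseudo-label per row.
    """
    kmeans = KMeans(n_clusters=n_classes, n_init=10, random_state=0)
    return kmeans.fit_predict(features)


def cluster_two_stage(per_image_features, n_segments: int = 5, n_classes: int = 21) -> np.ndarray:
    """Baseline 2: cluster features within each image first, then cluster the
    resulting per-image centroids again across the dataset."""
    centroids = []
    for feats in per_image_features:  # feats: (P, D) patch features of one image
        k = min(n_segments, len(feats))
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(feats)
        centroids.append(km.cluster_centers_)
    return cluster_dataset_features(np.concatenate(centroids, axis=0), n_classes)
```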

Regarding your first set of questions: yes, we used the Deep-USPS predictions with the additional BasNet training, similar to MaskContrast. We averaged the DINO features corresponding to the resized mask -- we could also have tried using the [CLS] features from the bounding box containing the mask, as you suggest, but we have not tried this.
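
As a rough illustration of that masked-average step, here is a hedged sketch assuming you already have the ViT patch tokens for an image and a binary saliency mask at image resolution (all names and shapes are hypothetical, not the paper's code):

```python
import torch
import torch.nn.functional as F


def masked_average_feature(patch_tokens: torch.Tensor,
                           saliency: torch.Tensor,
                           grid_hw: tuple) -> torch.Tensor:
    """Average the DINO patch features that fall inside the (resized) saliency mask.

    patch_tokens: (H*W, D) patch features for one image.
    saliency: (H_img, W_img) binary mask at image resolution.
    grid_hw: (H, W) spatial size of the ViT patch grid.
    """
    h, w = grid_hw
    # Resize the image-resolution mask down to the patch grid.
    mask = F.interpolate(saliency[None, None].float(), size=(h, w),
                         mode="nearest").reshape(-1)  # (H*W,)
    if mask.sum() == 0:  # empty mask: fall back to a global average
        mask = torch.ones_like(mask)
    feats = patch_tokens.reshape(h * w, -1)
    return (feats * mask[:, None]).sum(0) / mask.sum()
```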

For the second set of questions, I trained a MaskContrast model using the exact code in their repository, but I added a few lines of code to swap out the MoCo-ResNet-50 backbone for DINO-pretrained backbones (both ResNet-50 and ViT). We ran this baseline to see the effect of DINO pretraining (and of the ViT backbone) on segmentation performance. It did not make a huge difference, and surprisingly the ViT backbone was actually significantly worse. This might be a consequence of not using one of the new dense-prediction ViT heads, like DPT, but we did not investigate along these lines.

I don't have much intuition about the performance of the DINO-pretrained MaskContrast model versus the MoCo-pretrained one. I will say that I found it difficult to reproduce the 35 number, even with their exact code, and there is a lot of variance between runs (for all of these methods, ours included). The 31.2 number is the average of 5 random seeds that I ran, whereas the 35 number is from their paper.
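
For reference, the backbone swap itself is small. Here is a hedged sketch of the kind of change, using the publicly available DINO checkpoints on torch.hub; this is illustrative, not the exact modification made to the MaskContrast code:

```python
import torch


def load_dino_backbone(arch: str = "resnet50") -> torch.nn.Module:
    """Load a DINO-pretrained backbone to stand in for the MoCo ResNet-50."""
    hub_names = {"resnet50": "dino_resnet50", "vit_small": "dino_vits16"}
    return torch.hub.load("facebookresearch/dino:main", hub_names[arch])


# Hypothetical usage inside a MaskContrast-style model: replace the MoCo
# backbone with the DINO one before training the segmentation head.
# model.backbone = load_dino_backbone("resnet50")
```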

I hope this helps! Luke

zadaianchuk commented 2 years ago

Hi @lukemelas, thanks a lot for your detailed answer; it helps me understand and reproduce the baselines better and gives a more concrete interpretation of the results in Table 4. Once more, thank you for your contribution and the quick response.

Indeed, using a ViT for dense prediction is not a trivial task. Let's see if someone is able to get a ViT with additional contrastive learning on top to work.