Closed: zadaianchuk closed this issue 2 years ago.

Hi @lukemelas, fascinating work, thank you for your contribution!
While looking at the semantic segmentation results, I have several questions about the baselines used.
Can you explain how you obtained the features for clustering?
Hi, thank you for your interest in our paper!
The two simple clustering baselines are included mainly to demonstrate that this problem is actually quite challenging and is not trivially solved by clustering features. One baseline extracts features across the entire training set and clusters them directly (with K-means), while the other first clusters features within each image and then clusters the resulting within-image clusters again across the dataset.
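For concreteness, here is a minimal sketch of what these two baselines could look like. The function names, the use of scikit-learn's `KMeans`, the per-image cluster count, and representing each within-image cluster by its centroid are my illustrative assumptions, not the paper's actual code:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_dataset_wide(features: np.ndarray, n_classes: int) -> np.ndarray:
    """Baseline 1: K-means directly over all (N, D) features from the training set."""
    return KMeans(n_clusters=n_classes, n_init=10).fit_predict(features)

def cluster_two_stage(per_image_features: list, n_classes: int, k_per_image: int = 8) -> np.ndarray:
    """Baseline 2: cluster within each image first, then cluster the
    per-image cluster centroids across the whole dataset."""
    centroids = []
    for feats in per_image_features:  # feats: (n_i, D) features for one image
        k = min(k_per_image, len(feats))
        km = KMeans(n_clusters=k, n_init=10).fit(feats)
        centroids.append(km.cluster_centers_)
    centroids = np.concatenate(centroids, axis=0)
    return KMeans(n_clusters=n_classes, n_init=10).fit_predict(centroids)
```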
Regarding your first set of questions: yes, we used the Deep-USPS predictions with the additional BasNet training, similar to MaskContrast. We averaged the DINO features corresponding to the resized mask. We could also have tried using the [CLS] feature from the bounding box containing the mask, as you suggested, but we have not tried this.
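As a rough illustration of the mask-averaging step (not the authors' exact code; the choice of ViT-S/16, the `get_intermediate_layers` call from the official DINO repository, and the nearest-neighbor mask resizing are my assumptions):

```python
import torch
import torch.nn.functional as F

# DINO ViT-S/16 from the official repository's torch.hub entrypoint.
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
model.eval()

@torch.no_grad()
def mask_feature(image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """image: (1, 3, H, W) normalized tensor; mask: (H, W) binary tensor.
    Returns a single (D,) feature vector averaged over the masked region."""
    tokens = model.get_intermediate_layers(image, n=1)[0]  # (1, 1 + P, D)
    patch_tokens = tokens[:, 1:, :]                        # drop the [CLS] token
    h, w = image.shape[2] // 16, image.shape[3] // 16      # patch grid for ViT-S/16
    # Resize the mask to the patch grid and flatten it to (1, P).
    m = F.interpolate(mask[None, None].float(), size=(h, w), mode='nearest')
    m = m.flatten(1)
    # Average the patch features that fall inside the mask.
    return (patch_tokens * m[..., None]).sum(1).squeeze(0) / m.sum().clamp(min=1)
```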
For the second set of questions: I trained a MaskContrast model using the exact code in their repository, but added a few lines of code to swap out the MoCo-ResNet-50 backbone for DINO-pretrained backbones (both ResNet-50 and ViT). We ran this baseline to see the effect of DINO pretraining (and of the ViT backbone) on segmentation performance. It did not make a huge difference, and surprisingly the ViT backbone was actually significantly worse. This might be a consequence of not using one of the new dense-prediction ViT heads, such as DPT, but we did not investigate along these lines.

I don't have much intuition about the performance of the DINO-pretrained MaskContrast model versus the MoCo-pretrained one. I will say that I found it difficult to reproduce the 35 number, even with their exact code, and that there is a lot of variance between runs (of all of these methods, ours included). The 31.2 number is the average over 5 random seeds that I ran, whereas the 35 number is from their paper.
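In case it helps with reproduction: both DINO-pretrained backbones are available through torch.hub. A minimal sketch of loading them follows; how they get wired into MaskContrast's training code is specific to that repository and not shown here:

```python
import torch

# DINO-pretrained ResNet-50: the hub entrypoint returns a torchvision
# ResNet-50 with the classification head replaced by an identity, so it can
# stand in for the MoCo-pretrained ResNet-50 backbone.
resnet_backbone = torch.hub.load('facebookresearch/dino:main', 'dino_resnet50')

# DINO-pretrained ViT-S/16, for the ViT variant of the experiment.
vit_backbone = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
```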
I hope this helps! Luke
Hi @lukemelas, thanks a lot for your detailed answer! It helps me understand and reproduce the baselines better, and gives a more concrete interpretation of the results in Table 4. Once more, thank you for your contribution and the quick response.
Indeed, using a ViT for dense prediction is not a trivial task. Let's see whether someone manages to get a ViT backbone with additional contrastive learning on top to work.