Open · mhamilton723 opened this issue 2 years ago
I have the same confusion as you. It seems this work avoids the confusion with the unsupervised learning setup by framing itself as zero-shot adaptation, yet the experiments compare against unsupervised segmentation methods. If only CLIP is used, without any labels, the setting may be close to unsupervised; but if a label for each image is provided, it should be a weakly-supervised setting. Besides, DenseCLIP is trained with pixel-level annotations, so it could be a zero-shot task, but not an unsupervised one. I wonder: if we only use a CLIP model (with neither image-level nor pixel-level labels), how should we define the task? Comparing it against both unsupervised and weakly-supervised methods seems unfair.
Hi both,
Thank you for your input.
First, to clarify a couple of points raised in @hq-deng's comment:
@mhamilton723, we did not consider our work to be weakly supervised because we do not train for segmentation on images with class labels (in the same way that PiCIE does not describe itself as weakly supervised). On the other hand, we recognise that supervision lies on a spectrum from zero supervision up to full supervision, and that by using CLIP we are not at the "zero" end of that spectrum. As such, a precise name for the setting would be useful to avoid confusion.
In response to your question, we think "Unsupervised Semantic Segmentation with Language-Image Pre-training" could be a better fit for the task setting considered by ReCo (and the DenseCLIP baseline in our paper). If this name seems appropriate to you both (feedback is very welcome; we would like to get the name right), we will create a branch for the task on Papers with Code.
Gyungin
Hello @NoelShin ,
Thanks for your comprehensive answer. This is a novel and interesting framing of segmentation. Although the setting is difficult to define, you are exploring it head-on. Congratulations on your groundbreaking work.
Hey @NoelShin, thanks for the detailed reply. I think it might be a good idea to split out a separate leaderboard for methods that use supervised pre-training, as you suggested. In some sense, text labels provide even more supervision than class labels or tags, which is why I originally suggested the weakly-supervised categories. Thanks for being flexible and understanding on this topic :)
Hello, congrats on the release of your fantastic work. I love that you can use language to prompt the segmentation, and I appreciate you citing and comparing against STEGO!
I wanted to quickly reach out regarding how you would like to collectively manage the Papers with Code section on unsupervised segmentation. Because CLIP is trained on image-language pairs, and you use it to generate the attention maps, I think this might fall under weakly supervised methods such as either of these:
https://paperswithcode.com/task/weakly-supervised-object-localization
https://paperswithcode.com/task/weakly-supervised-semantic-segmentation
Let me know what you think about this proposal; I'm happy to discuss it further. Congrats again on making your work public!
Best, Mark