Hi @soskek ,
Thanks for your interest in LSeg!
Happy to share some thoughts here. The performance was a bit worse, but I didn't run many experiments or tune the hyper-parameters, so it might improve with tuning. To some extent, I am not surprised by this result. CLIP primarily focuses on image classification. In LSeg, as mentioned in the paper, we take only the pre-trained text encoder and keep it fixed during training; we train only the visual encoder, for better localization ability. Segmentation is primarily about pixel-level prediction and localization, which is quite different from what CLIP aims to do. Of course, our findings are limited, and we would be happy to see more findings on this.
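To make the image-level vs. pixel-level distinction concrete, here is a minimal toy sketch. It is not the LSeg or CLIP code: the 2-D embeddings, the averaging used as a stand-in for an image-level embedding, and the helper names `classify_image` and `segment_image` are all made up for illustration.

```python
# Toy illustration only: embeddings and helper names are hypothetical,
# not taken from LSeg or CLIP.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def best_label(embedding, text_embeddings):
    """Return the label whose text embedding has the largest dot product."""
    return max(text_embeddings, key=lambda label: dot(embedding, text_embeddings[label]))

def classify_image(pixel_embeddings, text_embeddings):
    """Image-level prediction: pool everything into one vector, pick one label."""
    pixels = [p for row in pixel_embeddings for p in row]
    dim = len(pixels[0])
    pooled = [sum(p[i] for p in pixels) / len(pixels) for i in range(dim)]
    return best_label(pooled, text_embeddings)

def segment_image(pixel_embeddings, text_embeddings):
    """Pixel-level prediction: pick a label independently for every pixel."""
    return [[best_label(p, text_embeddings) for p in row] for row in pixel_embeddings]

# Assumed toy data: two label embeddings and a 2x2 grid of pixel embeddings.
text_embeddings = {"dog": [1.0, 0.0], "grass": [0.0, 1.0]}
pixel_embeddings = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.1, 0.9], [0.0, 1.0]],
]

image_label = classify_image(pixel_embeddings, text_embeddings)
label_map = segment_image(pixel_embeddings, text_embeddings)
print(image_label)
print(label_map)
```

In this toy case the pooled vector gives a single label for the whole image and loses where the "dog" pixel is, while the per-pixel map keeps it. That is the kind of localization a segmentation model needs and an image-level objective does not directly train for.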
Hope this helps.
Best, Boyi
Thank you for your fast reply! I understand it well and totally agree that CLIP's classification ability could be harmful to pixel-level tasks.
Your careful comments are very helpful. Thank you again!
Best, Sosuke
Happy to hear that!
Good luck with your research!
This is a question on an interesting report in the paper. The paper reported
It seems surprising that the CLIP image encoder, which is already well aligned with the text encoder, is not helpful for the task. Do the authors have any guesses about the reason? Also, was the performance much worse or only a little worse?