isl-org / lang-seg

Language-Driven Semantic Segmentation
MIT License

Reason on bad results of CLIP-based initialization of image encoder #16

Closed: soskek closed this issue 2 years ago

soskek commented 2 years ago

This is a question about an interesting observation in the paper. The paper reports:

We also evaluated on a model initialized with the CLIP image encoder with the same setup and hyperparameters, but observed worse performance than using the ViT initialization.

It seems surprising that the CLIP image encoder, which is already well aligned with the text encoder, is not helpful for the task. Do the authors have any guesses about the reason? And was the performance much worse, or only slightly worse?

Boyiliee commented 2 years ago

Hi @soskek ,

Thanks for your interest in LSeg!

Happy to share some thoughts here. The performance was only a bit worse; I didn't try many configurations or tune the hyperparameters, so it might improve with tuning. To some extent, I am not surprised by this result. CLIP primarily targets image classification, whereas in LSeg, as mentioned in the paper, we use only the pre-trained text encoder and keep it fixed during training; we train only the visual encoder, to obtain better localization ability. Segmentation requires pixel-level prediction and localization, which is quite different from what CLIP is trained to do. Of course, our finding is limited, and we would be happy to see more results on this.
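For readers curious what "fix the text encoder, train the visual encoder" amounts to at inference time, here is a minimal numpy sketch of the core scoring step: each pixel's embedding from the (trained) visual encoder is compared against the frozen per-class text embeddings, and the pixel is assigned the best-matching class. The shapes and function name here are illustrative assumptions, not the LSeg codebase; the real model uses a DPT visual encoder and CLIP's transformer text encoder.

```python
import numpy as np

def pixelwise_class_scores(pixel_emb, text_emb):
    """Assign each pixel to its best-matching class label.

    pixel_emb: (H, W, D) visual-encoder output, one D-dim vector per pixel
    text_emb:  (K, D) frozen text-encoder outputs, one per class label
    returns:   (H, W) integer array of best-matching class indices
    """
    # L2-normalize both sides so the dot product is cosine similarity
    pixel_emb = pixel_emb / np.linalg.norm(pixel_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    # (H, W, K) similarity map, then argmax over the K classes
    scores = np.einsum('hwd,kd->hwk', pixel_emb, text_emb)
    return scores.argmax(axis=-1)

# Toy example: a 2x2 "image" with 4-dim embeddings and 3 class labels
rng = np.random.default_rng(0)
pixels = rng.normal(size=(2, 2, 4))
labels = rng.normal(size=(3, 4))
print(pixelwise_class_scores(pixels, labels).shape)  # (2, 2)
```

During LSeg training, only the visual encoder's weights receive gradients from this similarity map; the text embeddings stay fixed, which is the design choice discussed above.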

Hope this helps.

Best, Boyi

soskek commented 2 years ago

Thank you for your fast reply! That makes sense, and I agree that CLIP's classification-oriented training could be harmful for pixel-level tasks.

Your careful comments are very helpful. Thank you again!

Best, Sosuke

Boyiliee commented 2 years ago

Happy to hear that!

Good luck with your research!