fmy7834 opened 3 months ago

**Question**

In the training scripts, `mm_vision_select_layer` is set to -2, which means the penultimate layer's output of the CLIP vision encoder is used as the image features. I wonder why the last layer's output is not used instead?

---

I found that the authors give an explanation in the Ablations section of the paper (https://arxiv.org/abs/2304.08485):

> We hypothesize that this is because CLIP's last layer features may focus more on global and abstract image properties compared to the layer before it, which can focus more on localized properties that are useful for understanding specific image details.

Got it. Thank you!
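For anyone confused by the indexing: a toy sketch of how a setting of -2 maps to the penultimate layer (this is an illustration only; the 24-layer count assumes CLIP ViT-L/14, and `hidden_states` here is a plain list of labels standing in for the per-layer feature tensors):

```python
# Assumed: 24 transformer layers, as in CLIP ViT-L/14.
num_layers = 24

# By the usual convention, index 0 holds the embedding output and
# index i holds the output of transformer layer i, giving
# num_layers + 1 entries in total.
hidden_states = ["embeddings"] + [f"layer_{i}_output" for i in range(1, num_layers + 1)]

mm_vision_select_layer = -2  # value used in the training scripts
image_features = hidden_states[mm_vision_select_layer]
print(image_features)  # layer_23_output: the penultimate layer
```

So -2 skips the final (24th) layer and takes the output of layer 23, which per the quoted ablation retains more localized detail than the last layer's more global, abstract features.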