haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Why is 'mm_vision_select_layer' == -2 in config? #1613

Open fmy7834 opened 3 months ago

fmy7834 commented 3 months ago

Question

In the training scripts, 'mm_vision_select_layer' is set to -2, which means the penultimate layer's output of the CLIP vision encoder is used as the image features. I wonder why the last layer's output is not used instead?

wnma3mz commented 2 months ago

https://arxiv.org/abs/2304.08485

The authors give an explanation in the Ablations section:

We hypothesize that this is because CLIP’s last layer features may focus more on global and abstract image properties compared to the layer before it, which can focus more on localized properties that are useful for understanding specific image details.
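For reference, selecting the penultimate layer can be sketched with HuggingFace `transformers`, whose `CLIPVisionModel` exposes all intermediate layers via `output_hidden_states=True`. This is a minimal illustration, not LLaVA's actual code path: the tiny config below is a stand-in for the real vision tower (LLaVA uses `openai/clip-vit-large-patch14-336`), and the config values are chosen only to keep the example small.

```python
import torch
from transformers import CLIPVisionConfig, CLIPVisionModel

# Tiny, randomly initialised CLIP vision tower (illustrative stand-in for
# the openai/clip-vit-large-patch14-336 tower LLaVA uses).
config = CLIPVisionConfig(
    hidden_size=32,
    intermediate_size=64,
    num_hidden_layers=4,
    num_attention_heads=4,
    image_size=32,
    patch_size=8,
)
model = CLIPVisionModel(config).eval()

pixel_values = torch.randn(1, 3, 32, 32)  # one dummy image
with torch.no_grad():
    out = model(pixel_values, output_hidden_states=True)

# hidden_states holds num_hidden_layers + 1 tensors:
# the patch embeddings plus the output of every transformer layer.
# Index -1 is the last layer; -2 (mm_vision_select_layer) is the
# penultimate layer discussed in this issue.
features = out.hidden_states[-2]

print(len(out.hidden_states))  # 4 layers + embeddings = 5
print(features.shape)          # (1, 1 + (32 // 8) ** 2, 32) = (1, 17, 32)
```

The sequence length is 1 CLS token plus one token per image patch; LLaVA's default `mm_vision_select_feature == 'patch'` additionally drops the CLS token before projecting the features into the language model.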

fmy7834 commented 2 months ago

Got it. Thank you!