haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] LLaVA 1.5 sizes and vision encoder #1083

Open aldoz-mila opened 7 months ago

aldoz-mila commented 7 months ago

Question

Hello, I was trying to get a sense of the number of parameters in LLaVA 1.5. I understand that the LLM is Vicuna 1.5 (either 7B or 13B) and that the vision encoder is CLIP ViT-L/14 336px. Shouldn't the total number of parameters reflect the sum of both the LLM and the vision encoder (e.g. LLaVA 1.5 13B being the combination of Vicuna 13B + CLIP ViT-L/14 336px XB + the MLP projector)? Or do you only count the trainable parameters here (with the ViT kept frozen)? Thanks!
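For reference, a quick sanity check of the total parameter count from released weights, assuming the Hugging Face port `llava-hf/llava-1.5-13b-hf` and the `transformers` class `LlavaForConditionalGeneration` (both outside this repo), might look like:

```python
import torch
from transformers import LlavaForConditionalGeneration

# Load the full LLaVA-1.5 13B checkpoint: language model + vision tower + projector.
# (Hugging Face port, assumed here; the original repo ships its own loading code.)
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-13b-hf", torch_dtype=torch.float16
)

# Count every parameter, frozen and trainable alike.
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total / 1e9:.2f}B")
```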

NVigne-cloud commented 3 months ago

Hello, I asked myself the same question and found that the ViT-L/14 encoder has around 307M parameters, and the MLP projector should be much lighter. These parts of the architecture are therefore negligible compared to Vicuna.
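A minimal sketch to verify these numbers, assuming the Hugging Face checkpoint `openai/clip-vit-large-patch14-336` for the vision tower and a two-layer `mlp2x_gelu` projector with hidden sizes 1024 -> 5120 -> 5120 for the 13B model (dimensions taken from the published configs, not from this thread):

```python
from transformers import CLIPVisionModel

# Vision tower used by LLaVA-1.5: CLIP ViT-L/14 at 336px resolution.
vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
vit_params = sum(p.numel() for p in vision_tower.parameters())
print(f"CLIP ViT-L/14-336 vision tower: {vit_params / 1e6:.0f}M parameters")

# Rough size of the projector for the 13B model, assuming
# Linear(1024 -> 5120), GELU, Linear(5120 -> 5120) with biases.
proj_params = (1024 * 5120 + 5120) + (5120 * 5120 + 5120)
print(f"MLP projector (13B config, approx.): {proj_params / 1e6:.1f}M parameters")
```

Both components together come out at a few hundred million parameters, which is well under 5% of Vicuna-13B.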