Hello, I was trying to get a sense of the number of parameters in LLaVA 1.5. I understand that the LLM is Vicuna 1.5 (either 7B or 13B) and that the vision encoder is CLIP ViT-L/14 336px. Shouldn't the total parameter count be the sum of both the LLM and the vision encoder (e.g. LLaVA 1.5 13B = Vicuna 13B + CLIP ViT-L/14 336px XB + the MLP projector)? Or does the name only count the trainable parameters (with the ViT kept frozen)? Thanks!
Hello, I asked myself the same question and found that ViT-L/14 weighs around 307M parameters, and the MLP projector should be much lighter still. These parts of the architecture are therefore negligible compared to Vicuna.
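To put rough numbers on that, here is a back-of-the-envelope sketch. The hidden sizes (1024 for CLIP ViT-L, 5120 for Vicuna 13B) and the two-linear-layer projector shape are assumptions based on the commonly reported LLaVA 1.5 architecture, and the counts are approximate:

```python
# Approximate parameter budget for LLaVA 1.5 13B (rough public figures).
vicuna_13b = 13_000_000_000          # Vicuna 13B language model (~13B)
clip_vit_l_336 = 304_000_000         # CLIP ViT-L/14 336px vision tower (~0.3B)

# Assumed projector: 2-layer MLP mapping the vision hidden size (1024)
# to the LLM hidden size (5120), each linear layer with a bias term.
vision_dim, llm_dim = 1024, 5120
mlp_projector = (vision_dim * llm_dim + llm_dim) + (llm_dim * llm_dim + llm_dim)

total = vicuna_13b + clip_vit_l_336 + mlp_projector
extra_fraction = (clip_vit_l_336 + mlp_projector) / total

print(f"projector ≈ {mlp_projector / 1e6:.1f}M params")
print(f"total     ≈ {total / 1e9:.2f}B params")
print(f"non-LLM share ≈ {extra_fraction:.1%}")
```

Under these assumptions the projector is only ~30M parameters, and the vision tower plus projector together are under 3% of the total, so quoting "13B" for the whole model is a reasonable rounding either way.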