Closed MonolithFoundation closed 5 months ago
The gain is about +3pp on MMB, +100 on MME, +4pp on SEED, and +8pp on LLaVA-w, which I don't think is that small. That said, both the 7B and 13B Honeybee models use the same vision encoder (CLIP ViT-L/14), so the improvement in VL performance may look modest relative to the increase in LLM size. Compared with the 7B-to-13B gains of other MLLMs (e.g., LLaVA-1.5), this level of improvement is not small.
Why does the performance increase so little from 7B to 13B?