microsoft / LLaVA-Med

Large Language-and-Vision Assistant for Biomedicine, built towards multimodal GPT-4 level capabilities.

The performance without stage 1 training? #8

Open MischaQI opened 11 months ago

MischaQI commented 11 months ago

Congrats on your great work.

Tab. 1 shows that the performance of LLaVA drops after stage 1 training. So I'd like to know what the performance of LLaVA-Med is without stage 1 training.

Looking forward to your reply. Thanks.

ChunyuanLI commented 11 months ago

Please check out the latest version on OpenReview and the NeurIPS paper.

Impact of Stage-1 Training.

A: Thanks for bringing up this question. We have added more comprehensive experiments in two additional tables and describe our findings in the paragraph "Impact of Stage-1 training" (Lines 565-596) of the updated Appendix Section C.2. Our suggestions on the necessity of Stage-1 training are summarized below:

(i) If LLaVA-Med is trained with a customized vision encoder or LLM that is not included in LLaVA (i.e., no LLaVA checkpoint is available), Stage-1 is critical for aligning the multimodal feature space and yields good performance.

(ii) If LLaVA-Med is trained by initializing from LLaVA, Stage-1 training is optional. In this case, it is more cost-efficient to skip Stage-1 and train Stage-2 only, which quickly provides good performance on vertical domains at lower cost. However, for scenarios with a large number of in-domain image-text pairs about which the pre-trained LLaVA has little related knowledge, we suggest keeping Stage-1 training on the in-domain pairs: the best strategy in this case is full-model fine-tuning of the LLM and removing the instruction text describing images (see the sketch below). The text and figures in the main paper have also been revised accordingly.
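To make the recommendation concrete, here is a minimal Python sketch of the stage-selection logic. All names (`plan_training`, the checkpoint labels, and the option keys) are hypothetical and only illustrate how the cases above map to a training plan; they are not part of the actual LLaVA-Med training scripts.

```python
# Illustrative sketch only: function and key names are hypothetical and do not
# correspond to the real LLaVA-Med training scripts.

def plan_training(has_llava_checkpoint: bool, large_in_domain_corpus: bool) -> dict:
    """Map the Stage-1 recommendations from Appendix C.2 to a training plan."""
    if not has_llava_checkpoint:
        # (i) Custom vision encoder or LLM with no LLaVA checkpoint:
        # Stage-1 alignment on biomedical image-text pairs is critical.
        return {
            "init_checkpoint": "custom_llm_plus_vision_encoder",  # hypothetical label
            "stages": ["stage1_feature_alignment", "stage2_instruction_tuning"],
        }
    if large_in_domain_corpus:
        # (ii) Initializing from LLaVA, but with many in-domain pairs the
        # pre-trained LLaVA has not seen: keep Stage-1 on the in-domain pairs,
        # fine-tune the full LLM, and drop the image-description instruction text.
        return {
            "init_checkpoint": "llava_checkpoint",  # hypothetical label
            "stages": ["stage1_feature_alignment", "stage2_instruction_tuning"],
            "stage1_options": {"tune_full_llm": True, "keep_instruction_text": False},
        }
    # (ii) Initializing from LLaVA with a modest in-domain corpus:
    # skipping Stage-1 and running Stage-2 only is the more cost-efficient choice.
    return {
        "init_checkpoint": "llava_checkpoint",
        "stages": ["stage2_instruction_tuning"],
    }


if __name__ == "__main__":
    print(plan_training(has_llava_checkpoint=True, large_in_domain_corpus=False))
```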
