microsoft / LLaVA-Med

Large Language-and-Vision Assistant for Biomedicine, built towards multimodal GPT-4 level capabilities.

The performance without stage 1 training? #8

Open MischaQI opened 11 months ago

MischaQI commented 11 months ago

Congrats on your great work.

Tab. 1 shows that the performance of LLaVA drops after stage 1 training. So I'd like to know what the performance of LLaVA-Med is without stage 1 training.

Looking forward to your reply. Thanks.

ChunyuanLI commented 11 months ago

Please check out the latest version on OpenReview and the NeurIPS paper.

Impact of Stage-1 Training.

A: Thanks for bringing up this question. We have added more comprehensive experiments in two additional tables and describe our findings in the paragraph "Impact of Stage-1 training" (Lines 565-596) of the updated Appendix Section C.2. Our suggestions on the necessity of Stage-1 training are summarized below:

(i) If LLaVA-Med is trained with a customized vision encoder or LLM that is not included in LLaVA (i.e., no LLaVA checkpoint is available), Stage-1 is critical for aligning the multimodal feature space and yields good performance.

(ii) If LLaVA-Med is trained by initializing from LLaVA, Stage-1 training is optional. In this case, it is more cost-efficient to skip Stage-1 and train Stage-2 only, which quickly provides good performance on vertical domains at lower cost. However, for scenarios with a large number of in-domain image-text pairs about which the pre-trained LLaVA has little related knowledge, we suggest keeping Stage-1 training on the in-domain pairs: the best strategy in this case is full-model fine-tuning of the LLM and removing the instruction text describing images (see the sketch below). The text and figures in the main paper have also been revised accordingly.
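To make the recommendation concrete, here is a minimal Python sketch of the stage-selection logic. All names (`plan_training`, the checkpoint labels, and the option keys) are hypothetical and only illustrate how the cases above map to a training plan; they are not part of the actual LLaVA-Med training scripts.

```python
# Illustrative sketch only: function and key names are hypothetical and do not
# correspond to the real LLaVA-Med training scripts.

def plan_training(has_llava_checkpoint: bool, large_in_domain_corpus: bool) -> dict:
    """Map the Stage-1 recommendations from Appendix C.2 to a training plan."""
    if not has_llava_checkpoint:
        # (i) Custom vision encoder or LLM with no LLaVA checkpoint:
        # Stage-1 alignment on biomedical image-text pairs is critical.
        return {
            "init_checkpoint": "custom_llm_plus_vision_encoder",  # hypothetical label
            "stages": ["stage1_feature_alignment", "stage2_instruction_tuning"],
        }
    if large_in_domain_corpus:
        # (ii) Initializing from LLaVA, but with many in-domain pairs the
        # pre-trained LLaVA has not seen: keep Stage-1 on the in-domain pairs,
        # fine-tune the full LLM, and drop the image-description instruction text.
        return {
            "init_checkpoint": "llava_checkpoint",  # hypothetical label
            "stages": ["stage1_feature_alignment", "stage2_instruction_tuning"],
            "stage1_options": {"tune_full_llm": True, "keep_instruction_text": False},
        }
    # (ii) Initializing from LLaVA with a modest in-domain corpus:
    # skipping Stage-1 and running Stage-2 only is the more cost-efficient choice.
    return {
        "init_checkpoint": "llava_checkpoint",
        "stages": ["stage2_instruction_tuning"],
    }


if __name__ == "__main__":
    print(plan_training(has_llava_checkpoint=True, large_in_domain_corpus=False))
```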
