MischaQI opened this issue 1 year ago
Please check out the latest version on OpenReview and the NeurIPS paper.
Impact of Stage-1 Training.
A: Thanks for bringing up this question. We have added more comprehensive experiments in two additional tables and describe our findings in the paragraph "Impact of Stage-1 training" (Lines 565-596) of the updated Appendix Section C.2. Our suggestions on the necessity of Stage-1 training are summarized below:
(i) If LLaVA-Med is trained with a custom vision encoder or LLM that is not included in LLaVA (i.e., no LLaVA checkpoint is available), Stage-1 is critical for aligning the multimodal feature space and yields good performance.
(ii) If LLaVA-Med is initialized from LLaVA, Stage-1 training is optional. In this case it is more cost-efficient to skip Stage-1 and run Stage-2 only, which quickly provides good performance on the vertical domain. However, for scenarios with a large number of in-domain image-text pairs about which the pre-trained LLaVA has little related knowledge, we suggest adding Stage-1 training on the in-domain pairs; the best strategy in this case is full-model fine-tuning of the LLM and removing the instruction text that asks the model to describe the image. The text and figures in the main paper have also been revised accordingly.
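For illustration only, here is a minimal Python sketch of the decision logic described above. The function and `TrainPlan` fields are hypothetical and are not part of the LLaVA-Med codebase; they simply encode the three recommendations as branches.

```python
from dataclasses import dataclass

# Hypothetical sketch of the Stage-1 decision described above.
# None of these names come from the LLaVA-Med repository.

@dataclass
class TrainPlan:
    run_stage1: bool                  # feature-alignment pre-training on image-text pairs
    full_llm_finetune: bool           # unfreeze the LLM during Stage-1
    strip_caption_instruction: bool   # drop the "describe the image" instruction text
    note: str

def plan_training(has_llava_checkpoint: bool, many_in_domain_pairs: bool) -> TrainPlan:
    if not has_llava_checkpoint:
        # Custom vision encoder or LLM: Stage-1 is needed to align the
        # multimodal feature space before Stage-2 instruction tuning.
        return TrainPlan(True, False, False,
                         "Stage-1 + Stage-2 (no LLaVA init available)")
    if many_in_domain_pairs:
        # Initializing from LLaVA but with abundant in-domain pairs the base
        # model has not seen: add Stage-1 with full-model LLM fine-tuning and
        # remove the image-description instruction text.
        return TrainPlan(True, True, True,
                         "Stage-1 (full LLM fine-tune) + Stage-2")
    # Initializing from LLaVA with limited in-domain data:
    # skipping Stage-1 is the cheaper option and already performs well.
    return TrainPlan(False, False, False, "Stage-2 only (skip Stage-1)")

if __name__ == "__main__":
    print(plan_training(has_llava_checkpoint=True, many_in_domain_pairs=False))
```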
Congrats on your great work.
Table 1 shows that the performance of LLaVA drops after Stage-1 training, so I'd like to know how LLaVA-Med performs without Stage-1 training.
Looking forward to your reply. Thanks.