Impact of Including GPT-4V in LVLM Pool?

Thank you for engaging with our paper. We appreciate your thoughtful question and the opportunity to clarify the inclusion of the GPT-4V model in our study.

We include GPT-4V in our LVLM pool because it is a representative commercial LVLM that is readily accessible. As highlighted in the preliminary study on GPT-4V (refer to link), it is one of the most powerful LVLMs currently available. Its strong performance is also the basis for its role as the annotator in our ensemble.

Regarding the potential bias towards GPT-4V outcomes, particularly in the annotated ratings, we acknowledge that GPT-4V annotations may be unreliable or biased. To check this, we conducted a correlation analysis (refer to Paragraph 3 in Sec 2.4) comparing human annotators with GPT-4V. The analysis showed an average agreement rate of 83.1%, indicating substantial alignment between human and GPT-4V annotations.
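For readers who want to run a similar check on their own data, an agreement rate of this kind can be sketched as follows. This is a minimal sketch, not the evaluation code behind the 83.1% figure: it assumes each comparison is reduced to a single preferred label per annotator, and the function name `agreement_rate` and the toy labels are hypothetical.

```python
def agreement_rate(human_prefs, gpt4v_prefs):
    """Fraction of comparisons where both annotators chose the same response.

    Assumption: each list holds one preferred label (e.g. "A" or "B")
    per comparison, aligned by index.
    """
    if len(human_prefs) != len(gpt4v_prefs):
        raise ValueError("preference lists must have the same length")
    if not human_prefs:
        raise ValueError("need at least one comparison")
    matches = sum(h == g for h, g in zip(human_prefs, gpt4v_prefs))
    return matches / len(human_prefs)


# Toy labels for illustration only; these are not data from the paper.
human = ["A", "B", "A", "A", "B", "A"]
gpt4v = ["A", "B", "B", "A", "B", "A"]
rate = agreement_rate(human, gpt4v)
print(f"agreement: {rate:.1%}")
```

Averaging this rate over several human annotators or several rating aspects would give an average agreement rate comparable in spirit to the one reported above.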

Moreover, in our DPO experiments, we implemented a "GPT-4V always as the best" strategy, in which GPT-4V responses were always chosen as the 'best' response in DPO pairs. Notably, this simple heuristic outperformed the original backbone model significantly. This outcome suggests that biasing decisions towards GPT-4V does not guarantee a one-size-fits-all solution for performance improvement, emphasizing the nuanced nature of model ensemble dynamics.
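The heuristic described above can be sketched as a small pair-construction routine. This is a minimal sketch under stated assumptions, not the repository's actual code: the reply does not say how the rejected response is chosen, so the sketch pairs the GPT-4V response against every other model's response, and the function name, dictionary keys, and model names are hypothetical.

```python
from typing import Dict, List


def gpt4v_as_best_pairs(
    prompt: str,
    responses: Dict[str, str],
    best_model: str = "gpt-4v",  # hypothetical key for the GPT-4V response
) -> List[dict]:
    """Build DPO pairs in which the GPT-4V response is always 'chosen'.

    Assumption: every other model's response becomes a 'rejected' entry.
    Prompts without a GPT-4V response yield no pairs.
    """
    if best_model not in responses:
        return []
    chosen = responses[best_model]
    return [
        {"prompt": prompt, "chosen": chosen, "rejected": text}
        for model, text in responses.items()
        if model != best_model
    ]


# Toy example with made-up model names and responses.
pool = {
    "gpt-4v": "A cat sleeping on a red sofa.",
    "model-a": "A dog on a couch.",
    "model-b": "An animal indoors.",
}
pairs = gpt4v_as_best_pairs("Describe the image.", pool)
print(len(pairs))
```

Because the annotated ratings are ignored entirely, this baseline isolates the effect of always preferring GPT-4V, which is what makes it a useful comparison point for rating-based pair selection.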

We hope this provides clarity on the conceptual benefits of incorporating GPT-4V into our LVLM pool and how potential biases are addressed and validated in our study. If you have any further questions or require additional information, please feel free to ask.

