New paper: STLLaVA-Med: Self-Training Large Language and Vision Assistant for

Paper: STLLaVA-Med: Self-Training Large Language and Vision Assistant for

Authors: Guohao Sun, Can Qin, Huazhu Fu, Linwei Wang, Zhiqiang Tao

Abstract: Large Vision-Language Models (LVLMs) have shown significant potential in assisting medical diagnosis by leveraging extensive biomedical datasets. However, the advancement of medical image understanding and reasoning critically depends on building high-quality visual instruction data, which is costly and labor-intensive to obtain, particularly in the medical domain. To mitigate this data-starving issue, we introduce Self-Training Large Language and Vision Assistant for Medical (STLLaVA-Med). The proposed method is designed to train a policy model (an LVLM) capable of auto-generating medical visual instruction data to improve data efficiency, guided through Direct Preference Optimization (DPO). Specifically, a more powerful and larger LVLM (e.g., GPT-4o) is involved as a biomedical expert to oversee the DPO fine-tuning process on the auto-generated data, encouraging the policy model to align efficiently with human preferences. We validate the efficacy and data efficiency of STLLaVA-Med across three major medical Visual Question Answering (VQA) benchmarks, demonstrating competitive zero-shot performance with the utilization of only 9% of the medical data.

Link: https://arxiv.org/abs/2406.19973

Reasoning: Let's think step by step to determine whether the paper is about a language model. We start by examining the title and abstract for any mention of language models. The title mentions "Self-Training Large Language and Vision Assistant," which suggests the involvement of a language model. The abstract elaborates on the use of Large Vision-Language Models (LVLMs) and mentions GPT-4o, a well-known language model. The focus is on training a policy model, itself an LVLM with a language model component, to generate medical visual instruction data. Therefore, the paper is about a language model.
