LLaVA-Med: Large Language and Vision Assistant for Biomedicine

Visual instruction tuning towards building large language and vision models with GPT-4 level capabilities in the biomedicine space.

[Paper, NeurIPS 2023 Datasets and Benchmarks Track (Spotlight)]

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

Chunyuan Li*, Cliff Wong*, Sheng Zhang*, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, Jianfeng Gao (*Equal Contribution)

*Generated by GLIGEN using the grounded inpainting mode, with three boxes: ``white doctor coat``, ``stethoscope``, ``white doctor hat with a red cross sign``.*

Release

[May 13, 2024] 🔥LLaVA-Med v1.5 is out! It is not only significantly better (see the evaluation results) but also much easier to use: no more delta weights! Now you can directly load our model from the 🤗 Hub. The original LLaVA-Med (i.e., v1.0.0) codebase has been moved to Archive.
[Nov 8, 2023] LLaVA-Med is open-sourced under the MSR release policy. Huge thanks to the commitment of the team and the patience of the community.
[Sept, 2023] LLaVA-Med was accepted to the NeurIPS 2023 Datasets and Benchmarks Track as a spotlight presentation.
[June 1, 2023] 🔥 We released LLaVA-Med: Large Language and Vision Assistant for Biomedicine, a step towards building biomedical domain large language and vision models with GPT-4 level capabilities. Check out the paper.

*LLaVA-Med was initialized with the general-domain LLaVA and then continuously trained in a curriculum learning fashion (first biomedical concept alignment then full-blown instruction-tuning). We evaluated LLaVA-Med on standard visual conversation and question answering tasks.*

Usage and License Notices: The data, code, and model checkpoints are intended and licensed for research use only. They are also subject to additional restrictions dictated by the Terms of Use: LLaMA, Vicuna and GPT-4 respectively. The data is made available under CC BY NC 4.0. The data, code, and model checkpoints may be used for non-commercial purposes and any models trained using the dataset should be used only for research purposes. It is expressly prohibited for models trained on this data to be used in clinical care or for any clinical decision making purposes.

Contents

Install
Model Download
Serving
Evaluation
Data Download
Archive
Model Description

Install

Clone this repository and navigate to the LLaVA-Med folder

git clone https://github.com/microsoft/LLaVA-Med.git
cd LLaVA-Med

Install Package: Create conda environment

conda create -n llava-med python=3.10 -y
conda activate llava-med
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

Model Download

Model Descriptions	🤗 Huggingface Hub
LLaVA-Med v1.5	microsoft/llava-med-v1.5-mistral-7b
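
The worker commands in the Serving section take the Hub ID directly as --model-path, so a separate download step is usually not needed. If you would rather fetch the weights ahead of time, the following is a minimal sketch. It assumes the huggingface_hub package is installed, which this README does not list as a requirement, and download_checkpoint is a hypothetical helper name, not part of this repository.

```python
from typing import Optional

# Hub ID taken from the table above.
MODEL_ID = "microsoft/llava-med-v1.5-mistral-7b"


def download_checkpoint(repo_id: str = MODEL_ID, local_dir: Optional[str] = None) -> str:
    """Download the model repository and return the local path.

    Hypothetical helper: with local_dir=None the files go to the default
    Hugging Face cache directory.
    """
    # Imported lazily so this module can be imported without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling download_checkpoint() once before launching the model worker should let the worker start without waiting on the download.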

Serving

Web UI

Launch a controller

python -m llava.serve.controller --host 0.0.0.0 --port 10000

Launch a model worker

python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path microsoft/llava-med-v1.5-mistral-7b --multi-modal

Wait until the process finishes loading the model and you see "Uvicorn running on ...".

Launch a model worker (Multiple GPUs, when GPU VRAM <= 24GB)

If the VRAM of your GPU is 24GB or less (e.g., RTX 3090, RTX 4090), you may try running it with multiple GPUs.

python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path microsoft/llava-med-v1.5-mistral-7b --multi-modal --num-gpus 2

Wait until the process finishes loading the model and you see "Uvicorn running on ...".

Send a test message

python -m llava.serve.test_message --model-name llava-med-v1.5-mistral-7b --controller http://localhost:10000

Launch a gradio web server.

python -m llava.serve.gradio_web_server --controller http://localhost:10000

You can open your browser and chat with a model now.
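
For scripted setups, the three long-running commands above can be assembled in one place. The sketch below only builds the argument lists from the flags shown in this section; the function names are hypothetical, and starting the processes and waiting for each to come up is left to the caller.

```python
import sys
from typing import List

# Addresses and model path copied from the commands in this section.
CONTROLLER_URL = "http://localhost:10000"
WORKER_URL = "http://localhost:40000"
MODEL_PATH = "microsoft/llava-med-v1.5-mistral-7b"


def controller_cmd() -> List[str]:
    """Command line for the controller."""
    return [sys.executable, "-m", "llava.serve.controller",
            "--host", "0.0.0.0", "--port", "10000"]


def worker_cmd(num_gpus: int = 1) -> List[str]:
    """Command line for a model worker; adds --num-gpus only when > 1."""
    cmd = [sys.executable, "-m", "llava.serve.model_worker",
           "--host", "0.0.0.0",
           "--controller", CONTROLLER_URL,
           "--port", "40000",
           "--worker", WORKER_URL,
           "--model-path", MODEL_PATH,
           "--multi-modal"]
    if num_gpus > 1:
        cmd += ["--num-gpus", str(num_gpus)]
    return cmd


def web_server_cmd() -> List[str]:
    """Command line for the gradio web server."""
    return [sys.executable, "-m", "llava.serve.gradio_web_server",
            "--controller", CONTROLLER_URL]
```

Each list can be passed to subprocess.Popen in the same order as the steps above: controller, then worker (wait for "Uvicorn running on ..."), then web server.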

Evaluation

Medical Visual Chat (GPT-assisted Evaluation)

Our GPT-assisted evaluation pipeline for multimodal modeling is provided for a comprehensive understanding of the capabilities of vision-language models. Please see our paper for more details.

1. Azure OpenAI Connection Info.

Open llava/eval/llm.py and insert your Azure OpenAI endpoint and API key:

openai_cxn_dict = {
    'default': {
        'endpoint': "INSERT YOUR AZURE OPENAI ENDPOINT HERE",
        'api_key': "INSERT YOUR AZURE OPENAI API KEY HERE",
    },
}

GPT-4 inference was tested only with the Azure OpenAI API. If you are using the OpenAI API, replace AsyncAzureOpenAI with AsyncOpenAI in llava/eval/llm.py (line 55).
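
If you prefer not to hard-code credentials in the source file, one option is to fill the same dictionary from environment variables. This is a sketch only: the variable names AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY and the helper build_cxn_dict are assumptions made for illustration, not something llava/eval/llm.py is known to read.

```python
import os
from typing import Dict, Mapping

# Assumed environment variable names; adjust to your own setup.
ENDPOINT_VAR = "AZURE_OPENAI_ENDPOINT"
API_KEY_VAR = "AZURE_OPENAI_API_KEY"


def build_cxn_dict(env: Mapping[str, str] = os.environ) -> Dict[str, Dict[str, str]]:
    """Return a dict shaped like openai_cxn_dict, filled from the environment."""
    missing = [name for name in (ENDPOINT_VAR, API_KEY_VAR) if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "default": {
            "endpoint": env[ENDPOINT_VAR],
            "api_key": env[API_KEY_VAR],
        },
    }
```

With both variables exported, openai_cxn_dict = build_cxn_dict() could stand in for the literal dictionary shown above.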

2. Deployment ID

In llava/eval/eval_multimodal_chat_gpt_score.py (line 55), replace the deployment ID with that of your GPT-4 model deployment if necessary.

3. Download Images

wget https://hanoverprod.z21.web.core.windows.net/med_llava/multimodal_chat_eval/llava_med_test_image_urls.jsonl -P data/
python llava/data/download_images.py \
    --input_path data/llava_med_test_image_urls.jsonl \
    --pmc_output_path data/pmc \
    --images_output_path data/images

4. Multimodal Chat Inference

In our case, llava_med_eval_qa50_qa.jsonl contains the questions, context (captions and inline-mentions) and responses generated by text-only GPT-4 (0314), which we treat as ground truth.

PYTHONPATH=. python llava/eval/model_vqa.py \
    --conv-mode mistral_instruct \
    --model-path microsoft/llava-med-v1.5-mistral-7b \
    --question-file data/eval/llava_med_eval_qa50_qa.jsonl \
    --image-folder data/images \
    --answers-file /path/to/answer-file.jsonl \
    --temperature 0.0
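
Before moving on to the GPT-4 scoring step, it can help to confirm that the answers file is well-formed. The sketch below assumes both files are JSON Lines (one JSON object per line) and that inference writes one answer per question; the second point is an assumption, not something this README states. The helper names are hypothetical and no field names are assumed.

```python
import json
from typing import Any, Dict, List


def read_jsonl(path: str) -> List[Dict[str, Any]]:
    """Parse a JSON Lines file, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                # Point at the offending line so it is easy to find.
                raise ValueError(f"{path}:{line_no}: invalid JSON") from err
    return records


def counts_match(question_file: str, answers_file: str) -> bool:
    """True when both files hold the same number of records."""
    return len(read_jsonl(question_file)) == len(read_jsonl(answers_file))
```

For example, counts_match("data/eval/llava_med_eval_qa50_qa.jsonl", "/path/to/answer-file.jsonl") returning False would suggest the inference run stopped early.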

5. GPT-4 Evaluation of the Generated Answers

python llava/eval/eval_multimodal_chat_gpt_score.py \
    --answers-file /path/to/answer-file.jsonl \
    --question-file data/eval/llava_med_eval_qa50_qa.jsonl \
    --scores-file /path/to/scores-file.jsonl

6. Summarize the Evaluation Results

python llava/eval/summarize_gpt_review.py \
    --scores-file /path/to/scores-file.jsonl

Data Download

LLaVA-Med Dataset

*The data statistics of biomedical multimodal instruction-following data: (a,b) The root verb-noun pairs of instructions and responses, where the inner circle of the plot represents the root verb of the output response, and the outer circle represents the direct nouns. (c) The distribution of images and QA pairs across the five domains; one image is shown per domain.*

Data Download

Alignment data files	Size
llava_med_alignment_500k.json	341.52 MiB

Instruction-Tuning data files	Size
llava_med_instruct_10k.json	19.24 MiB
llava_med_instruct_60k.json	84.65 MiB
llava_med_instruct_60k_inline_mention.json	83.61 MiB
llava_med_instruct_fig_captions.json	161.39 MiB

Evaluation files	Size
llava_med_eval_qa50_qa.jsonl	256.18 KiB
llava_med_eval_qa50_fig_captions.json	51.82 KiB
llava_med_qa50_instruct_caption_in_text_cleaned-60k-3epoch.json	100.97 KiB

Image URLs	Size
llava_med_image_urls.jsonl	122.82 MiB

download_images.py is used to download the PMC articles using the above image_urls file and extract the images.

To download our language-image multimodal instruction-following dataset, please run the following script:

sh download_data.sh
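
After the script finishes, the sizes listed in the tables above give a quick way to spot truncated downloads. The following is a rough sketch: it assumes all files sit directly in one directory, which may not match what download_data.sh produces, and size_problems is a hypothetical helper. Only the files whose sizes are listed in MiB are included, with a 1% tolerance because the table values are rounded.

```python
import os
from typing import Dict, List, Optional

# Sizes in MiB, copied from the tables above.
EXPECTED_MIB: Dict[str, float] = {
    "llava_med_alignment_500k.json": 341.52,
    "llava_med_instruct_10k.json": 19.24,
    "llava_med_instruct_60k.json": 84.65,
    "llava_med_instruct_60k_inline_mention.json": 83.61,
    "llava_med_instruct_fig_captions.json": 161.39,
    "llava_med_image_urls.jsonl": 122.82,
}


def size_problems(data_dir: str,
                  expected: Optional[Dict[str, float]] = None,
                  rel_tol: float = 0.01) -> List[str]:
    """List files that are missing or whose size is off by more than rel_tol."""
    expected = EXPECTED_MIB if expected is None else expected
    problems = []
    for name, mib in expected.items():
        path = os.path.join(data_dir, name)
        if not os.path.isfile(path):
            problems.append(f"{name}: missing")
            continue
        actual = os.path.getsize(path) / 2**20  # bytes to MiB
        if abs(actual - mib) > rel_tol * mib:
            problems.append(f"{name}: expected ~{mib} MiB, found {actual:.2f} MiB")
    return problems
```

An empty list from size_problems("data") means every listed file is present and close to its expected size.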

Model Description

Large Language and Vision Assistant for bioMedicine (i.e., “LLaVA-Med”) is a large language and vision model trained using a curriculum learning method for adapting LLaVA to the biomedical domain. It is an open-source release intended for research use only, to facilitate reproducibility of the corresponding paper, which claims improved performance on open-ended biomedical question answering tasks, including common visual question answering (VQA) benchmark datasets such as PathVQA and VQA-RAD.

Model Uses

Intended Use

The data, code, and model checkpoints are intended to be used solely for (I) future research on visual-language processing and (II) reproducibility of the experimental results reported in the reference paper. The data, code, and model checkpoints are not intended to be used in clinical care or for any clinical decision making purposes.

Primary Intended Use

The primary intended use is to support AI researchers reproducing and building on top of this work. LLaVA-Med and its associated models should be helpful for exploring various biomedical vision-language processing (VLP) and visual question answering (VQA) research questions.

Out-of-Scope Use

Any deployed use case of the model --- commercial or otherwise --- is out of scope. Although we evaluated the models using a broad set of publicly-available research benchmarks, the models and evaluations are intended for research use only and not intended for deployed use cases. Please refer to the associated paper for more details.

Data

This model builds upon the PMC-15M dataset, which is a large-scale parallel image-text dataset for biomedical vision-language processing. It contains 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central. It covers a diverse range of biomedical image types, such as microscopy, radiography, histology, and more.

Limitations

This model was developed using English corpora, and thus may be considered English-only. This model is evaluated on a narrow set of biomedical benchmark tasks, described in the LLaVA-Med paper. As such, it is not suitable for use in any clinical setting. Under some conditions, the model may make inaccurate predictions and display limitations, which may require additional mitigation strategies. In particular, this model is likely to carry many of the limitations of the model from which it is derived, LLaVA.

Further, this model was developed in part using the PMC-15M dataset. The figure-caption pairs that make up this dataset may contain biases reflecting the current practice of academic publication. For example, the corresponding papers may be enriched for positive findings, contain examples of extreme cases, and otherwise reflect distributions that are not representative of other sources of biomedical data.

Acknowledgement

If you find LLaVA-Med useful for your research and applications, please cite using this BibTeX:

@article{li2023llavamed,
  title={Llava-med: Training a large language-and-vision assistant for biomedicine in one day},
  author={Li, Chunyuan and Wong, Cliff and Zhang, Sheng and Usuyama, Naoto and Liu, Haotian and Yang, Jianwei and Naumann, Tristan and Poon, Hoifung and Gao, Jianfeng},
  journal={arXiv preprint arXiv:2306.00890},
  year={2023}
}
