haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Discussion] Request for Guidance on Stage Two: Converting LLaVA-V1.6 into 4-bit GGUF Format (PAPER) #1551

Open rohithbojja opened 3 weeks ago

rohithbojja commented 3 weeks ago

Discussion

LLaVA-Med V1.6: Training a Large Language-and-Vision Assistant for Biomedicine in Two and a Half Hours

Abstract

Large Language Models (LLMs) have revolutionized natural language processing, but biomedical applications require comprehension of both textual and visual data, and general-purpose vision-language models struggle with biomedical domain-specific knowledge. Our model, LLaVA-Med V1.6, significantly outperforms previous models, including Microsoft's Med-LLaVA, in visual question answering accuracy and precision. It is obtained by applying visual instruction tuning (VIT) to LLaVA V1.6 on a substantial portion of the PMC-VQA dataset; fine-tuning on a four-A6000 GPU setup takes only 2.5 hours. LLaVA-Med V1.6 also supports any-resolution images and exhibits enhanced visual reasoning and Optical Character Recognition (OCR) capabilities.

1. Introduction

The release of GPT-4 has spurred intensive research into multimodal large language models (MLLMs). While these models showcase impressive capabilities in general contexts, their effectiveness in biomedical scenarios is limited: existing models may falter on biomedical inquiries and risk inaccurate responses. LLaVA-Med V1.6, built upon the LLaVA V1.6 architecture with Mistral as the base model, leverages GPT-4 to generate diverse biomedical multimodal instruction-following data by synthesizing image-text pairs from the PMC-15M dataset. Combined with a curriculum learning strategy, this training recipe yields substantial performance gains.

2. Model Architecture

LLaVA-Med V1.6 employs a minimalist architectural design, similar in spirit to prefix tuning of language models: a trainable projection module connects a frozen image encoder to the language model (LM). The model trains a straightforward fully-connected projection layer on a modest dataset of 14k image-text pairs. By leveraging GPT-4 to autonomously curate biomedical task instructions from PubMed Central's extensive data repository, LLaVA-Med V1.6 achieves performance parity with state-of-the-art prefix-tuning LMs.
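
For intuition, the projection design can be sketched in a few lines of PyTorch. The dimensions below are illustrative assumptions (1024-d CLIP ViT-L/14 patch features, a 4096-d hidden size typical of 7B-class LMs), not values taken from our checkpoint, and the single linear layer mirrors the fully-connected projection described above rather than any specific released implementation.

```python
# Minimal sketch of the vision-to-language projection described in Section 2.
# Dimensions are illustrative assumptions, not measured from the paper's model.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim: int = 1024, lm_dim: int = 4096):
        super().__init__()
        # Fully-connected projection layer mapping image features into LM space.
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from a frozen image encoder.
        # Returns visual tokens in the LM embedding space, to be concatenated
        # with the text token embeddings before the LM forward pass.
        return self.proj(image_features)

# Example: 576 patch tokens (a 24x24 grid) from one 336x336 image crop.
tokens = VisionLanguageConnector()(torch.randn(1, 576, 1024))
print(tokens.shape)  # torch.Size([1, 576, 4096])
```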

3. Training and Dataset

The training process applied visual instruction tuning (VIT) to the latest version of LLaVA (v1.6) using a substantial portion of the PMC-VQA dataset. Fine-tuning ran on a four-A6000 GPU setup and took only 2.5 hours. The model increases input image resolution to capture the finer visual details critical in the medical domain, supporting any-resolution inputs through aspect-ratio grids such as 672x672, 336x1344, and 1344x336.
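
The any-resolution behaviour can be illustrated with a small helper that picks, from a set of candidate grids of 336x336 tiles, the one that preserves the most image detail after aspect-preserving downscaling. This is a simplified, illustrative sketch under an assumed candidate list drawn from the resolutions mentioned above; it is not the exact selection logic shipped with LLaVA V1.6.

```python
# Illustrative any-resolution tiling: choose the candidate grid of 336x336 tiles
# that retains the most effective image area with the least wasted space.
# Simplified for exposition; candidate list is an assumption.
CANDIDATE_GRIDS = [(672, 672), (336, 1344), (1344, 336), (336, 672), (672, 336)]

def select_grid(width: int, height: int) -> tuple[int, int]:
    best, best_effective, least_waste = None, -1, float("inf")
    for grid_w, grid_h in CANDIDATE_GRIDS:
        scale = min(grid_w / width, grid_h / height)  # fit the image inside the grid
        effective = min(int(width * scale) * int(height * scale), width * height)
        waste = grid_w * grid_h - effective
        if effective > best_effective or (effective == best_effective and waste < least_waste):
            best, best_effective, least_waste = (grid_w, grid_h), effective, waste
    return best

print(select_grid(1000, 300))  # wide image  -> (1344, 336)
print(select_grid(800, 800))   # square image -> (672, 672)
```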

4. Model Quantization and Format Conversion

To optimize the model for efficient deployment and inference, we first exported the original 32-bit checkpoint to a 16-bit (FP16) GGUF file, reducing the model size to 14 GB. We then quantized the FP16 GGUF to 4 bits using the Q4_K_M method from llama.cpp, reducing the model size to about 5 GB, and packaged the quantized model for Ollama, facilitating deployment across multiple devices.
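
A hedged sketch of this pipeline is shown below, assuming a local llama.cpp checkout. Script and binary names have changed across llama.cpp releases (e.g. convert.py vs convert_hf_to_gguf.py, quantize vs llama-quantize), so treat the exact paths and filenames here as assumptions to verify against your installed version; the model paths are placeholders. The vision tower and multimodal projector are exported separately by the scripts under llama.cpp's examples/llava directory and are omitted from this sketch.

```python
# Sketch of the FP16 export -> Q4_K_M quantization -> Ollama packaging steps.
# Paths, filenames, and script names are assumptions; check your llama.cpp version.
import subprocess

LLAMA_CPP = "/path/to/llama.cpp"          # local llama.cpp checkout (placeholder)
HF_MODEL_DIR = "/path/to/llava-med-v1.6"  # fine-tuned HF checkpoint (placeholder)

# 1) Export the language-model weights to an FP16 GGUF file (~14 GB for a 7B model).
subprocess.run(
    ["python", f"{LLAMA_CPP}/convert_hf_to_gguf.py", HF_MODEL_DIR,
     "--outtype", "f16", "--outfile", "llava-med-f16.gguf"],
    check=True,
)

# 2) Quantize the FP16 GGUF to 4 bits with the Q4_K_M scheme (~5 GB).
subprocess.run(
    [f"{LLAMA_CPP}/llama-quantize", "llava-med-f16.gguf",
     "llava-med-q4_k_m.gguf", "Q4_K_M"],
    check=True,
)

# 3) Package for Ollama: a Modelfile pointing at the quantized GGUF.
with open("Modelfile", "w") as f:
    f.write("FROM ./llava-med-q4_k_m.gguf\n")
subprocess.run(["ollama", "create", "llava-med-v1.6-q4", "-f", "Modelfile"], check=True)
```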

Benefits of Quantization and Format Conversion:

Experimental Results:

5. Evaluation

LLaVA-Med V1.6 was evaluated on three standard biomedical visual question answering datasets. The model demonstrated superior visual conversation capabilities and enhanced OCR and visual reasoning abilities, outperforming previous state-of-the-art models, including Microsoft's Med-LLaVA, on specific metrics across these datasets.

6. Conclusion

LLaVA-Med V1.6 represents a significant advancement in biomedical conversational AI, offering enhanced performance in visual question answering and improved deployment efficiency. The quantization and format conversion processes ensure that the model is both high-performing and versatile, suitable for a wide range of medical applications. Our contributions pave the way for more sophisticated biomedical AI applications, enabling more accurate and informative responses to biomedical inquiries.

Query for Guidance

I am currently writing a paper on LLaVA-Med V1.6, which we have fine-tuned on medical images. The paper's first stage, detailing the fine-tuning process, is complete. I am now focusing on stage two: converting our fine-tuned model into a 4-bit GGUF format.

Could you please advise on what key points to include in this section? Additionally, could you suggest any relevant references or previous papers that discuss similar quantization processes?

Thank you for your assistance.

Best regards,

Rohith