meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama with composable FSDP & PEFT methods, covering single- and multi-node GPU setups. Supports default & custom datasets for applications such as summarization and Q&A. Also supports a number of inference solutions, such as HF TGI and vLLM, for local or cloud deployment, plus demo apps showcasing Meta Llama for WhatsApp & Messenger.
15.29k stars · 2.21k forks

Llama 3.2 Vision Models Fine-Tuning Recipe #770

Closed · JimChienTW closed this 2 days ago

JimChienTW commented 2 weeks ago

🚀 The feature, motivation and pitch

I noticed that in the original paper, "The Llama 3 Herd of Models", Section 7.5.2 on vision-model SFT states that only the vision encoder and image adapter weights are updated, while the LLM weights remain frozen.

However, in the fine-tuning recipe for vision models that you provide, it appears that all of the LLM weights are being tuned. Is this an oversight, or are you planning to update the training script to tune only the vision encoder and image adapter?
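For reference, the freezing scheme described in the paper can be sketched in a few lines of PyTorch: freeze everything, then re-enable gradients only for parameters whose names fall under the vision encoder and image adapter. The submodule names `vision_model`, `multi_modal_projector`, and `language_model` below are assumptions modeled on the Hugging Face Mllama module layout and are illustrated on a toy module, not on the actual recipe's model; adjust the prefixes to whatever the real model uses.

```python
import torch.nn as nn


def freeze_all_but_vision(model, trainable_prefixes=("vision_model", "multi_modal_projector")):
    """Freeze every parameter except those under the given name prefixes.

    The default prefixes are an assumption based on the Hugging Face
    Mllama naming; pass the real model's vision-encoder / adapter
    prefixes if they differ.
    """
    for name, param in model.named_parameters():
        # str.startswith accepts a tuple, so this checks all prefixes at once.
        param.requires_grad = name.startswith(trainable_prefixes)
    return model


class ToyVLM(nn.Module):
    """Toy stand-in for a vision-language model, just to show the effect."""

    def __init__(self):
        super().__init__()
        self.vision_model = nn.Linear(4, 4)            # stands in for the vision encoder
        self.multi_modal_projector = nn.Linear(4, 4)   # stands in for the image adapter
        self.language_model = nn.Linear(4, 4)          # stands in for the LLM to keep frozen


model = freeze_all_but_vision(ToyVLM())
trainable = sorted(n for n, p in model.named_parameters() if p.requires_grad)
# Only vision_model.* and multi_modal_projector.* remain trainable;
# language_model.* is frozen, matching the paper's description.
```

The same prefix-based filter can also feed the optimizer directly, e.g. `torch.optim.AdamW(p for p in model.parameters() if p.requires_grad)`, so frozen LLM weights never receive updates.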

Alternatives

No response

Additional context

No response