unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0
15.24k stars 1.02k forks

Finetuning multimodal vision models? (Llava, and BakLLaVA) #158

Open babycommando opened 7 months ago

babycommando commented 7 months ago

Hey unsloth team, beautiful work being done here.

I am the author of MachinaScript for Robots - a framework for building LLM-powered robots in your garage!

The LLM basically outputs a JSON-like set of instructions for actions, movements, and skill usage, which are then parsed by a Raspberry Pi and serialized to an Arduino for execution. I am using Unsloth to train a model that outputs this syntax so we can have smaller system prompts and faster execution for the robot.
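
Just to make that pipeline concrete, here is a rough, hypothetical sketch of the loop. All field names, values, and the serial port are invented for illustration; the real grammar lives in the MachinaScript repo:

```python
# Hypothetical sketch: the LLM emits a JSON-like instruction set, the
# Raspberry Pi parses it and forwards each command to the Arduino over
# serial. Field names and values here are made up for illustration.
import json
import serial  # pyserial

llm_output = """
{"actions": [
    {"motor": "neck_yaw", "degrees": 30, "speed": "slow"},
    {"skill": "take_photo"}
]}
"""

link = serial.Serial("/dev/ttyUSB0", 9600, timeout=1)  # Arduino over USB serial
for step in json.loads(llm_output)["actions"]:
    link.write((json.dumps(step) + "\n").encode())     # one command per line
```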

However, these models only take text instructions; there's nothing vision-related, which makes it difficult to build a fully self-operating robot out of them.

The project was initially based on GPT-4V, but with great multimodal open models like Obsidian, LLaVA, and BakLLaVA out there, the world of LLM-powered robots is ready to take a great leap forward. I would love to plan a dataset and finetune a vision model to output MachinaScript syntax using the awesome capabilities of Unsloth. Is it possible to finetune multimodal LLMs, or will it be in the future?

danielhanchen commented 7 months ago

@babycommando Hey thanks for the cool request! Super cool repo as well! And super interesting you're finetuning to output instructions then actually executing them!! That's super cool!!!

Hmm, currently vision models are a bit more complex. Technically, in the LLaVA paper the image is first run through a vision encoder, its features are projected through a small projection layer into the LLM's embedding space, and the LLM then consumes them alongside the text tokens (see the architecture diagram in the LLaVA paper).
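
Roughly speaking, the forward pass looks something like this. This is a simplified sketch with illustrative dimensions, not the actual LLaVA or Unsloth code:

```python
# Simplified sketch of a LLaVA-style forward pass: a vision encoder produces
# patch features, a small projector maps them into the LLM's embedding space,
# and the LLM attends over [image tokens + text tokens].
import torch
import torch.nn as nn

class LlavaStyleModel(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a CLIP ViT
        self.projector = nn.Sequential(       # the projection layer
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.llm = llm                        # any causal LM

    def forward(self, pixel_values, text_embeds):
        image_feats = self.vision_encoder(pixel_values)  # (B, num_patches, vision_dim)
        image_tokens = self.projector(image_feats)       # (B, num_patches, llm_dim)
        fused = torch.cat([image_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=fused)
```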

So in theory the LLM part can be optimized with Unsloth, and the rest can be optimized at a later date. I just haven't had time to work on vision + LLM type models, but we will do so at a later date :)
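
For the text-only LLM part, the usual Unsloth QLoRA path already works today, something along these lines (checkpoint name and LoRA hyperparameters are just placeholders, not a recommendation):

```python
# Rough sketch of finetuning just the text LLM with Unsloth's public API
# (checkpoint name and LoRA hyperparameters are placeholders).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",  # any supported base model
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
)
# ...then train as usual, e.g. with trl's SFTTrainer on the MachinaScript text data.
```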

Again super cool project!

babycommando commented 6 months ago

Hey Daniel, sorry for the delay! I did some deep research on finetuning multimodal models, and it turns out the LLaVA repo already provides most of what we need to get started.

It would be so cool if we could borrow Unsloth's awesome capabilities to run it.

This is an official doc for finetuning LLaVA: https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md

It covers the steps for finetuning LLaVA on a custom dataset (try to also take a look at the whole /scripts directory).

Also, the dataset format is the ShareGPT-style format mentioned in the doc. For anyone else wondering how the finetuning dataset should be formatted, this is the dataset they used to make LLaVA: https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K
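
For reference, a single training record in that format looks roughly like this (field names follow the LLaVA repo; the values are made up):

```python
# Minimal sketch of one LLaVA-Instruct / ShareGPT-style training record.
# The "<image>" placeholder marks where the image features get injected.
import json

record = {
    "id": "0001",
    "image": "robot_workbench.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat should the robot do next?"},
        {"from": "gpt", "value": '{"actions": [{"skill": "pick_up_screwdriver"}]}'},
    ],
}

# LLaVA's finetuning scripts expect a JSON list of such records.
with open("machinascript_llava_train.json", "w") as f:
    json.dump([record], f, indent=2)
```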

About Obsidian 3B: after some talk with one of the core engineers, it's clear that it is a version of LLaVA with a variant of Zephyr 3B underneath. It is by the same people who made Hermes (Nous).

So, I hope this sheds some light on integrating multimodal training. They both seem to be using DeepSpeed; I haven't tried it myself yet. Would love to use Unsloth for this!

And again, thank you so much for the interest in MachinaScript, free the robots!!!!

danielhanchen commented 6 months ago

@babycommando Thanks for the writeup! Super useful and wonderful insights! :) I will check these all out in the following days! :)) Hopefully Unsloth will have support for LLaVA-type models in the near future :))

oliverbob commented 6 months ago

Can't wait to see it implemented. Thanks.

linshi1111 commented 4 months ago

I am currently exploring the qnguyen3/nanoLLaVA model, which is built on top of Quyen-SE-v0.1 (Qwen1.5-0.5B) and incorporates Google's SigLIP-400M.

Would there be support for a Colab or Kaggle fine-tuning notebook for qnguyen3/nanoLLaVA?

Thank you for making the Unsloth project open-source. I am eagerly looking forward to seeing this implemented.

Here are the links to nanoLLaVA project:

https://huggingface.co/qnguyen3/nanoLLaVA

https://github.com/qnguyen3/nanoLLaVA

danielhanchen commented 4 months ago

Hmm, LLaVA will probably come in a future release.

kinchahoy commented 2 months ago

Big +1 for any fairly recent vision LLM, ideally one of the smaller ones like nanoLLaVA etc.

Namzakku commented 1 week ago

Hugging Face now supports LLaVA, LLaVA-NeXT, and LLaVA-NeXT-Video (LLaVA-NeXT is the improved version of LLaVA), and there are multiple tutorials with PyTorch Lightning (which can be converted to HF Trainer) and also with HF Trainer (for the video version).

LLaVa: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LLaVa
LLaVa-NeXT: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LLaVa-NeXT
LLaVA-NeXT-Video: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LLaVA-NeXT-Video
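
For a quick sanity check, inference with the transformers classes used in those tutorials looks roughly like this. The checkpoint name, image URL, and the [INST] prompt template are assumptions based on the Mistral-7B variant; check the tutorial for the exact ones you need:

```python
# Rough inference sketch with transformers' LLaVA-NeXT classes.
# Checkpoint and prompt template assume the Mistral-7B variant.
import torch
import requests
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open(requests.get("https://example.com/robot.jpg", stream=True).raw)
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```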

Hope this helps!

kinchahoy commented 1 week ago

Thank you! Is support for Phi-3.5-vision likely? (Sorry, the multimodal world moves fast!)


Namzakku commented 1 week ago

Looking at the tutorials and the multimodal models section of the transformers library docs, I don't think they have support for Phi-3.5 yet.