modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

DeepSeek VL finetune vision encoder? #543

Closed. SinanAkkoyun closed this issue 7 months ago.

SinanAkkoyun commented 8 months ago

Hi! Does finetuning deepseek VL also finetune the vision encoder?

Jintao-Huang commented 8 months ago

This model currently only supports fine-tuning of the LLM portion.

wwzhuang01 commented 8 months ago

Hi! Is finetuning the aligner of deepseek-vl supported?

Jintao-Huang commented 8 months ago

I am working on adding support for it.

SinanAkkoyun commented 8 months ago

@Jintao-Huang Can you please provide pretraining code for the vision encoder? We need to give it new capabilities :)

Jintao-Huang commented 8 months ago

CPT (continued pretraining):

# Full-parameter training on 4 GPUs with DeepSpeed ZeRO-2;
# xxx.jsonl / yyy.jsonl are placeholders for your custom train/val datasets.
NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
    --model_type deepseek-vl-7b-chat \
    --custom_train_dataset_path xxx.jsonl \
    --custom_val_dataset_path yyy.jsonl \
    --train_dataset_sample -1 \
    --sft_type full \
    --deepspeed default-zero2

SinanAkkoyun commented 8 months ago

Wow! That was super quick, thank you so so much!!! ❤️

SinanAkkoyun commented 8 months ago

In what format must the custom train dataset be? (And what does the val dataset do exactly?)

soloice commented 8 months ago

Bro, you really rock!

SinanAkkoyun commented 8 months ago

{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "images": ["https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg", "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg", "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"]}

Is that the right format? Would I now need to place <image_placeholder> where each image should go? Also, is it possible to do multi-turn with multiple images?

Jintao-Huang commented 8 months ago

The format is similar to this:

[PROMPT]<|begin▁of▁sentence|>You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.

User: <image_placeholder>please describe the image.

Assistant:[OUTPUT]A large airplane is suspended from the ceiling.<|end▁of▁sentence|>
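
For reference, a custom dataset line that gets rendered into the template above might look like the following (a sketch: the image path is a placeholder, and the <image_placeholder> tag appears to be inserted by the template itself, since the example queries later in this thread do not contain it):

{"query": "please describe the image.", "response": "A large airplane is suspended from the ceiling.", "images": ["path/to/demo.jpeg"]}
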
Jintao-Huang commented 8 months ago

The image will be automatically inserted at the <image_placeholder> position to form inputs_embeds. Multi-turn conversations with multiple images are supported, but each turn of the conversation can only contain one image.

SinanAkkoyun commented 8 months ago

@Jintao-Huang Thank you so much :)

I don't quite understand these points:

Again thanks for your help!

Jintao-Huang commented 8 months ago

{"query": "55555", "response": "66666", "images": ["image_path"]}
{"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "images": ["image_path", "image_path2", "image_path3"]}

https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/deepseek-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.md

Jintao-Huang commented 8 months ago

Every multimodal model has its own custom dataset style, which is demonstrated in its best-practices doc. For example, some multimodal models support zero or multiple images in a single conversation, such as qwen-vl; some only allow one image per conversation turn, like deepseek-vl; and some require exactly one image for the entire dialogue, such as cogvlm.

Jintao-Huang commented 8 months ago

The val_dataset is used to compute the eval_loss. You can also choose not to provide custom_val_dataset_path and only pass custom_train_dataset_path.
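
For example, the earlier CPT command also works without the validation-set flag (a sketch reusing the same placeholder dataset path):

NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
    --model_type deepseek-vl-7b-chat \
    --custom_train_dataset_path xxx.jsonl \
    --train_dataset_sample -1 \
    --sft_type full \
    --deepspeed default-zero2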

SinanAkkoyun commented 8 months ago

Thank you so so much!!! 🙏❤️

SinanAkkoyun commented 7 months ago

Hey @Jintao-Huang , thank you again for implementing the model, I really appreciate your work!

Upon further testing, I realized that DeepSeek VL supports multi-image prompts. Could you please implement multi-image support for training? I'd love to train the model on some image comparisons.

daihuangyu commented 7 months ago

This format seems to only support training on images + text. Can it also be trained with plain-text data?

CudaMem commented 7 months ago

@Jintao-Huang Hello! Is it the case that deepseek-vl can currently only fine-tune the LLM part and cannot train the aligner (connector) part? Thanks 🙏

Jintao-Huang commented 7 months ago

@Jintao-Huang Hello! Is it the case that deepseek-vl can currently only fine-tune the LLM part and cannot train the aligner (connector) part? Thanks 🙏

Setting --lora_target_modules to ALL is enough. You can check the best-practices doc; it is described there, including how to do full-parameter training.
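
A minimal LoRA sketch along those lines (a single GPU and a placeholder dataset path are assumed; see the linked best-practices doc for the recommended full set of flags):

CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type deepseek-vl-7b-chat \
    --custom_train_dataset_path xxx.jsonl \
    --sft_type lora \
    --lora_target_modules ALL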

SinanAkkoyun commented 7 months ago

Thank you so so much! 😊

SinanAkkoyun commented 7 months ago

@Jintao-Huang How much VRAM is needed to finetune the 7b VL model?

# Experimental Environment: A100
# GPU Memory Requirement: 80GB
# Runtime: 2.5 hours
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model_type qwen1half-7b-chat \
    --dataset blossom-math-zh \
    --num_train_epochs 5 \
    --sft_type full \
    --output_dir output \
    --eval_steps 500 \

The docs say one needs 80 GB for a normal 7B model; however, when I try to train on the research rig with an A100, I get an OOM. When I try to split across 4 GPUs (1 A100 and 3 RTX 4090s), it does not utilize the A100 and OOMs on the 3 4090s before training can start.