modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

DeepSeek VL finetune vision encoder? #543

Closed. SinanAkkoyun closed this issue 7 months ago.

SinanAkkoyun commented 8 months ago

Hi! Does finetuning deepseek VL also finetune the vision encoder?

Jintao-Huang commented 8 months ago

This model currently only supports fine-tuning of the LLM portion.

wwzhuang01 commented 8 months ago

Hi! Is finetuning the aligner of deepseek-vl supported?

Jintao-Huang commented 8 months ago

I am working on adding support for it.

SinanAkkoyun commented 8 months ago

@Jintao-Huang Can you please provide pretraining code for the vision encoder? We need to give it new capabilities :)

Jintao-Huang commented 8 months ago

CPT (continued pretraining):

# Full-parameter training on 4 GPUs with DeepSpeed ZeRO-2;
# xxx.jsonl / yyy.jsonl are placeholders for your custom train/val datasets.
NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
    --model_type deepseek-vl-7b-chat \
    --custom_train_dataset_path xxx.jsonl \
    --custom_val_dataset_path yyy.jsonl \
    --train_dataset_sample -1 \
    --sft_type full \
    --deepspeed default-zero2

SinanAkkoyun commented 8 months ago

Wow! That was super quick, thank you so so much!!! ❤️

SinanAkkoyun commented 8 months ago

In what format must the custom train dataset be? (And what does the val dataset do exactly?)

soloice commented 8 months ago

Bro, you really rock!

SinanAkkoyun commented 8 months ago

{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "images": ["https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg", "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg", "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"]}

Is that the right format? Would I now need to place <image_placeholder> where each image should go? Also, is it possible to do multi-turn with multiple images?

Jintao-Huang commented 8 months ago

The format is similar to this:

[PROMPT]<|begin▁of▁sentence|>You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.

User: <image_placeholder>please describe the image.

Assistant:[OUTPUT]A large airplane is suspended from the ceiling.<|end▁of▁sentence|>
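
For reference, a custom dataset line that gets rendered into the template above might look like the following (a sketch: the image path is a placeholder, and the <image_placeholder> tag appears to be inserted by the template itself, since the example queries later in this thread do not contain it):

{"query": "please describe the image.", "response": "A large airplane is suspended from the ceiling.", "images": ["path/to/demo.jpeg"]}
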
Jintao-Huang commented 8 months ago

The image will be automatically inserted at the <image_placeholder> position to form inputs_embeds. Multi-turn conversations with multiple images are supported, but each turn of the conversation can only contain one image.

SinanAkkoyun commented 8 months ago

@Jintao-Huang Thank you so much :)

I don't quite understand these points:

Again thanks for your help!

Jintao-Huang commented 8 months ago

{"query": "55555", "response": "66666", "images": ["image_path"]}
{"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "images": ["image_path", "image_path2", "image_path3"]}

https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/deepseek-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.md

Jintao-Huang commented 8 months ago

Every multimodal model has its own custom dataset style, which is demonstrated in its best-practices doc. For example, some multimodal models support zero or multiple images in a single conversation, such as qwen-vl; some only allow one image per conversation turn, like deepseek-vl; and some require exactly one image for the entire dialogue, such as cogvlm.

Jintao-Huang commented 8 months ago

The val_dataset is used to compute the eval_loss. You can also choose not to provide custom_val_dataset_path and only pass custom_train_dataset_path.
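
For example, the earlier CPT command also works without the validation-set flag (a sketch reusing the same placeholder dataset path):

NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
    --model_type deepseek-vl-7b-chat \
    --custom_train_dataset_path xxx.jsonl \
    --train_dataset_sample -1 \
    --sft_type full \
    --deepspeed default-zero2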

SinanAkkoyun commented 8 months ago

Thank you so so much!!! 🙏❤️

SinanAkkoyun commented 7 months ago

Hey @Jintao-Huang , thank you again for implementing the model, I really appreciate your work!

Upon further testing, I realized that DeepSeek VL supports multi-image prompts. Could you please implement multi-image support for training? I'd love to train the model on some image comparisons.

daihuangyu commented 7 months ago

This format seems to only support training on images + text. Can it also be trained with plain-text data?

CudaMem commented 7 months ago

@Jintao-Huang Hello! Is it the case that deepseek-vl can currently only fine-tune the LLM part and cannot train the aligner (connector) part? Thanks 🙏

Jintao-Huang commented 7 months ago

@Jintao-Huang Hello! Is it the case that deepseek-vl can currently only fine-tune the LLM part and cannot train the aligner (connector) part? Thanks 🙏

Setting --lora_target_modules to ALL is enough. You can check the best-practices doc; it is described there, including how to do full-parameter training.
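
A minimal LoRA sketch along those lines (a single GPU and a placeholder dataset path are assumed; see the linked best-practices doc for the recommended full set of flags):

CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type deepseek-vl-7b-chat \
    --custom_train_dataset_path xxx.jsonl \
    --sft_type lora \
    --lora_target_modules ALL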

SinanAkkoyun commented 7 months ago

Thank you so so much! 😊

SinanAkkoyun commented 7 months ago

@Jintao-Huang How much VRAM is needed to finetune the 7b VL model?

# Experimental Environment: A100
# GPU Memory Requirement: 80GB
# Runtime: 2.5 hours
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model_type qwen1half-7b-chat \
    --dataset blossom-math-zh \
    --num_train_epochs 5 \
    --sft_type full \
    --output_dir output \
    --eval_steps 500 \

The docs say one needs 80 GB for a normal 7B model; however, when I try to train on the research rig with an A100, I get an OOM. When I try to split across 4 GPUs (1 A100 and 3 RTX 4090s), it does not utilize the A100 and OOMs on the 3 4090s before training can start.