Closed: SinanAkkoyun closed this issue 7 months ago
This model currently only supports fine-tuning of the LLM portion
hi! is finetuning the aligner of deepseek-vl supported?
I am trying to provide support.
@Jintao-Huang Can you please provide pretraining code for the vision encoder? We need to give it new capabilities :)
CPT:
NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
--model_type deepseek-vl-7b-chat \
--custom_train_dataset_path xxx.jsonl \
--custom_val_dataset_path yyy.jsonl \
--train_dataset_sample -1 \
--sft_type full \
--deepspeed default-zero2
Wow! That was super quick, thank you so so much!!! ❤️
In what format must the custom train dataset be? (And what does the val dataset do exactly?)
Bro, you really rock!
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "images": ["https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg", "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg", "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"]}
Is that the right format? Would I now need to place <image_placeholder> where each image should go?
Also, is it possible to make multi-turn with multiple images?
Format is similar to this:
[PROMPT]<|begin▁of▁sentence|>You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.
User: <image_placeholder>please describe the image.
Assistant:[OUTPUT]A large airplane is suspended from the ceiling.<|end▁of▁sentence|>
The image will be automatically inserted at the image_placeholder position to form inputs_embeds. Multi-turn conversations with multiple images are supported, but each turn of the conversation can only contain one image.
@Jintao-Huang Thank you so much :)
I don't quite understand these points: what is the val dataset being used for, and is it necessary? Again, thanks for your help!
{"query": "55555", "response": "66666", "images": ["image_path"]}
{"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "images": ["image_path", "image_path2", "image_path3"]}
Every multimodal model has its own custom dataset style, which is demonstrated in the best practices docs. For example, some multimodal models support zero or multiple images in a single conversation, such as qwen-vl. Some allow only one image per conversation, like deepseek-vl. And some require exactly one image for the entire dialogue, such as cogvlm.
The val_dataset is used to compute the eval_loss during training. You can also choose not to provide custom_val_dataset_path and only pass custom_train_dataset_path.
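For reference, a minimal sketch of a run that passes both files, reusing only flags that already appear in this thread (train.jsonl and val.jsonl are placeholder paths):
NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
--model_type deepseek-vl-7b-chat \
--custom_train_dataset_path train.jsonl \
--custom_val_dataset_path val.jsonl \
--train_dataset_sample -1 \
--sft_type full \
--eval_steps 500 \
--deepspeed default-zero2
If --custom_val_dataset_path is omitted, training still runs; you simply won't get an eval_loss on your own validation set.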
Thank you so so much!!! 🙏❤️
Hey @Jintao-Huang , thank you again for implementing the model, I really appreciate your work!
Upon further testing, I realized that DeepSeek VL supports multi-image prompts. Could you please implement multi-image support for training? I'd love to train the model on some image comparisons
This format seems to only support training with images + text. Is training with plain-text data also possible?
@Jintao-Huang Hello! Is it currently only possible to fine-tune the LLM part of deepseek-vl, with no way to train the aligner (connector) part? Thanks 🙏
Just set --lora_target_modules to ALL. You can check the best practices docs, it's written there, including how to do full-parameter training.
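A minimal sketch of what that might look like for deepseek-vl, assuming --sft_type lora is the LoRA counterpart of the --sft_type full used above (the dataset path is a placeholder):
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type deepseek-vl-7b-chat \
--custom_train_dataset_path train.jsonl \
--sft_type lora \
--lora_target_modules ALL
With ALL, the LoRA adapters should also cover the aligner/connector modules, which is the point of the reply above.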
Thank you so so much! 😊
@Jintao-Huang How much VRAM is needed to finetune the 7b VL model?
# Experimental Environment: A100
# GPU Memory Requirement: 80GB
# Runtime: 2.5 hours
CUDA_VISIBLE_DEVICES=0 \
swift sft \
--model_type qwen1half-7b-chat \
--dataset blossom-math-zh \
--num_train_epochs 5 \
--sft_type full \
--output_dir output \
--eval_steps 500 \
The docs say one needs 80GB for a normal 7b model; however, when I try to train on the research rig with an A100, I get an OOM. When I try to split across 4 GPUs (1 A100 and 3 4090s), it does not utilize the A100 and OOMs on the 3 4090s before training can start.
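One hedged workaround sketch, using only flags that appear elsewhere in this thread: switch from full-parameter training to LoRA, which typically needs far less than 80GB for a 7B model. Whether it fits on the mixed A100/4090 rig is not guaranteed, and note that ZeRO-style sharding splits states roughly evenly across ranks, so the 24GB cards remain the bottleneck in a heterogeneous setup.
# LoRA fine-tuning instead of --sft_type full; typically much lower GPU memory
CUDA_VISIBLE_DEVICES=0 \
swift sft \
--model_type deepseek-vl-7b-chat \
--custom_train_dataset_path train.jsonl \
--sft_type lora \
--lora_target_modules ALL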
Hi! Does finetuning deepseek VL also finetune the vision encoder?