modelscope / ms-swift

Use PEFT or Full-parameter to finetune 350+ LLMs or 90+ MLLMs. (Qwen2.5, GLM4v, Internlm2.5, Yi, Llama3.1, Llava-Video, Internvl2, MiniCPM-V-2.6, Deepseek, Baichuan2, Gemma2, Phi3-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

Fine-tuning best practices for qwen2.5-72b-instruct and qwen2-vl-72b-instruct. #2064

Open Jintao-Huang opened 2 days ago

Jintao-Huang commented 2 days ago

More docs:

qwen2-vl: https://github.com/modelscope/ms-swift/blob/main/docs/source/Multi-Modal/qwen2-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.md

qwen1.5: https://github.com/modelscope/ms-swift/blob/main/docs/source/LLM/Qwen1.5%E5%85%A8%E6%B5%81%E7%A8%8B%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.md

We use ms-swift to run self-cognition fine-tuning on qwen2.5 and image-OCR fine-tuning on qwen2-vl, and then perform inference with the fine-tuned models.

Before starting fine-tuning, please make sure your environment is set up correctly:

# Install ms-swift.
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .[llm]

# qwen2-vl
# https://github.com/QwenLM/Qwen2-VL/issues/96
pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830
# vLLM for inference acceleration (quote the spec so the shell does not treat ">" as a redirection)
pip install "vllm>=0.6.1"

In practice, large-model fine-tuning is usually done on a custom dataset. Here we show a demo that can be run directly.

qwen2.5-72b-instruct

We perform self-cognition fine-tuning on Qwen2.5-72B-Instruct.

Self-cognition dataset: https://www.modelscope.cn/datasets/swift/self-cognition
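For reference, the self-cognition dataset consists of short, templated Q&A pairs about the model's identity; a row looks roughly like the following (an illustrative sketch, not the exact schema), with the name/author placeholders filled in from --model_name and --model_author during training:

{"query": "Who are you?", "response": "I am {{NAME}}, an AI assistant trained by {{AUTHOR}}."}
{"query": "你是谁?", "response": "我是{{NAME}},由{{AUTHOR}}训练的AI助手。"}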

General mixed datasets: the qwen2-pro-en / qwen2-pro-zh datasets used in the script below.

Fine-tuning script:

# Experiment environment: 4 * A100
# GPU memory usage: 4 * 70GB
# The "#500" suffix samples 500 examples from each dataset.
NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
    --model_type qwen2_5-72b-instruct \
    --model_id_or_path qwen/Qwen2.5-72B-Instruct \
    --dataset qwen2-pro-en#500 qwen2-pro-zh#500 self-cognition#500 \
    --logging_steps 5 \
    --learning_rate 1e-4 \
    --output_dir output \
    --lora_target_modules ALL \
    --model_name 小黄 'Xiao Huang' \
    --model_author 魔搭 ModelScope \
    --system "You are a helpful assistant." \
    --deepspeed default-zero3

# Example runnable on a single A10/3090 (Qwen2.5-7B-Instruct)
# GPU memory usage: 24GB
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type qwen2_5-7b-instruct \
    --model_id_or_path qwen/Qwen2.5-7B-Instruct \
    --dataset qwen2-pro-en#500 qwen2-pro-zh#500 self-cognition#500 \
    --logging_steps 5 \
    --max_length 2048 \
    --learning_rate 1e-4 \
    --output_dir output \
    --lora_target_modules ALL \
    --model_name 小黄 'Xiao Huang' \
    --model_author 魔搭 ModelScope \
    --system "You are a helpful assistant."

Documentation for custom datasets: https://github.com/modelscope/ms-swift/blob/main/docs/source/Instruction/%E8%87%AA%E5%AE%9A%E4%B9%89%E4%B8%8E%E6%8B%93%E5%B1%95.md
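As a rough sketch of the format described there, a plain-text (LLM) custom dataset can be a JSONL file with query/response fields and an optional history for multi-turn data, mirroring the multimodal example later in this post (the contents below are made up):

{"query": "11111", "response": "22222"}
{"query": "aaaaa", "response": "bbbbb", "history": [["query1", "response1"], ["query2", "response2"]]}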

GPU memory consumption during fine-tuning:

[image]

Loss curve during fine-tuning:

[image]

The inference script after fine-tuning is shown below; replace ckpt_dir with the last checkpoint directory produced by training. We can use vLLM to accelerate inference on the merged checkpoint:

# Direct inference
CUDA_VISIBLE_DEVICES=0,1 swift infer \
    --ckpt_dir output/qwen2_5-72b-instruct/vx-xxx/checkpoint-xxx

# Merge LoRA and use vLLM for inference acceleration
CUDA_VISIBLE_DEVICES=0,1 swift export \
    --ckpt_dir output/qwen2_5-72b-instruct/vx-xxx/checkpoint-xxx \
    --merge_lora true

CUDA_VISIBLE_DEVICES=0,1,2,3 swift infer \
    --ckpt_dir output/qwen2_5-72b-instruct/vx-xxx/checkpoint-xxx-merged \
    --infer_backend vllm --max_model_len 8192 \
    --tensor_parallel_size 4

Example of the fine-tuned model running inference on the validation set:

[image]

qwen2-vl-72b-instruct

We fine-tune Qwen2-VL-72B-Instruct for OCR. Examples of grounding tasks and video fine-tuning can be found in the ms-swift documentation: https://github.com/modelscope/ms-swift/blob/main/docs/source/Multi-Modal/qwen2-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.md
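For video data, the custom-dataset format is analogous to the image format shown below, using a <video> tag and a videos field; a rough sketch (see the linked doc for the authoritative format):

{"query": "<video>ddddd", "response": "eeeee", "videos": ["video_path"]}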

Fine-tuning dataset: https://modelscope.cn/datasets/AI-ModelScope/LaTeX_OCR

Fine-tuning script:

# Experiment environment: 8 * A100
SIZE_FACTOR=8 MAX_PIXELS=602112 \
NPROC_PER_NODE=8 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
  --model_type qwen2-vl-72b-instruct \
  --model_id_or_path qwen/Qwen2-VL-72B-Instruct \
  --sft_type lora \
  --dataset latex-ocr-print#20000 \
  --deepspeed default-zero3

To use a custom dataset, simply specify it as follows:

# val_dataset is optional; if not specified, a portion of dataset will be split off as the validation set
  --dataset train.jsonl \
  --val_dataset val.jsonl \

Custom dataset format:

{"query": "<image>55555", "response": "66666", "images": ["image_path"]}
{"query": "<image><image>eeeee", "response": "fffff", "history": [], "audios": ["image_path1", "image_path2"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]]}

GPU memory consumption during fine-tuning:

[image]

Loss curve during fine-tuning (due to time constraints, only 250 steps were trained here):

[image]

The inference script after fine-tuning is shown below; replace ckpt_dir with the last checkpoint directory produced by training. We can use vLLM to accelerate inference on the merged checkpoint:

# Direct inference
CUDA_VISIBLE_DEVICES=0,1 swift infer \
    --ckpt_dir output/qwen2-vl-72b-instruct/vx-xxx/checkpoint-xxx \
    --load_dataset_config true

# Merge LoRA and use vLLM for inference acceleration
CUDA_VISIBLE_DEVICES=0,1 swift export \
    --ckpt_dir output/qwen2-vl-72b-instruct/vx-xxx/checkpoint-xxx \
    --merge_lora true

CUDA_VISIBLE_DEVICES=0,1,2,3 swift infer \
    --ckpt_dir output/qwen2-vl-72b-instruct/vx-xxx/checkpoint-xxx-merged \
    --load_dataset_config true --infer_backend vllm \
    --tensor_parallel_size 4 --max_model_len 16384

Example of the fine-tuned model running inference on the validation set:

[image]

llp1992 commented 1 day ago

Does qwen2-vl support training with multiple images and multi-turn dialogue?

etemiz commented 1 day ago

can I train 72b with 2 * A6000? (2 * 48GB)

Jintao-Huang commented 1 day ago

Does qwen2-vl support training with multiple images and multi-turn dialogue?

Yes, it is supported.
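For reference, a multi-image, multi-turn sample in the custom-dataset format above would look roughly like this (paths are placeholders; the <image> tags across history and query are matched to the images list in order):

{"query": "<image>query3", "response": "response3", "history": [["<image>query1", "response1"], ["query2", "response2"]], "images": ["image_path1", "image_path2"]}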

Jintao-Huang commented 1 day ago

can I train 72b with 2 * A6000? (2 * 48GB)

Maybe with QLoRA:

# GPU Memory: 2 * 28GB
SIZE_FACTOR=8 MAX_PIXELS=602112 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
  --model_type qwen2-vl-72b-instruct-gptq-int4 \
  --model_id_or_path qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4 \
  --sft_type lora \
  --dataset latex-ocr-print#20000

Jintao-Huang commented 1 day ago

Or LoRA & device_map (without quantization):

# GPU Memory: 2 * 75GB
SIZE_FACTOR=8 MAX_PIXELS=602112 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
  --model_type qwen2-vl-72b-instruct \
  --model_id_or_path qwen/Qwen2-VL-72B-Instruct \
  --sft_type lora \
  --dataset latex-ocr-print#20000