Best practice for Qwen2-Audio

Jintao-Huang commented 1 month ago

Environment Preparation

# Install ms-swift
pip install git+https://github.com/modelscope/swift.git#egg=ms-swift[llm]

# Install the latest transformers
pip install git+https://github.com/huggingface/transformers.git

pip install librosa

Inference

instruct model:

CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen2-audio-7b-instruct
# If the model is at a local path
CUDA_VISIBLE_DEVICES=0 swift infer \
    --model_type qwen2-audio-7b-instruct \
    --model_id_or_path '<local_path>'

Inference result:

<<< <audio>
Input an audio path or URL <<< https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav
Yes, I can guess that you are a female in your twenties.
--------------------------------------------------
<<< <audio>
Input an audio path or URL <<< https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav
每个人都希望被欣赏，所以如果你欣赏某人，不要把它保密。
--------------------------------------------------
<<< clear
<<< 你是谁
我是来自达摩院的语言模型，我叫通义千问。

Using Python:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType,
    get_default_template_type, inference_stream
)
from swift.utils import seed_everything
import torch

model_type = ModelType.qwen2_audio_7b_instruct
model_id_or_path = None
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')

model, tokenizer = get_model_tokenizer(model_type, torch.float16, model_id_or_path=model_id_or_path,
                                       model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)

query = '<audio>这段语音说了什么'
audios = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav']
response, history = inference(model, template, query, audios=audios)
print(f'query: {query}')
print(f'response: {response}')

# Streaming
query = '这段语音是男生还是女生'
gen = inference_stream(model, template, query, history, audios=audios)
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, history in gen:
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()
print(f'history: {history}')
"""
query: <audio>这段语音说了什么
response: 这段语音说的是:'今天天气真好呀'
query: 这段语音是男生还是女生
response: 男声。
history: [['<audio>这段语音说了什么', "这段语音说的是:'今天天气真好呀'"], ['这段语音是男生还是女生', '男声。']]
"""

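The streaming loop above prints only the new suffix on each iteration, which works because every yielded `response` is the full text generated so far. A minimal sketch of that pattern follows. It uses a hypothetical `fake_stream` generator as a stand-in for `inference_stream`, so no model is loaded; the query and answer strings are taken from the sample output above.

```python
from typing import Iterator, List, Tuple

History = List[List[str]]


def fake_stream(query: str, full_response: str) -> Iterator[Tuple[str, History]]:
    """Hypothetical stand-in for a streaming call: yields the cumulative response."""
    for end in range(1, len(full_response) + 1):
        partial = full_response[:end]
        yield partial, [[query, partial]]


def collect_deltas(gen: Iterator[Tuple[str, History]]) -> Tuple[List[str], History]:
    """Return the newly generated pieces, tracking how much was already emitted."""
    deltas: List[str] = []
    history: History = []
    print_idx = 0
    for response, history in gen:
        delta = response[print_idx:]  # only the part not emitted yet
        if delta:
            deltas.append(delta)
        print_idx = len(response)
    return deltas, history


deltas, history = collect_deltas(fake_stream('这段语音是男生还是女生', '男声。'))
print(''.join(deltas))
print(history)
```

Joining the deltas reproduces the final response, which is why the loop never prints a character twice.
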
Memory usage:

Base Model:

CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen2-audio-7b

Inference result:

<<< <audio>Generate the caption in English:
Input an audio path or URL <<< https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/glass-breaking-151256.mp3
Glass is breaking.

Jintao-Huang commented 1 month ago

Fine-tuning

Multimodal large models are usually fine-tuned on custom datasets. Here we show a demo that runs as-is. It uses the aishell1-zh-mini dataset, which is available on ModelScope at: https://modelscope.cn/datasets/speech_asr/speech_asr_aishell1_trainsets

Using Python:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import sft_main, SftArguments, ModelType, DatasetName

sft_main(SftArguments(model_type=ModelType.qwen2_audio_7b_instruct,
                      model_id_or_path=None,
                      dataset=[DatasetName.aishell1_zh_mini]))

ZeRO2:

# For a local path, add `--model_id_or_path <local_path>`
NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
    --model_type qwen2-audio-7b-instruct \
    --dataset aishell1-zh-mini \
    --deepspeed default-zero2

To use a custom dataset, specify it as follows:

# val_dataset is optional; if it is not specified, part of dataset is split off as the validation set
    --dataset train.jsonl \
    --val_dataset val.jsonl \
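
If a fixed validation set is preferred over the automatic split, the two files can be produced beforehand. A minimal sketch, assuming a JSONL dataset with one sample per line. The helper name `split_jsonl`, the 1% ratio, and the seed are illustrative choices, not ms-swift defaults.

```python
import json
import random
from typing import List, Tuple


def split_jsonl(lines: List[str], val_ratio: float = 0.01, seed: int = 42) -> Tuple[List[str], List[str]]:
    """Shuffle non-empty JSONL lines and split off a validation part (at least 1 sample)."""
    samples = [line for line in lines if line.strip()]
    for line in samples:
        json.loads(line)  # fail early on malformed JSON
    rng = random.Random(seed)
    rng.shuffle(samples)
    n_val = max(1, int(len(samples) * val_ratio))
    return samples[n_val:], samples[:n_val]


# Hypothetical in-memory dataset in the query/response/audios shape.
demo = [json.dumps({'query': f'<audio>q{i}', 'response': f'r{i}', 'audios': [f'a{i}.wav']})
        for i in range(200)]
train, val = split_jsonl(demo)
print(len(train), len(val))
```

Writing `train` and `val` line by line to `train.jsonl` and `val.jsonl` gives the two files passed to `--dataset` and `--val_dataset`.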

Custom datasets support JSON and JSONL formats. Two custom dataset formats are shown below:

[
    {"conversations": [
        {"from": "user", "value": "<audio>audio_path</audio>11111"},
        {"from": "assistant", "value": "22222"}
    ]},
    {"conversations": [
        {"from": "user", "value": "<audio>audio_path</audio><audio>audio_path2</audio><audio>audio_path3</audio>aaaaa"},
        {"from": "assistant", "value": "bbbbb"},
        {"from": "user", "value": "<audio>audio_path</audio>ccccc"},
        {"from": "assistant", "value": "ddddd"}
    ]},
    {"conversations": [
        {"from": "user", "value": "AAAAA"},
        {"from": "assistant", "value": "BBBBB"},
        {"from": "user", "value": "CCCCC"},
        {"from": "assistant", "value": "DDDDD"}
    ]}
]

{"query": "<audio>55555", "response": "66666", "audios": ["audio_path"]}
{"query": "<audio><audio>eeeee", "response": "fffff", "history": [], "audios": ["audio_path1", "audio_path2"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]]}
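
The two formats carry the same information in different shapes. The sketch below converts a `conversations` record into the `query`/`response`/`history`/`audios` shape by moving each inline `<audio>path</audio>` into the `audios` list and leaving a bare `<audio>` tag in the text. It assumes that tags and paths correspond in order, and that tags in history turns are handled the same way as in the query, as the examples above suggest. `conversations_to_query` is a hypothetical helper, not part of ms-swift.

```python
import json
import re
from typing import Any, Dict, List

AUDIO_TAG = re.compile(r'<audio>(.*?)</audio>')


def conversations_to_query(record: Dict[str, Any]) -> Dict[str, Any]:
    """Convert one `conversations` record to the flat query/response shape."""
    turns = record['conversations']
    if not turns or len(turns) % 2 != 0:
        raise ValueError('expected alternating user/assistant turns')
    audios: List[str] = []
    pairs: List[List[str]] = []
    for user, assistant in zip(turns[0::2], turns[1::2]):
        if user['from'] != 'user' or assistant['from'] != 'assistant':
            raise ValueError('expected alternating user/assistant turns')
        audios.extend(AUDIO_TAG.findall(user['value']))
        # Keep a bare <audio> placeholder where each path used to be.
        text = AUDIO_TAG.sub('<audio>', user['value'])
        pairs.append([text, assistant['value']])
    query, response = pairs[-1]
    flat: Dict[str, Any] = {'query': query, 'response': response, 'history': pairs[:-1]}
    if audios:
        flat['audios'] = audios
    return flat


# The second record from the JSON example above.
record = {'conversations': [
    {'from': 'user', 'value': '<audio>audio_path</audio><audio>audio_path2</audio><audio>audio_path3</audio>aaaaa'},
    {'from': 'assistant', 'value': 'bbbbb'},
    {'from': 'user', 'value': '<audio>audio_path</audio>ccccc'},
    {'from': 'assistant', 'value': 'ddddd'},
]}
flat = conversations_to_query(record)
print(json.dumps(flat, ensure_ascii=False))
```

A useful sanity check on either format is that the number of `<audio>` tags in the text equals the number of audio paths.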

Memory usage:

Inference script after fine-tuning:

CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/qwen2-audio-7b-instruct/vx-xxx/checkpoint-xxx \
    --load_dataset_config true

# merge-lora and inference
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/qwen2-audio-7b-instruct/vx-xxx/checkpoint-xxx \
    --load_dataset_config true --merge_lora true
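
Merging LoRA folds the low-rank update into the base weight, so inference no longer needs the adapter: W_merged = W + (alpha / r) * B A. The sketch below checks on a toy example that the merged weight gives the same output as the base weight plus the adapter. It is a generic illustration of the LoRA formula in plain Python, not the ms-swift implementation, and all matrices and values are made up.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w: Matrix, lora_b: Matrix, lora_a: Matrix, alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy shapes: W is 2x3 (out x in), B is 2x2 (out x rank), A is 2x3 (rank x in).
w = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
lora_b = [[1.0, 0.0], [0.5, -1.0]]
lora_a = [[0.0, 1.0, 0.0], [2.0, 0.0, 1.0]]
alpha, rank = 8.0, 2
x = [[1.0], [2.0], [3.0]]  # one input column vector

merged = merge_lora(w, lora_b, lora_a, alpha, rank)
y_merged = matmul(merged, x)

# Unmerged path: base output plus the scaled adapter output.
y_base = matmul(w, x)
y_adapter = matmul(lora_b, matmul(lora_a, x))
y_unmerged = [[b[0] + (alpha / rank) * a[0]] for b, a in zip(y_base, y_adapter)]

print(y_merged, y_unmerged)
```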

Example of the fine-tuned model running inference on the validation set. Due to time constraints, only 400 steps were run:

heiyonghua commented 1 month ago

The training log does not report an acc value. Is this a problem with my settings?

export WANDB_API_KEY=""

swift sft \
    --model_type qwen2-audio-7b-instruct \
    --model_id_or_path "" \
    --sft_type full \
    --freeze_parameters 0.999 \
    --template_type AUTO \
    --dtype AUTO \
    --output_dir output \
    --custom_train_dataset_path "" \
    --val_dataset '' \
    --val_dataset_sample -1 \
    --train_dataset_sample -1 \
    --num_train_epochs 1 \
    --max_length 2048 \
    --check_dataset_strategy warning \
    --gradient_checkpointing true \
    --batch_size 1 \
    --weight_decay 0.1 \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps 32 \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 10 \
    --lazy_tokenize true \
    --evaluation_strategy 'no' \
    --system '' \
    --save_strategy "steps" \
    --report_to 'wandb' \
    --acc_strategy 'token' \
    --acc_steps 10

JulianGerhard21 commented 1 month ago

Hi @Jintao-Huang ,

I'd be interested in fine-tuning it further to improve its German. Are there any plans to include this architecture in mergekit? My thoughts were to either:

Merge e.g. VAGOsolutions/Llama-3.1-SauerkrautLM-8b-Instruct into it (assuming that this merge won't damage the audio layers)
Fine-tune it on a German dataset (most likely synthetic)

Any hints on how to proceed?

Best Julian

zhangfan-algo commented 4 weeks ago

Which lora_target_modules can be selected when fine-tuning Qwen2-Audio?

zhangfan-algo commented 4 weeks ago

I checked, and peft_config.target_modules is empty.

Jintao-Huang commented 3 weeks ago

Which lora_target_modules can be selected when fine-tuning Qwen2-Audio?

https://github.com/modelscope/ms-swift/issues/1747

kindaQ commented 3 weeks ago

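A question: when I run LoRA SFT, after a few steps the loss becomes 0 and grad_norm becomes nan, and it stays that way from then on. I have tried different LoRA parameters and batch_size values, and the result always ends up at 0 and nan; only the step at which it starts differs. Could anyone suggest where the problem might be?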

{'loss': 2.09631252, 'grad_norm': 7.01568747, 'learning_rate': 3.4e-07, 'memory(GiB)': 63.49, 'train_speed(iter/s)': 0.027149, 'epoch': 0.0, 'global_step/max_steps': '1/5828', 'percentage': '0.02%', 'elapsed_time': '34s', 'remaining_time': '2d 7h 26m 48s'}
{'loss': 1.99507056, 'grad_norm': 6.31390953, 'learning_rate': 3.42e-06, 'memory(GiB)': 63.49, 'train_speed(iter/s)': 0.029137, 'epoch': 0.0, 'global_step/max_steps': '10/5828', 'percentage': '0.17%', 'elapsed_time': '5m 40s', 'remaining_time': '2d 7h 2m 59s'}
{'loss': 1.66510525, 'grad_norm': 4.81519556, 'learning_rate': 6.85e-06, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029251, 'epoch': 0.0, 'global_step/max_steps': '20/5828', 'percentage': '0.34%', 'elapsed_time': '11m 21s', 'remaining_time': '2d 6h 56m 50s'}
{'loss': 1.06762638, 'grad_norm': 3.60125303, 'learning_rate': 1.027e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029195, 'epoch': 0.01, 'global_step/max_steps': '30/5828', 'percentage': '0.51%', 'elapsed_time': '17m 4s', 'remaining_time': '2d 7h 1m 36s'}
{'loss': 0.48049116, 'grad_norm': 1.70112872, 'learning_rate': 1.37e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029208, 'epoch': 0.01, 'global_step/max_steps': '40/5828', 'percentage': '0.69%', 'elapsed_time': '22m 46s', 'remaining_time': '2d 6h 56m 30s'}
{'loss': 1.17152777, 'grad_norm': nan, 'learning_rate': 1.712e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029222, 'epoch': 0.01, 'global_step/max_steps': '50/5828', 'percentage': '0.86%', 'elapsed_time': '28m 28s', 'remaining_time': '2d 6h 50m 28s'}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.055e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029447, 'epoch': 0.01, 'global_step/max_steps': '60/5828', 'percentage': '1.03%', 'elapsed_time': '33m 54s', 'remaining_time': '2d 6h 20m 28s'}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.397e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.0296, 'epoch': 0.01, 'global_step/max_steps': '70/5828', 'percentage': '1.20%', 'elapsed_time': '39m 22s', 'remaining_time': '2d 5h 58m 35s'}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.74e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029741, 'epoch': 0.01, 'global_step/max_steps': '80/5828', 'percentage': '1.37%', 'elapsed_time': '44m 47s', 'remaining_time': '2d 5h 38m 5s'}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.082e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029822, 'epoch': 0.02, 'global_step/max_steps': '90/5828', 'percentage': '1.54%', 'elapsed_time': '50m 15s', 'remaining_time': '2d 5h 24m 5s'}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.425e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029877, 'epoch': 0.02, 'global_step/max_steps': '100/5828', 'percentage': '1.72%', 'elapsed_time': '55m 44s', 'remaining_time': '2d 5h 12m 53s'}

Data format

 {"conversations": [{"from": "user", "value": "<audio>xxxx.wav</audio>textabcd"}, {"from": "assistant", "value": "texthijk"}]}

Command-line arguments

OMP_NUM_THREADS=4 NPROC_PER_NODE=2 CUDA_VISIBLE_DEVICES=3,4 swift sft \
        --model_type qwen2-audio-7b-instruct \
        --model_id_or_path ./Qwen2-Audio-7B-Instruct \
        --tuner_backend peft \
        --dataset ./total_audios_prompt_qwen2.jsonl \
        --dataset_test_ratio 0.01 \
        --dataloader_num_workers 1 \
        --report_to "none" \
        --max_length 1024 \
        --save_steps 100 \
        --eval_steps 100 \
        --logging_steps 10 \
        --batch_size 16 \
        --gradient_accumulation_steps 5 \
        --output_dir output \
        --save_total_limit 50 \
        --lazy_tokenize true \
        --preprocess_num_proc 1 \
        --weight_decay 0.1 \
        --learning_rate 1e-4 \
        --sft_type lora \
        --lora_rank 8 \
        --lora_alpha 32 \
        --use_flash_attn false \
        --dtype bf16 \
        --warmup_ratio 0.05 \
        --num_train_epochs 1
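
The learning rates in the log above are consistent with a linear warmup: with `--warmup_ratio 0.05` and 5828 total steps, warmup would span ceil(0.05 × 5828) = 292 steps, so the rate at step s would be 1e-4 × s / 292. The sketch below recomputes the logged values under that assumption. The linear schedule and the ceiling rounding are inferred from the numbers, not stated in the thread.

```python
import math

PEAK_LR = 1e-4        # --learning_rate
WARMUP_RATIO = 0.05   # --warmup_ratio
MAX_STEPS = 5828      # from 'global_step/max_steps' in the log

# (step, learning_rate) pairs copied from the log lines.
LOGGED = [(1, 3.4e-07), (10, 3.42e-06), (20, 6.85e-06), (30, 1.027e-05), (40, 1.37e-05),
          (50, 1.712e-05), (60, 2.055e-05), (70, 2.397e-05), (80, 2.74e-05),
          (90, 3.082e-05), (100, 3.425e-05)]

warmup_steps = math.ceil(WARMUP_RATIO * MAX_STEPS)


def linear_warmup_lr(step: int) -> float:
    """Learning rate during a linear warmup from 0 to PEAK_LR."""
    return PEAK_LR * min(1.0, step / warmup_steps)


# The log appears to round to 8 decimal places, so allow about half a unit of error.
mismatches = [(s, lr) for s, lr in LOGGED if abs(linear_warmup_lr(s) - lr) > 5.1e-9]
print(f'warmup_steps={warmup_steps}, mismatches={mismatches}')
```

Under this reading, `grad_norm` first becomes nan at step 50, while the learning rate is still well below its 1e-4 peak.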

Winterspringkle commented 2 weeks ago

How can batch inference be done with the swift framework? Is there a parameter like --batch-size to speed up the swift infer script?

zhanghanweii commented 2 weeks ago

How can vllm or lmdeploy be used for acceleration?

HyacinthJingjing commented 4 days ago

When fine-tuning Qwen2-Audio, can LoRA be used to train only the audio-encoder part? How should this be configured? @Jintao-Huang

Winterspringkle commented 3 days ago

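A question: when I run LoRA SFT, after a few steps the loss becomes 0 and grad_norm becomes nan, and it stays that way from then on. I have tried different LoRA parameters and batch_size values, and the result always ends up at 0 and nan; only the step at which it starts differs. Could anyone suggest where the problem might be?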

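I ran into this problem too. Fine-tuning went smoothly on one dataset, but on another dataset NaN appeared consistently after a few steps. After a lot of troubleshooting, I finally found that corrupted data was being read. I suggest adding logging to trainer.py in the transformers package and, once NaN appears, double-checking the data of the current step and the previous step.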
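
One way to act on this before training is to scan the dataset for audio files that are missing, empty, or unreadable. A minimal sketch, assuming the `conversations` format with inline `<audio>path</audio>` tags shown earlier and PCM WAV files. It uses only the standard-library `wave` module, and `check_wav` and `find_bad_audio` are hypothetical helpers. Other formats such as MP3 would need a different reader, for example `librosa.load`.

```python
import json
import os
import re
import wave
from typing import List, Tuple

AUDIO_TAG = re.compile(r'<audio>(.*?)</audio>')


def check_wav(path: str) -> str:
    """Return an empty string if the WAV file looks readable, else a reason."""
    if not os.path.isfile(path):
        return 'missing file'
    if os.path.getsize(path) == 0:
        return 'empty file'
    try:
        with wave.open(path, 'rb') as wav:
            n_frames = wav.getnframes()
            if n_frames == 0:
                return 'no audio frames'
            expected = n_frames * wav.getnchannels() * wav.getsampwidth()
            # A truncated file returns fewer bytes than the header promises.
            if len(wav.readframes(n_frames)) != expected:
                return 'truncated audio data'
    except (wave.Error, EOFError) as exc:
        return f'unreadable: {exc}'
    return ''


def find_bad_audio(jsonl_path: str) -> List[Tuple[int, str, str]]:
    """Return (line_number, audio_path, reason) for every problematic audio reference."""
    bad = []
    with open(jsonl_path, encoding='utf-8') as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            record = json.loads(line)
            for turn in record.get('conversations', []):
                for path in AUDIO_TAG.findall(turn['value']):
                    reason = check_wav(path)
                    if reason:
                        bad.append((line_no, path, reason))
    return bad
```

Calling `find_bad_audio` on the training JSONL returns the line numbers to inspect or drop. It does not catch files that decode cleanly but contain invalid sample values.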
