Closed: SparrowZheyuan18 closed this issue 1 month ago.
When I use DeepSpeed, another error occurs while DeepSpeed is initializing:
File "/workspace/lib/python3.10/site-packages/deepspeed/runtime/lr_schedules.py", line 814, in __init__
self.warmup_num_steps = max(2, warmup_num_steps)
TypeError: '>' not supported between instances of 'str' and 'int'
I've found the same issue in https://github.com/LianjiaTech/BELLE/issues/558. The script I use is:
nproc_per_node=4
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=$nproc_per_node \
MASTER_PORT=29500 \
swift rlhf \
--rlhf_type dpo \
--model_type minicpm-v-v2_5-chat \
--model_id_or_path /workspace/MiniCPM-V/merged_MiniCPM-Llama3-V-2_5 \
--ref_model_type minicpm-v-v2_5-chat \
--ref_model_id_or_path /workspace/MiniCPM-V/merged_MiniCPM-Llama3-V-2_5 \
--sft_type lora \
--tuner_backend swift \
--dtype AUTO \
--output_dir output/minicpm_dpo \
--dataset /workspace/DPO/data/dpo_data.jsonl \
--beta 0.1 \
--sft_beta 0.1 \
--num_train_epochs 4 \
--max_length 1200 \
--max_prompt_length 512 \
--check_dataset_strategy none \
--lora_rank 8 \
--lora_alpha 32 \
--lora_dropout 0.05 \
--lora_target_modules ALL \
--gradient_checkpointing true \
--batch_size 1 \
--weight_decay 0.1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
--max_grad_norm 1.0 \
--warmup_ratio 0.03 \
--eval_steps 2000 \
--save_steps 100 \
--save_total_limit 2 \
--logging_steps 10 \
--use_flash_attn true \
--deepspeed zero3-offload
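For reference, with nproc_per_node=4 the accumulation flag above resolves as follows (a plain arithmetic sketch using the standard data-parallel accounting, not anything swift-specific):

# Hypothetical check of the effective batch size implied by the flags above.
nproc_per_node = 4
batch_size_per_gpu = 1                                   # --batch_size 1
gradient_accumulation_steps = 16 // nproc_per_node       # $(expr 16 / 4) -> 4
global_batch = batch_size_per_gpu * gradient_accumulation_steps * nproc_per_node
print(gradient_accumulation_steps, global_batch)         # 4 16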
Please run with --tuner_backend peft --lazy_tokenize true.
And may I ask how many rows are in the data file?
The DeepSpeed problem is because our ds config has fields whose value is "auto"; this will be fixed today.
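For context, the TypeError above can be reproduced in isolation: DeepSpeed's scheduler calls max(2, warmup_num_steps), which fails if the config still contains the literal string "auto" instead of an integer the trainer integration was supposed to fill in. A minimal sketch, not swift's actual config-resolution code:

# Minimal reproduction of the scheduler error (assumption: warmup_num_steps
# was left as the string "auto" in the DeepSpeed config instead of being
# resolved to an integer before DeepSpeed initialization).
warmup_num_steps = "auto"
try:
    max(2, warmup_num_steps)          # what lr_schedules.py does internally
except TypeError as err:
    print(err)                        # '>' not supported between instances of 'str' and 'int'
# Until the shipped config is patched, a user-side workaround is to pass a
# real integer here, e.g. one computed from --warmup_ratio and the total
# number of training steps.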
In this case, I have only 2000 rows in the dataset. Thanks for your help :)
I used peft as the backend with lazy_tokenize, and I've encountered another problem:
Traceback (most recent call last):
File "/workspace/miniconda3/envs/geoguessr/lib/python3.10/site-packages/swift/cli/rlhf.py", line 5, in
The error seems to occur because of padding.
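For what it's worth, "padding" here refers to bringing the chosen and rejected sequences in a DPO batch to a common length before stacking them. A generic PyTorch sketch of that step (not swift's actual collator):

import torch
from torch.nn.utils.rnn import pad_sequence

# Generic illustration, not swift's collator: chosen and rejected token
# sequences are padded to one common length so they can be stacked.
pad_token_id = 0  # assumption; in practice this comes from the tokenizer
chosen = [torch.tensor([1, 2, 3, 4]), torch.tensor([1, 2])]
rejected = [torch.tensor([1, 2, 3]), torch.tensor([1, 2, 3, 4, 5])]
batch = pad_sequence(chosen + rejected, batch_first=True, padding_value=pad_token_id)
print(batch.shape)  # torch.Size([4, 5]); a mismatched pad id or padding side is a typical source of such errors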
Fixed. Please use the main branch for experimentation.
Describe the bug
I encountered an OOM error when trying to run DPO on MiniCPM-Llama3-V-2.5 with my own dataset on 4 RTX 6000 Ada GPUs. The OOM error seems to occur at the
part of the code. Why is this happening? Do you have any solution to this? Thanks!
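One generic way to narrow down where the memory goes (a plain PyTorch sketch, independent of swift) is to log per-GPU memory around the suspected call:

import torch

# Generic helper (not part of swift) to print per-GPU memory usage around
# the call where the OOM seems to happen.
def log_cuda_mem(tag: str) -> None:
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 2**30
        reserved = torch.cuda.memory_reserved(i) / 2**30
        print(f"[{tag}] cuda:{i} allocated={alloc:.2f} GiB reserved={reserved:.2f} GiB")

# Usage: call log_cuda_mem("before forward") and log_cuda_mem("after forward")
# around the suspected line to see which step exhausts memory on each card.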
My training script:
nproc_per_node=2
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=$nproc_per_node \
MASTER_PORT=29500 \
swift rlhf \
--rlhf_type dpo \
--model_type minicpm-v-v2_5-chat \
--model_id_or_path workspace/MiniCPM-V/merged_MiniCPM-Llama3-V-2_5 \
--ref_model_type minicpm-v-v2_5-chat \
--ref_model_id_or_path workspace/MiniCPM-V/merged_MiniCPM-Llama3-V-2_5 \
--sft_type lora \
--tuner_backend swift \
--dtype AUTO \
--output_dir output/minicpm_dpo \
--dataset /workspace/DPO/data/dpo_data.jsonl \
--beta 0.1 \
--sft_beta 0.1 \
--num_train_epochs 4 \
--max_length 1200 \
--max_prompt_length 512 \
--check_dataset_strategy none \
--lora_rank 8 \
--lora_alpha 32 \
--lora_dropout 0.05 \
--lora_target_modules DEFAULT \
--gradient_checkpointing true \
--batch_size 1 \
--weight_decay 0.1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
--max_grad_norm 1.0 \
--warmup_ratio 0.03 \
--eval_steps 2000 \
--save_steps 100 \
--save_total_limit 2 \
--logging_steps 10 \
--use_flash_attn true
Your hardware and system info
4 x RTX 6000 Ada GPUs (as noted above).