hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

RuntimeError: Tensors must be CUDA and dense during multi-GPU RLHF training #456

Closed: liangjh2001 closed 1 year ago

liangjh2001 commented 1 year ago

Hello, when I run multi-GPU RLHF training I get RuntimeError: Tensors must be CUDA and dense. Single-GPU training does not fail at this step, but a single GPU does not have enough memory for my run. Do you know what the problem is?

The arguments are as follows:

accelerate launch src/train_bash.py \
    --stage ppo \
    --model_name_or_path "../model_set/llama-7b" \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --prompt_template default \
    --finetuning_type lora \
    --lora_target "q_proj,v_proj" \
    --resume_lora_training False \
    --reward_model "./output/llama-7b-rm" \
    --output_dir "./output/llama-7b-ppo" \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --quantization_bit 4 \
    --fp16 \
    --plot_loss

The error message is as follows:

Traceback (most recent call last):
  File "/data0/ljh/LLaMA-Efficient-Tuning-main/src/train_bash.py", line 23, in <module>
    main()
  File "/data0/ljh/LLaMA-Efficient-Tuning-main/src/train_bash.py", line 14, in main
    run_ppo(model_args, data_args, training_args, finetuning_args)
  File "/data0/ljh/LLaMA-Efficient-Tuning-main/src/llmtuner/tuner/ppo/workflow.py", line 52, in run_ppo
    ppo_trainer = PPOPeftTrainer(
  File "/data0/ljh/LLaMA-Efficient-Tuning-main/src/llmtuner/tuner/ppo/trainer.py", line 35, in __init__
    PPOTrainer.__init__(self, **kwargs)
  File "/home/jhliang/yes/envs/qlora/lib/python3.9/site-packages/trl/trainer/ppo_trainer.py", line 293, in __init__
    ) = self.accelerator.prepare(
  File "/home/jhliang/yes/envs/qlora/lib/python3.9/site-packages/accelerate/accelerator.py", line 1251, in prepare
    result = tuple(
  File "/home/jhliang/yes/envs/qlora/lib/python3.9/site-packages/accelerate/accelerator.py", line 1252, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/jhliang/yes/envs/qlora/lib/python3.9/site-packages/accelerate/accelerator.py", line 1079, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/jhliang/yes/envs/qlora/lib/python3.9/site-packages/accelerate/accelerator.py", line 1389, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/jhliang/yes/envs/qlora/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 676, in __init__
    _sync_module_states(
  File "/home/jhliang/yes/envs/qlora/lib/python3.9/site-packages/torch/distributed/utils.py", line 142, in _sync_module_states
    _sync_params_and_buffers(
  File "/home/jhliang/yes/envs/qlora/lib/python3.9/site-packages/torch/distributed/utils.py", line 160, in _sync_params_and_buffers
    dist._broadcast_coalesced(
RuntimeError: Tensors must be CUDA and dense

The accelerate config file is as follows:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
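A note on the config, as an aside rather than something shown in the thread: a YAML file like this is normally produced by running accelerate config, and it can be passed explicitly at launch via --config_file to rule out a stale default config being picked up (the file name below is hypothetical):

# hypothetical local path for the YAML shown above
accelerate launch --config_file ./accelerate_config.yaml \
    src/train_bash.py  # followed by the same script arguments as in the command above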

hiyouga commented 1 year ago

QLoRA does not support multi-GPU training at the moment.

liangjh2001 commented 1 year ago

QLoRA does not support multi-GPU training at the moment.

Is QLoRA unsupported on multiple GPUs only in the PPO stage? I was able to run the earlier SFT and RM stages with QLoRA on multiple GPUs.

liangjh2001 commented 1 year ago

QLoRA does not support multi-GPU training at the moment.

I switched to launching with DeepSpeed and it runs now. Could this be an issue in the accelerate library?

deepspeed --num_gpus=4 src/train_bash.py \
    --stage ppo \
    --model_name_or_path "../model_set/llama-7b" \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --prompt_template default \
    --finetuning_type lora \
    --lora_target "q_proj,v_proj" \
    --resume_lora_training False \
    --reward_model "./output/llama-7b-rm" \
    --output_dir "./output/llama-7b-ppo" \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --quantization_bit 4 \
    --fp16 \
    --plot_loss \
    --deepspeed deepspeed_config.json
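The contents of deepspeed_config.json are not shown in the thread; a minimal ZeRO stage-2 sketch consistent with the flags above (micro-batch 2 per GPU, 4 accumulation steps, fp16) could look like:

{
  "train_micro_batch_size_per_gpu": 2,
  "gradient_accumulation_steps": 4,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2
  }
}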

hiyouga commented 1 year ago

QLoRA is not compatible with multi-GPU PPO training.
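A possible workaround sketch, not from this thread and assuming each GPU has enough memory for non-quantized fp16 LoRA: drop --quantization_bit 4 and keep the rest of the original command unchanged, so the parameter broadcast that fails in the traceback above only sees ordinary dense CUDA tensors:

accelerate launch src/train_bash.py \
    --stage ppo \
    --model_name_or_path "../model_set/llama-7b" \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --prompt_template default \
    --finetuning_type lora \
    --lora_target "q_proj,v_proj" \
    --resume_lora_training False \
    --reward_model "./output/llama-7b-rm" \
    --output_dir "./output/llama-7b-ppo" \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --fp16 \
    --plot_loss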