liangjh2001 closed this issue 1 year ago.
QLoRA does not currently support multi-GPU training.
Is it only the PPO stage where QLoRA doesn't support multiple GPUs? I was able to run the earlier SFT and RM stages with QLoRA on multiple GPUs.
I switched to DeepSpeed and it runs now. Could this be a problem with the accelerate library?
deepspeed --num_gpus=4 src/train_bash.py \
    --stage ppo \
    --model_name_or_path "../model_set/llama-7b" \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --prompt_template default \
    --finetuning_type lora \
    --lora_target "q_proj,v_proj" \
    --resume_lora_training False \
    --reward_model "./output/llama-7b-rm" \
    --output_dir "./output/llama-7b-ppo" \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --quantization_bit 4 \
    --fp16 \
    --plot_loss \
    --deepspeed deepspeed_config.json
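The deepspeed_config.json referenced above is not shown in the thread. A minimal sketch of what it might contain, assuming a ZeRO stage-2 fp16 setup that defers batch-size settings to the HF trainer via "auto" (the key names are standard DeepSpeed config fields, but the user's actual file is unknown):

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "fp16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2
  }
}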
QLoRA is incompatible with multi-GPU PPO.
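For context, the incompatibility comes from DistributedDataParallel's initialization: it broadcasts every parameter and buffer from rank 0 via torch.distributed._broadcast_coalesced, which only accepts dense CUDA tensors, while a 4-bit quantized model holds its weights in packed storage placed by accelerate rather than as ordinary dense fp16 parameters. A minimal diagnostic sketch (find_ddp_incompatible_tensors is a hypothetical helper, not part of this repo) that lists the tensors DDP would reject:

import torch

def find_ddp_incompatible_tensors(model: torch.nn.Module):
    """Return (name, device, layout) for tensors that DDP cannot broadcast.

    DistributedDataParallel.__init__ calls _sync_module_states, which
    requires every parameter and buffer to be a dense CUDA tensor.
    """
    bad = []
    for name, t in list(model.named_parameters()) + list(model.named_buffers()):
        if not t.is_cuda or t.layout != torch.strided:
            bad.append((name, t.device, t.layout))
    return bad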
Hello, when I run multi-GPU RLHF training I get RuntimeError: Tensors must be CUDA and dense. Single-GPU training does not raise this error at this step, but a single GPU doesn't have enough VRAM for me to train on. Do you know what the problem is?
The arguments are as follows:

accelerate launch src/train_bash.py \
    --stage ppo \
    --model_name_or_path "../model_set/llama-7b" \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --prompt_template default \
    --finetuning_type lora \
    --lora_target "q_proj,v_proj" \
    --resume_lora_training False \
    --reward_model "./output/llama-7b-rm" \
    --output_dir "./output/llama-7b-ppo" \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --quantization_bit 4 \
    --fp16 \
    --plot_loss
The error message is as follows:

Traceback (most recent call last):
  File "/data0/ljh/LLaMA-Efficient-Tuning-main/src/train_bash.py", line 23, in <module>
    main()
  File "/data0/ljh/LLaMA-Efficient-Tuning-main/src/train_bash.py", line 14, in main
    run_ppo(model_args, data_args, training_args, finetuning_args)
  File "/data0/ljh/LLaMA-Efficient-Tuning-main/src/llmtuner/tuner/ppo/workflow.py", line 52, in run_ppo
    ppo_trainer = PPOPeftTrainer(
  File "/data0/ljh/LLaMA-Efficient-Tuning-main/src/llmtuner/tuner/ppo/trainer.py", line 35, in __init__
    PPOTrainer.__init__(self, **kwargs)
  File "/home/jhliang/yes/envs/qlora/lib/python3.9/site-packages/trl/trainer/ppo_trainer.py", line 293, in __init__
    ) = self.accelerator.prepare(
  File "/home/jhliang/yes/envs/qlora/lib/python3.9/site-packages/accelerate/accelerator.py", line 1251, in prepare
    result = tuple(
  File "/home/jhliang/yes/envs/qlora/lib/python3.9/site-packages/accelerate/accelerator.py", line 1252, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/jhliang/yes/envs/qlora/lib/python3.9/site-packages/accelerate/accelerator.py", line 1079, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/jhliang/yes/envs/qlora/lib/python3.9/site-packages/accelerate/accelerator.py", line 1389, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/jhliang/yes/envs/qlora/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 676, in __init__
    _sync_module_states(
  File "/home/jhliang/yes/envs/qlora/lib/python3.9/site-packages/torch/distributed/utils.py", line 142, in _sync_module_states
    _sync_params_and_buffers(
  File "/home/jhliang/yes/envs/qlora/lib/python3.9/site-packages/torch/distributed/utils.py", line 160, in _sync_params_and_buffers
    dist._broadcast_coalesced(
RuntimeError: Tensors must be CUDA and dense
The accelerate config file is as follows:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
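For reference, a short sketch (assuming transformers and bitsandbytes are installed; the model path is reused from the thread) that loads the model in 4-bit and prints the quantized weights, showing the packed uint8 storage that the DDP broadcast step rejects:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model in 4-bit, as --quantization_bit 4 does.
model = AutoModelForCausalLM.from_pretrained(
    "../model_set/llama-7b",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map={"": 0},
)
for name, p in model.named_parameters():
    if p.dtype not in (torch.float16, torch.float32):
        print(name, p.dtype, p.device)  # quantized weight: packed uint8, not dense fp16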