microsoft / DeepSpeedExamples

Example models using DeepSpeed

[Update request] DeepSpeed-Chat's code does not support the Hybrid Engine of DeepSpeed 0.9.5 #650

Open Looong01 opened 1 year ago

Looong01 commented 1 year ago

Steps 1 & 2 run without problems, but Step 3 fails with a runtime error: The size of tensor a (6144) must match the size of tensor b (8192) at non-singleton dimension 0

PyTorch: 1.12.1, 1.13.1, 2.0.1; CUDA: 11.8; DeepSpeed: 0.9.5


Error message:

******************[end] Initialized Reward Model [end] (duration: 1.96s)******************
***** Running training *****
Beginning of Epoch 1/1, Total Generation Batches 1907
Traceback (most recent call last):
  File "/home/yang/Codes/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 521, in <module>
    main()
  File "/home/yang/Codes/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 429, in main
    out = trainer.generate_experience(batch_prompt['prompt'],
  File "/home/yang/Codes/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 101, in generate_experience
    seq = self._generate_sequence(prompts, mask)
  File "/home/yang/Codes/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 74, in _generate_sequence
    seq = self.actor_model.module.generate(
  File "/home/yang/miniconda3/envs/PyTorch112/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py", line 263, in generate
    self.fuse_lora_weight()
  File "/home/yang/miniconda3/envs/PyTorch112/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py", line 141, in fuse_lora_weight
    self._fuse_lora(self.layer_params[layer_id], self.lora_params[layer_id])
  File "/home/yang/miniconda3/envs/PyTorch112/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py", line 137, in _fuse_lora
    weight.data += lora_scaling * torch.matmul(lora_left_weight.t(), lora_right_weight.t())
RuntimeError: The size of tensor a (6144) must match the size of tensor b (8192) at non-singleton dimension 0
[2023-07-18 15:27:17,450] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 329353
[2023-07-18 15:27:17,450] [ERROR] [launch.py:321:sigkill_handler] ['/home/yang/miniconda3/envs/PyTorch112/bin/python', '-u', 'main.py', '--local_rank=0', '--actor_model_name_or_path', '/home/yang/Codes/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b', '--critic_model_name_or_path', '/home/yang/Codes/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m', '--actor_zero_stage', '0', '--critic_zero_stage', '0', '--num_padding_at_beginning', '1', '--gradient_accumulation_steps', '2', '--deepspeed', '--actor_lora_dim', '128', '--enable_hybrid_engine', '--actor_gradient_checkpointing', '--disable_actor_dropout', '--output_dir', '/home/yang/Codes/DeepSpeedExamples/applications/DeepSpeed-Chat/output/step3-models/1.3b'] exits with return code = 1
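
For reference, here is a minimal sketch of why the fuse step fails (this is not the DeepSpeed source; the shapes and the pairing mix-up are assumptions inferred from the traceback). For the 1.3b OPT actor the hidden size is 2048, so 6144 = 3 × 2048 matches a fused QKV projection and 8192 = 4 × 2048 matches the FFN intermediate size, which suggests the Hybrid Engine may be pairing LoRA factors with the wrong base weight after the 0.9.5 changes:

```python
import torch

# Illustrative shapes only: hidden size 2048 as in the 1.3b OPT actor,
# LoRA rank 128 as passed via --actor_lora_dim.
hidden, rank = 2048, 128

weight = torch.zeros(3 * hidden, hidden)          # fused QKV weight: (6144, 2048)
lora_left_weight = torch.zeros(rank, 4 * hidden)  # LoRA factor sized for the FFN: (128, 8192)
lora_right_weight = torch.zeros(hidden, rank)     # (2048, 128)
lora_scaling = 1.0

# Same expression as _fuse_lora in deepspeed/runtime/hybrid_engine.py.
# The matmul produces (8192, 2048), which cannot be added in place to (6144, 2048):
weight.data += lora_scaling * torch.matmul(lora_left_weight.t(), lora_right_weight.t())
# RuntimeError: The size of tensor a (6144) must match the size of tensor b (8192)
# at non-singleton dimension 0
```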
Looong01 commented 1 year ago

#587 solves this problem.

Looong01 commented 1 year ago

The latest DeepSpeed release updated some of the Hybrid Engine code, but the DeepSpeed-Chat code in this example still does not support it.

Please add support!