Open · oolongoo opened this issue 1 year ago
@oolongoo -- can you please update to the latest DeepSpeedExamples and DeepSpeed and try again? Some LoRA-related fixes were merged today (https://github.com/microsoft/DeepSpeed/pull/3563), so please give it a try and let us know.
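For reference, a minimal sketch of one way to pick up those fixes, assuming a pip-installed DeepSpeed and a git checkout of DeepSpeedExamples (the path below is taken from the traceback in this thread; adjust branch and paths to your setup):

# Update DeepSpeed to the latest release.
pip install --upgrade deepspeed

# Refresh the DeepSpeedExamples checkout that contains DeepSpeed-Chat.
cd /mnt/DeepSpeedExamples
git pull origin master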
Got a new error with the newest master:
192.168.1.51: Traceback (most recent call last):
192.168.1.51: File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 521, in <module>
192.168.1.51: main()
192.168.1.51: File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 429, in main
192.168.1.51: out = trainer.generate_experience(prompts,
192.168.1.51: File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 101, in generate_experience
192.168.1.51: seq = self._generate_sequence(prompts, mask)
192.168.1.51: File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 73, in _generate_sequence
192.168.1.51: seq = self.actor_model.module.generate(
192.168.1.51: File "/opt/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py", line 234, in generate
192.168.1.51: generate_ret_vals = self._generate(*inputs, **kwargs)
192.168.1.51: File "/opt/anaconda3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
192.168.1.51: return func(*args, **kwargs)
192.168.1.51: File "/opt/anaconda3/lib/python3.10/site-packages/transformers/generation/utils.py", line 1527, in generate
192.168.1.51: return self.greedy_search(
192.168.1.51: File "/opt/anaconda3/lib/python3.10/site-packages/transformers/generation/utils.py", line 2349, in greedy_search
192.168.1.51: outputs = self(
192.168.1.51: File "/opt/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
192.168.1.51: result = forward_call(*input, **kwargs)
192.168.1.51: File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 944, in forward
192.168.1.51: outputs = self.model.decoder(
192.168.1.51: File "/opt/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
192.168.1.51: result = forward_call(*input, **kwargs)
192.168.1.51: File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 650, in forward
192.168.1.51: causal_attention_mask = self._prepare_decoder_attention_mask(
192.168.1.51: File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 551, in _prepare_decoder_attention_mask
192.168.1.51: expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
192.168.1.51: RuntimeError: The size of tensor a (4) must match the size of tensor b (16) at non-singleton dimension 0
Got the same error.
Similar error here, except the sizes are 6144 and 8192.
Same for me. It seems to be a TP-related bug; it works fine when TP is not enabled (see the workaround sketch below).
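For anyone who needs to keep training while this is investigated, a possible workaround is to launch step 3 with hybrid-engine tensor parallelism disabled. A minimal sketch, assuming the stock DeepSpeed-Chat step-3 launch scripts, where TP is controlled by the --inference_tp_size argument to main.py (flag and argument names taken from those scripts; the model paths below are placeholders, keep whatever your run script already passes):

# Hypothetical step-3 launch with hybrid-engine TP disabled (--inference_tp_size 1).
deepspeed main.py \
   --actor_model_name_or_path <step1-actor-path> \
   --critic_model_name_or_path <step2-critic-path> \
   --enable_hybrid_engine \
   --inference_tp_size 1 \
   --output_dir ./output

Simply dropping --inference_tp_size from the script should have the same effect, since its default is 1.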
I have successfully run step 1 and step 2 and generated the models, but encountered an error when running step 3: "RuntimeError: The size of tensor a (5120) must match the size of tensor b (20480) at non-singleton dimension 0"
Environment: DeepSpeed 0.10.0, CUDA 11.7, PyTorch 1.13.1
Running on 4 × A10 (24 GB)
Run script: run_13b.sh
Error log: