Closed: kebijuelun closed this issue 1 year ago.
I also ran into errors when training the RL model with accelerate and deepspeed; the error differs depending on the transformers version.
with transformers 4.28.1
Traceback (most recent call last):
File "/data/public/aic/lwz/lwz_code/trl/examples/stack_llama/scripts/rl_training.py", line 266, in <module>
pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
File "/root/code/transformers/src/transformers/pipelines/text_classification.py", line 155, in __call__
result = super().__call__(*args, **kwargs)
File "/root/code/transformers/src/transformers/pipelines/base.py", line 1090, in __call__
outputs = list(final_iterator)
File "/root/code/transformers/src/transformers/pipelines/pt_utils.py", line 125, in __next__
processed = self.infer(item, **self.params)
File "/root/code/transformers/src/transformers/pipelines/text_classification.py", line 214, in postprocess
dict_scores = [
File "/root/code/transformers/src/transformers/pipelines/text_classification.py", line 215, in <listcomp>
{"label": self.model.config.id2label[i], "score": score.item()} for i, score in enumerate(scores)
ValueError: can only convert an array of size 1 to a Python scalar
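For the 4.28.1 ValueError it may be worth double-checking the sentiment pipeline kwargs, since `return_all_scores=True` was deprecated in favor of `top_k=None` and the text-classification postprocessing changed across versions. As a version-independent workaround, here is a minimal sketch (not the official fix) that bypasses the pipeline's postprocessing and scores the texts with the reward model directly; `reward_model`, `reward_tokenizer`, and `texts` are assumed names standing in for the objects behind `sentiment_pipe` in rl_training.py, and it assumes a single-label sequence-classification reward model:

```python
import torch

@torch.no_grad()
def score_texts(texts, reward_model, reward_tokenizer, device="cuda", batch_size=16):
    """Score texts with the reward model directly, avoiding pipeline postprocess."""
    rewards = []
    for i in range(0, len(texts), batch_size):
        batch = reward_tokenizer(
            texts[i : i + batch_size],
            padding=True,
            truncation=True,
            return_tensors="pt",
        ).to(device)
        logits = reward_model(**batch).logits  # shape: (batch, num_labels)
        # For a single-label reward model, the raw logit is used as the reward.
        rewards.extend(logits[:, 0].tolist())
    return rewards
```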
with transformers 4.30.0.dev0
/root/miniconda3/lib/python3.10/contextlib.py:79 in inner

    76       @wraps(func)
    77       def inner(*args, **kwds):
    78           with self._recreate_cm():
❱   79               return func(*args, **kwds)
    80       return inner

/data/public/aic/lwz/lwz_code/trl/trl/trainer/ppo_trainer.py:859 in batched_forward_pass

   856           input_ids = input_kwargs["input_ids"]
   857           attention_mask = input_kwargs["attention_mask"]
   858
❱  859           logprobs = logprobs_from_logits(logits[:, :-1, :], input_ids[:, 1:])
   860           masks = torch.zeros_like(attention_mask)
   861           masks[:, :-1] = attention_mask[:, 1:]

/data/public/aic/lwz/lwz_code/trl/trl/core.py:106 in logprobs_from_logits

   103       # if gt_device != pred_device:
   104       #     labels = labels.to(pred_device)
   105       # print("after: pred_device: {} gt_device: {}".format(logp.device, labels.device))
❱  106       logpy = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
   107       return logpy
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
(when checking argument for argument index in method wrapper_gather)
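The cuda:0 / cuda:1 mismatch is raised inside `torch.gather`, where the log-probabilities and the labels end up on different GPUs (the commented-out lines in the traceback show an attempt at exactly this). A defensive patch sketch for `trl/core.py` (it works around the symptom but is not necessarily the root-cause fix for the ZeRO-3 placement issue) moves the labels onto the logits' device right before the gather:

```python
import torch
import torch.nn.functional as F

def logprobs_from_logits(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    logp = F.log_softmax(logits, dim=2)
    # Under ZeRO-3 / model parallelism the labels can land on a different GPU
    # than the logits; torch.gather requires both tensors on the same device.
    labels = labels.to(logp.device)
    return torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
```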
Could anyone share an example deepspeed zero3 config and the exact environment (package versions) that can train the reward model and the RL stage successfully? Thanks.
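For reference, a minimal ZeRO-3 config of the kind used with the HF/accelerate DeepSpeed integration might look like the sketch below. This is only an illustrative starting point, not a verified working setup for stack_llama; the `"auto"` values assume the config is filled in by the Trainer/accelerate integration and would need concrete numbers if passed straight to `deepspeed.initialize`:

```json
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "steps_per_print": 10,
  "wall_clock_breakdown": false
}
```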
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Hi, I hit the following error when training a llama7b/30b model with deepspeed zero3 (the errors for llama7b and llama30b are similar).
training cmd
deepspeed ds_config_zero3.json
env
Does anyone have an idea about this error, or can anyone provide an example deepspeed config that runs successfully? Thanks.
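As a complement to the config sketch above, one generic way to attach such a JSON file when building an Accelerator yourself is accelerate's `DeepSpeedPlugin`. This is a sketch only: TRL's `PPOTrainer` constructs its own Accelerator internally, so in practice the config is usually wired up through `accelerate config` / `accelerate launch`, and the script still has to be started with a distributed launcher for DeepSpeed to initialize. The file name `ds_config_zero3.json` is simply the one mentioned in the command above:

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Sketch: point accelerate at an existing HF-style DeepSpeed JSON config.
ds_plugin = DeepSpeedPlugin(hf_ds_config="ds_config_zero3.json")
accelerator = Accelerator(deepspeed_plugin=ds_plugin)
# zero3_init (partitioned model init) can additionally be enabled via `accelerate config`.
```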