microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Running llama2-7b step3 with tensor parallel and HE fails due to incompatible shapes #5656

Open ShellyNR opened 2 weeks ago

ShellyNR commented 2 weeks ago

Hi, I get an error when trying to run step 3 of llama2-7b with tensor parallelism. The error happens in _merge_qkv:

    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1569, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 1038, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1569, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 925, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1569, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/software/users/snahir1/DeepSpeed/deepspeed/model_implementations/transformers/ds_transformer.py", line 171, in forward
    self.attention(input,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1528, in _call_impl
    return forward_call(*args, **kwargs)
  File "/software/users/snahir1/DeepSpeed/deepspeed/ops/transformer/inference/ds_attention.py", line 141, in forward
    self._attn_qkvw, self._attn_qkvb = self._merge_qkv()
  File "/software/users/snahir1/DeepSpeed/deepspeed/ops/transformer/inference/ds_attention.py", line 118, in _merge_qkv
    qvkw[:self.hidden_size_per_partition, :] = self.attn_qw  # type: ignore
RuntimeError: The expanded size of the tensor (4096) must match the existing size (0) at non-singleton dimension 1. Target sizes: [512, 4096]. Tensor sizes: [0]

The target slice size is [512, 4096], while self.attn_qw has size 0. self.attn_qw is initialized to None in DeepSpeedSelfAttention when the actor model is initialized. Maybe the issue is in HybridSplitQKVContainer, specifically in the implementation of set_q_k_v()?
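
For reference, the failing assignment can be reproduced with plain PyTorch alone. This is only a standalone sketch, not the DeepSpeed code path: the shapes come from the error message (512 = 4096 / tp_size 8), and the empty tensor stands in for an attn_qw that was never populated:

```python
import torch

# Standalone sketch of the _merge_qkv assignment (not the actual DeepSpeed code).
# hidden_size_per_partition = 4096 / inference_tp_size (8) = 512, as in the error;
# the zero-sized tensor mimics an attn_qw that set_q_k_v() never filled in.
hidden_size = 4096
hidden_size_per_partition = hidden_size // 8

qkvw = torch.empty(3 * hidden_size_per_partition, hidden_size)
attn_qw = torch.empty(0)  # unset / empty weight

qkvw[:hidden_size_per_partition, :] = attn_qw
# RuntimeError: The expanded size of the tensor (4096) must match the existing
# size (0) at non-singleton dimension 1. Target sizes: [512, 4096]. Tensor sizes: [0]
```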

These specific logs are for batch_size=1 and total_batch_size=8. Originally I ran the model with batch_size=4 and total_batch_size=32, and it failed earlier, when trying to combine the regular attention mask with the causal attention mask, at:

modeling_llama.py, line 841, in _prepare_decoder_attention_mask:
    expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask

The regular attention mask is calculated in the data loader with dim[0] equal to the batch size. The causal mask is created later, in the _make_causal_mask function of the transformers library, which takes its shape from the "input_ids" argument. The input_ids tensor is created in the hybrid engine, which uses it as the destination tensor of the dist.all_gather_into_tensor operation (hence its size is batch_size * tp_size). Therefore dim[0] of the causal mask equals total_batch_size, and the two masks no longer line up.
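
To illustrate that mismatch, here is a minimal standalone sketch in plain PyTorch (the sequence length is an arbitrary assumption; only the dim[0] values reflect the reported setup):

```python
import torch

# Standalone sketch (seq_len is assumed; only dim 0 matches the reported run).
# The padding mask keeps the per-rank batch size, while the causal mask is built
# from input_ids gathered across the tp ranks, so its dim 0 is batch_size * tp_size.
batch_size, tp_size, seq_len = 4, 8, 256

expanded_attn_mask = torch.zeros(batch_size, 1, seq_len, seq_len)                  # from the data loader
combined_attention_mask = torch.zeros(batch_size * tp_size, 1, seq_len, seq_len)   # from gathered input_ids

expanded_attn_mask + combined_attention_mask
# RuntimeError: The size of tensor a (4) must match the size of tensor b (32)
# at non-singleton dimension 0
```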

To Reproduce
Add --enable_hybrid_engine and --inference_tp_size 8 to training_scripts/llama2/run_llama2_7b.sh.