microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] container does not gather some module parameters #4469

Open hxdtest opened 1 year ago

hxdtest commented 1 year ago

Describe the bug
In DeepSpeed-Chat step 3, a runtime error ("The size of tensor a (4) must match the size of tensor b (8) at non-singleton dimension 0") is thrown when inference_tp_size > 1 and the hybrid engine is enabled. I hit this bug with the provided 13b training script but not with the 1.3b one. The main differences between the two scripts are that the 13b script sets zero stage to 3 and inference_tp_size to a value larger than 1, while the 1.3b script sets zero stage to 2 and leaves inference_tp_size at its default.

Log output

  File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/ds_attention.py", line 138, in forward
RuntimeError: The expanded size of the tensor (1024) must match the existing size (0) at non-singleton dimension 1.  Target sizes: [128, 1024].  Tensor sizes: [0]
    self._attn_qkvw, self._attn_qkvb = self._merge_qkv()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/ds_attention.py", line 115, in _merge_qkv
    qvkw[:self.hidden_size_per_partition, :] = self.attn_qw  # type: ignore
RuntimeError: The expanded size of the tensor (1024) must match the existing size (0) at non-singleton dimension 1.  Target sizes: [128, 1024].  Tensor sizes: [0]
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/ds_attention.py", line 138, in forward
    self._attn_qkvw, self._attn_qkvb = self._merge_qkv()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/ds_attention.py", line 115, in _merge_qkv
    qvkw[:self.hidden_size_per_partition, :] = self.attn_qw  # type: ignore
RuntimeError: The expanded size of the tensor (1024) must match the existing size (0) at non-singleton dimension 1.  Target sizes: [128, 1024].  Tensor sizes: [0]
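
To make the error concrete, here is a toy reproduction of the failing copy in _merge_qkv (the shapes are illustrative only, not the real model's): the destination slice expects a [128, 1024] block, but the source tensor is the 0-element placeholder that ZeRO-3 leaves behind when a weight has not been gathered.

    import torch

    qkvw = torch.empty(384, 1024)   # stands in for the pre-allocated fused QKV buffer
    attn_qw = torch.empty(0)        # stands in for the un-gathered attention weight

    try:
        qkvw[:128, :] = attn_qw     # same broadcast failure as in the log above
    except RuntimeError as e:
        print(e)  # "The expanded size of the tensor (1024) must match the existing size (0) ..."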

Some parameters of container.module are not gathered:

# Debug: print every parameter held by each DeepSpeed inference container
for i in self._inference_container:
    print(i.get_all_params())

The result is:

tensor([1., 1., 1.,  ..., 1., 1., 1.], device='cuda:0', dtype=torch.bfloat16,
       requires_grad=True), Parameter containing:
tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0', dtype=torch.bfloat16,
       requires_grad=True), Parameter containing:
tensor([1., 1., 1.,  ..., 1., 1., 1.], device='cuda:0', dtype=torch.bfloat16,
       requires_grad=True), Parameter containing:
tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0', dtype=torch.bfloat16,
       requires_grad=True), Parameter containing:
tensor([], device='cuda:0', dtype=torch.bfloat16, requires_grad=True), Parameter containing:
tensor([], device='cuda:0', dtype=torch.bfloat16, requires_grad=True), Parameter containing:
tensor([], device='cuda:0', dtype=torch.bfloat16), Parameter containing:
tensor([], device='cuda:0', dtype=torch.bfloat16, requires_grad=True), Parameter containing:
tensor([], device='cuda:0', dtype=torch.bfloat16), Parameter containing:
tensor([], device='cuda:0', dtype=torch.bfloat16, requires_grad=True), Parameter containing:
tensor([], device='cuda:0', dtype=torch.bfloat16), Parameter containing:
tensor([], device='cuda:0', dtype=torch.bfloat16, requires_grad=True), Parameter containing:
tensor([], device='cuda:0', dtype=torch.bfloat16), Parameter containing:
tensor([], device='cuda:0', dtype=torch.bfloat16, requires_grad=True), Parameter containing:
tensor([], device='cuda:0', dtype=torch.bfloat16), Parameter containing:
tensor([], device='cuda:0', dtype=torch.bfloat16, requires_grad=True), Parameter containing:
tensor([], device='cuda:0', dtype=torch.bfloat16), Parameter containing:
tensor([], device='cuda:0', dtype=torch.bfloat16, requires_grad=True)]
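
The empty tensors look like ZeRO-3 partitions that were never all-gathered on this rank. A minimal sanity check under that assumption (actor_model is a placeholder for the ZeRO-3-wrapped Hugging Face module, not a name from the script) is to gather the parameters explicitly and confirm nothing stays empty:

    import deepspeed

    # All-gather the partitioned ZeRO-3 parameters (read-only access, so modifier_rank=None)
    # and check that no parameter is left as a 0-element shard.
    params = list(actor_model.parameters())  # actor_model: placeholder for the HF module
    with deepspeed.zero.GatheredParameters(params, modifier_rank=None):
        empty = [n for n, p in actor_model.named_parameters() if p.numel() == 0]
        print("still empty after gather:", empty)  # expected: []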

To Reproduce
Models: opt-1.3b + opt-350m. GPUs: 8x 40GB A100. torch 1.12, deepspeed 0.10.0+d6f62217, DeepSpeedExamples f9c3ae05, transformers 4.30.0.

Modify inputs in hybrid_engine.py to work around the problem in https://github.com/microsoft/DeepSpeed/issues/3998:

    # work around #3998 by truncating the generation inputs to the first 4 sequences
    inputs = (inputs[0][0:4], )

# DeepSpeed Team

ACTOR_MODEL_PATH=$1
CRITIC_MODEL_PATH=$2
DATA_PATH=$3
ACTOR_ZERO_STAGE=$4
CRITIC_ZERO_STAGE=$5
OUTPUT=$6
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./output
fi
if [ "$ACTOR_ZERO_STAGE" == "" ]; then
    ACTOR_ZERO_STAGE=3
fi
if [ "$CRITIC_ZERO_STAGE" == "" ]; then
    CRITIC_ZERO_STAGE=3
fi
mkdir -p $OUTPUT

Num_Padding_at_Beginning=1 # this is model related

Actor_Lr=9.65e-6
Critic_Lr=5e-6
INFERENCE_TP_SIZE=2

deepspeed --master_port 12346 main.py \
   --data_path $DATA_PATH \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 4 \
   --per_device_mini_train_batch_size 4 \
   --generation_batch_numbers 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 256 \
   --max_prompt_seq_len 256 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --disable_actor_dropout \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --inference_tp_size ${INFERENCE_TP_SIZE} \
   --tp_gather_partition_size 2 \
   --enable_ema \
   --output_dir $OUTPUT \
   &> $OUTPUT/training.log

Expected behavior
Run the script successfully.

LSC527 commented 1 year ago

Facing the same issue.

Little-rookie-ee commented 11 months ago

Same error.

lusongshuo-mt commented 11 months ago

Hi, is the problem solved?

sunxiaojie99 commented 9 months ago

Is the problem solved?

garrett4wade commented 8 months ago

Hi guys, I believe this is because the gathered ZeRO-3 parameters are not being properly saved into the ds inference module.

I've hacked around this issue in several commits in my forked repo, but my testing focused primarily on Llama models, and I only used it for benchmarking throughput. I hope this information proves useful in resolving the issue in your own use cases.
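
For anyone digging further, here is a rough sketch of that idea (layer, container, and the projection attribute names are hypothetical placeholders, not the actual code in the fork): while a layer's ZeRO-3 shards are gathered, push the full attention weights into the inference container so _merge_qkv no longer sees 0-element tensors.

    import deepspeed
    import torch

    def push_gathered_weights(layer, container):
        # Hypothetical helper: while the layer's ZeRO-3 shards are all-gathered,
        # hand full copies of the attention weights to the inference container.
        with deepspeed.zero.GatheredParameters(list(layer.parameters()), modifier_rank=None):
            with torch.no_grad():
                container.attn_qw = layer.self_attn.q_proj.weight.detach().clone()
                container.attn_kw = layer.self_attn.k_proj.weight.detach().clone()
                container.attn_vw = layer.self_attn.v_proj.weight.detach().clone()

The real hybrid engine routes this through its own gather/partition hooks; the point of the sketch is only that the container must see fully gathered tensors before inference runs.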