microsoft / DeepSpeedExamples

Example models using DeepSpeed

"RuntimeError: The size of tensor a (5120) must match the size of tensor b (20480) at non-singleton dimension 0" in step3 #622

Open · oolongoo opened this issue 1 year ago

oolongoo commented 1 year ago

I have successfully run step 1 and step 2 and generated the models, but encountered an error when running step 3: "RuntimeError: The size of tensor a (5120) must match the size of tensor b (20480) at non-singleton dimension 0"

DeepSpeed 0.10.0, CUDA 11.7, PyTorch 1.13.1

Running on 4 × A10 (24 GB) GPUs.

run script:

# python train.py --step 3 --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node
bash /mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/single_node/run_13b.sh /mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/13b /mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m '' '' /mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/output/step3-models/13b

run_13b.sh:

#!/bin/bash
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team
ACTOR_MODEL_PATH=$1
CRITIC_MODEL_PATH=$2
ACTOR_ZERO_STAGE=$3
CRITIC_ZERO_STAGE=$4
OUTPUT=$5
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./output
fi
if [ "$ACTOR_ZERO_STAGE" == "" ]; then
    ACTOR_ZERO_STAGE=3
fi
if [ "$CRITIC_ZERO_STAGE" == "" ]; then
    CRITIC_ZERO_STAGE=3
fi
mkdir -p $OUTPUT

Num_Padding_at_Beginning=1 # this is model related

Actor_Lr=5e-4
Critic_Lr=5e-6

deepspeed --master_port 12346 main.py \
   --data_path Dahoas/rm-static \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 2 \
   --per_device_mini_train_batch_size 2 \
   --generation_batch_numbers 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 256 \
   --max_prompt_seq_len 256 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --inference_tp_size 2 \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --actor_gradient_checkpointing \
   --disable_actor_dropout \
   --actor_lora_dim 128 \
   --actor_lora_module_name decoder.layers. \
   --output_dir $OUTPUT \
    &> $OUTPUT/training.log

error log:

192.168.1.51: *****************[end] Initialized Reward Model [end] (duration: 10.30s)******************
192.168.1.51: ***** Running training *****
192.168.1.51: Beginning of Epoch 1/1, Total Generation Batches 954
192.168.1.54: Traceback (most recent call last):
192.168.1.54:   File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 521, in <module>
192.168.1.54:     main()
192.168.1.54:   File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 429, in main
192.168.1.54:     out = trainer.generate_experience(batch_prompt['prompt'],
192.168.1.54:   File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 98, in generate_experience
192.168.1.54:     seq = self._generate_sequence(prompts, mask)
192.168.1.54:   File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 73, in _generate_sequence
192.168.1.54:     seq = self.actor_model.module.generate(prompts,
192.168.1.54:   File "/opt/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py", line 207, in generate
192.168.1.51: Traceback (most recent call last):
192.168.1.51:   File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 521, in <module>
192.168.1.54:     self._fuse_lora(self.layer_params[layer_id], self.lora_params[layer_id])
192.168.1.54:   File "/opt/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py", line 137, in _fuse_lora
192.168.1.54:     weight.data += lora_scaling * torch.matmul(lora_left_weight.t(), lora_right_weight.t())
192.168.1.54: RuntimeError: The size of tensor a (5120) must match the size of tensor b (20480) at non-singleton dimension 0
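
For what it's worth, the failing line is the in-place LoRA fusion in hybrid_engine.py, and the broadcast error itself can be reproduced in isolation. The following is only a minimal sketch, not DeepSpeed code: the shapes are assumptions chosen to mirror the error message (5120 is OPT-13b's hidden size, 20480 its FFN width, 128 matches --actor_lora_dim in run_13b.sh), and the real cause is presumably the hybrid engine fusing full-size LoRA factors against a weight slice that has been partitioned under --inference_tp_size 2.

import torch

# Minimal sketch (not DeepSpeed code) of the pattern that fails in _fuse_lora.
# Shapes are assumptions: the fused LoRA delta covers the full 20480-row weight,
# while only a 5120-row slice of that weight is present on this rank.
lora_dim = 128                                   # --actor_lora_dim in run_13b.sh
rows, cols = 20480, 5120                         # illustrative full layer shape
lora_scaling = 1.0

# Factors stored so that .t() is applied before the matmul, mirroring the traceback.
lora_left_weight = torch.zeros(lora_dim, rows)
lora_right_weight = torch.zeros(cols, lora_dim)
weight = torch.zeros(rows // 4, cols)            # partitioned slice: 5120 x 5120

delta = torch.matmul(lora_left_weight.t(), lora_right_weight.t())   # 20480 x 5120
weight.data += lora_scaling * delta
# RuntimeError: The size of tensor a (5120) must match the size of tensor b (20480)
# at non-singleton dimension 0
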
awan-10 commented 1 year ago

@oolongoo -- can you please update to the latest DeepSpeedExamples and DeepSpeed and try again? Some LoRA-related fixes were merged today (https://github.com/microsoft/DeepSpeed/pull/3563), so please try them out and let us know.
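
One quick sanity check before re-running step 3 is to confirm which builds the training environment actually imports. A minimal check, using only the standard __version__ attributes:

import deepspeed
import torch
import transformers

# Print the versions the step-3 run will actually pick up, to confirm the
# upgraded DeepSpeed (with the LoRA fixes from PR 3563) is the one in use.
print("deepspeed   :", deepspeed.__version__)
print("torch       :", torch.__version__)
print("transformers:", transformers.__version__)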

oolongoo commented 1 year ago

I get a new error with the newest master:

192.168.1.51: Traceback (most recent call last):
192.168.1.51:   File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 521, in <module>
192.168.1.51:     main()
192.168.1.51:   File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 429, in main
192.168.1.51:     out = trainer.generate_experience(prompts,
192.168.1.51:   File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 101, in generate_experience
192.168.1.51:     seq = self._generate_sequence(prompts, mask)
192.168.1.51:   File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 73, in _generate_sequence
192.168.1.51:     seq = self.actor_model.module.generate(
192.168.1.51:   File "/opt/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py", line 234, in generate
192.168.1.51:     generate_ret_vals = self._generate(*inputs, **kwargs)
192.168.1.51:   File "/opt/anaconda3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
192.168.1.51:     return func(*args, **kwargs)
192.168.1.51:   File "/opt/anaconda3/lib/python3.10/site-packages/transformers/generation/utils.py", line 1527, in generate
192.168.1.51:     return self.greedy_search(
192.168.1.51:   File "/opt/anaconda3/lib/python3.10/site-packages/transformers/generation/utils.py", line 2349, in greedy_search
192.168.1.51:     outputs = self(
192.168.1.51:   File "/opt/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
192.168.1.51:     result = forward_call(*input, **kwargs)
192.168.1.51:   File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 944, in forward
192.168.1.51:     outputs = self.model.decoder(
192.168.1.51:   File "/opt/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
192.168.1.51:     result = forward_call(*input, **kwargs)
192.168.1.51:   File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 650, in forward
192.168.1.51:     causal_attention_mask = self._prepare_decoder_attention_mask(
192.168.1.51:   File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 551, in _prepare_decoder_attention_mask
192.168.1.51:     expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
192.168.1.51: RuntimeError: The size of tensor a (4) must match the size of tensor b (16) at non-singleton dimension 0
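
This one is also reproducible in isolation as a plain broadcasting failure between the two attention masks. A minimal sketch, not transformers code, with assumed shapes: batch sizes 4 and 16 mirror the error, seq_len 256 matches --max_prompt_seq_len, and which of the two masks carries which batch size in the real run is an assumption. Presumably the hybrid engine's TP path (--inference_tp_size 2) hands the decoder inputs whose batch dimensions disagree.

import torch

# Minimal sketch (not transformers code): the padding mask expanded from
# attention_mask and the causal mask built from the input shape disagree in
# batch size, so the element-wise sum in _prepare_decoder_attention_mask fails.
seq_len = 256                                                    # --max_prompt_seq_len
expanded_attn_mask = torch.zeros(4, 1, seq_len, seq_len)         # batch 4
combined_attention_mask = torch.zeros(16, 1, seq_len, seq_len)   # batch 16

causal_attention_mask = expanded_attn_mask + combined_attention_mask
# RuntimeError: The size of tensor a (4) must match the size of tensor b (16)
# at non-singleton dimension 0
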
haolin-nju commented 1 year ago

(Quotes oolongoo's comment above, including the same traceback ending in "RuntimeError: The size of tensor a (4) must match the size of tensor b (16) at non-singleton dimension 0".)

Got the same error.

hch1017 commented 1 year ago

Similar here: the mismatch in my case is between 6144 and 8192.

LSC527 commented 1 year ago

Same for me. It seems to be a TP-related bug; it works fine when TP is not enabled.