hxdtest opened this issue 1 year ago
Facing the same issue.
Same error here.
Hi, has this problem been solved?
Hi guys, I believe this is because the gathered ZeRO-3 parameters are not being properly saved into the ds inference module.
I've hacked around this issue in several commits in my forked repo, but my testing focused primarily on Llama models, and I only used it for benchmarking throughput. I hope this proves useful in resolving the issue in your own use cases.
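For context, the core of the change is roughly the pattern below. This is a minimal sketch only, with hypothetical `training_module` / `inference_module` names; the real commits do this inside the hybrid engine's parameter containers:

```python
import deepspeed
import torch

def copy_params_into_inference_module(training_module, inference_module):
    # Under ZeRO-3 each rank holds only a 1/world_size partition of every
    # weight, so shape-dependent copies (e.g. tensor-parallel slicing) see
    # the wrong sizes. GatheredParameters temporarily materializes the
    # full tensors; modifier_rank=None means read-only access on all ranks.
    params = list(training_module.parameters())
    with deepspeed.zero.GatheredParameters(params, modifier_rank=None):
        with torch.no_grad():
            for src, dst in zip(training_module.parameters(),
                                inference_module.parameters()):
                # This sketch assumes matching shapes; the actual container
                # code handles TP sharding when inference_tp_size > 1.
                dst.copy_(src.data)
```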
Describe the bug

In DeepSpeed-Chat step 3, a runtime error (`RuntimeError: The size of tensor a (4) must match the size of tensor b (8) at non-singleton dimension 0`) is thrown when `inference_tp_size > 1` and the hybrid engine is enabled. I encountered this bug with the provided 13b training scripts but not with the 1.3b ones. The main differences between the two are that the 13b script sets the ZeRO stage to 3 and `inference_tp_size` to a value larger than 1, while the 1.3b script sets the ZeRO stage to 2 and leaves `inference_tp_size` at its default.
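For reference, the configuration difference that separates the failing case from the working one comes down to these settings (values taken from the description above; variable names follow the training scripts):

```bash
# 1.3b script (works)
ACTOR_ZERO_STAGE=2          # ZeRO stage 2
# --inference_tp_size left at its default (1)

# 13b script (fails)
ACTOR_ZERO_STAGE=3          # ZeRO stage 3
INFERENCE_TP_SIZE=2         # passed as --inference_tp_size 2
```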
Log output

Some parameters of `container.module` are not gathered, and the run fails with:

```
RuntimeError: The size of tensor a (4) must match the size of tensor b (8) at non-singleton dimension 0
```
To Reproduce

- Models: opt-1.3b (actor) + opt-350m (critic)
- GPU: 8x 40GB A100
- Versions: torch 1.12, deepspeed 0.10.0+d6f62217, DeepSpeedExamples f9c3ae05, transformers 4.30.0
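If it helps with reproduction, the environment above can be pinned roughly as follows (the commit hashes are the ones reported; the exact install commands are my assumption):

```bash
pip install torch==1.12.0 transformers==4.30.0
# DeepSpeed at commit d6f62217 (reported as 0.10.0+d6f62217)
pip install git+https://github.com/microsoft/DeepSpeed.git@d6f62217
# DeepSpeedExamples at commit f9c3ae05
git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples && git checkout f9c3ae05
```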
Modify the inputs in `hybrid_engine.py` to work around the problem in https://github.com/microsoft/DeepSpeed/issues/3998, then run the following script:
```bash
ACTOR_MODEL_PATH=$1
CRITIC_MODEL_PATH=$2
DATA_PATH=$3
ACTOR_ZERO_STAGE=$4
CRITIC_ZERO_STAGE=$5
OUTPUT=$6
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./output
fi
if [ "$ACTOR_ZERO_STAGE" == "" ]; then
    ACTOR_ZERO_STAGE=3
fi
if [ "$CRITIC_ZERO_STAGE" == "" ]; then
    CRITIC_ZERO_STAGE=3
fi
mkdir -p $OUTPUT

Num_Padding_at_Beginning=1 # this is model related

Actor_Lr=9.65e-6
Critic_Lr=5e-6
INFERENCE_TP_SIZE=2

deepspeed --master_port 12346 main.py \
   --data_path $DATA_PATH \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 4 \
   --per_device_mini_train_batch_size 4 \
   --generation_batch_numbers 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 256 \
   --max_prompt_seq_len 256 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --disable_actor_dropout \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --inference_tp_size ${INFERENCE_TP_SIZE} \
   --tp_gather_partition_size 2 \
   --enable_ema \
   --output_dir $OUTPUT \
   &> $OUTPUT/training.log
```

Expected behavior

The script runs successfully.
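For completeness, the script above takes its configuration as positional arguments, so an invocation looks something like this (the script name and the model, data, and output paths are placeholders):

```bash
# Arguments: ACTOR_MODEL_PATH CRITIC_MODEL_PATH DATA_PATH
#            ACTOR_ZERO_STAGE CRITIC_ZERO_STAGE OUTPUT
bash run_step3.sh facebook/opt-1.3b facebook/opt-350m \
    /path/to/data 3 3 ./output
```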