microsoft / DeepSpeedExamples

Example models using DeepSpeed
Apache License 2.0

[DeepSpeedExamples/applications/DeepSpeed-Chat/] Error happened when running step3_rlhf_finetuning in enable_hybrid_engine mode with togethercomputer/GPT-NeoXT-Chat-Base-20B #448

Open GxjGit opened 1 year ago

GxjGit commented 1 year ago

Error info:

  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 99, in new_inference_container
    _container.create_ds_model_config()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/containers/base.py", line 79, in create_ds_model_config
    self.set_hidden_heads(*self.policy.get_hidden_heads())
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/containers/gptneox.py", line 95, in get_hidden_heads
    return self.client_module.attention.query_key_value.weight.shape[1], \
IndexError: tuple index out of range

I printed self.client_module.attention.query_key_value.weight.shape, and the result is torch.Size([0]).
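A possible explanation, not confirmed anywhere in this thread: the script below defaults the actor to ZeRO stage 3, and under ZeRO-3 DeepSpeed partitions each parameter and leaves a placeholder of shape [0] on the module, so indexing shape[1] fails. The sketch below reproduces that failure without torch or DeepSpeed. FakeParam, full_shape, hidden_size_naive and hidden_size_safe are hypothetical names used only for illustration, the ds_shape lookup assumes that ZeRO-3 records the unpartitioned shape under that attribute, and the (18432, 6144) shape is an illustrative value. It is not a patch for gptneox.py.

```python
# torch.Size is a tuple subclass, so a plain tuple reproduces the indexing
# behaviour seen in the traceback without needing torch installed.
class FakeParam:
    """Hypothetical stand-in for a model parameter."""

    def __init__(self, shape, ds_shape=None):
        self.shape = tuple(shape)
        if ds_shape is not None:
            # Assumption: ZeRO-3 keeps the unpartitioned shape in `ds_shape`.
            self.ds_shape = tuple(ds_shape)


def full_shape(param):
    """Return the unpartitioned shape if recorded, else the local shape."""
    return tuple(getattr(param, "ds_shape", param.shape))


def hidden_size_naive(param):
    # Mirrors the failing expression in the traceback: weight.shape[1]
    return param.shape[1]


def hidden_size_safe(param):
    # Read the full shape and fail with a descriptive message if it is not 2-D.
    shape = full_shape(param)
    if len(shape) < 2:
        raise ValueError(f"expected a 2-D weight, got shape {shape}")
    return shape[1]


# A partitioned placeholder: local shape (0,), illustrative full shape.
partitioned = FakeParam(shape=(0,), ds_shape=(18432, 6144))

try:
    hidden_size_naive(partitioned)
    naive_error = None
except IndexError as exc:
    naive_error = str(exc)

safe_hidden = hidden_size_safe(partitioned)
print(naive_error)
print(safe_hidden)
```

If this reading is right, the container code would need to read the full shape, or gather the parameter first, before building the inference config. Whether the hybrid engine supports that for GPT-NeoX is the open question in this issue.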

Does DeepSpeed-Chat support togethercomputer/GPT-NeoXT-Chat-Base-20B with --enable_hybrid_engine?

My launch script is:

#!/bin/bash
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

ACTOR_MODEL_PATH="togethercomputer/GPT-NeoXT-Chat-Base-20B"
CRITIC_MODEL_PATH="togethercomputer/GPT-NeoXT-Chat-Base-20B"

ACTOR_ZERO_STAGE=$3
CRITIC_ZERO_STAGE=$4
OUTPUT=$5
if [ "$OUTPUT" == "" ]; then
    OUTPUT=/home/notebook/data/personal/deepspeed-llama/RLHF
fi
if [ "$ACTOR_ZERO_STAGE" == "" ]; then
    ACTOR_ZERO_STAGE=3
fi
if [ "$CRITIC_ZERO_STAGE" == "" ]; then
    CRITIC_ZERO_STAGE=3
fi
mkdir -p $OUTPUT

Num_Padding_at_Beginning=1 # this is model related

Actor_Lr=9.65e-6
Critic_Lr=5e-6

python -m torch.distributed.launch --nproc_per_node=8 /home/notebook/data/personal/80350607/0472/code/dev/llama/star-acc/StarEngine/nlp/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning ${Num_Padding_at_Beginning} \
   --per_device_train_batch_size 4 \
   --per_device_mini_train_batch_size 4 \
   --generation_batch_numbers 1 \
   --inference_tp_size 1 \
   --tp_gather_partition_size 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 256 \
   --max_prompt_seq_len 256 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --actor_weight_decay 0.1 \
   --critic_weight_decay 0.1 \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --output_dir $OUTPUT \

leuchine commented 1 year ago

I am also encountering this issue. Is there an easy fix? Thanks.

Ricardokevins commented 8 months ago

anything new?