microsoft / DeepSpeedExamples

Example models using DeepSpeed
Apache License 2.0

[bug]AttributeError: 'DeepSpeedHybridEngine' object has no attribute 'mp_group' #525

Open qingchu123 opened 1 year ago

qingchu123 commented 1 year ago

My training environment is a Docker image pulled from deepspeed/deepspeed:v072_torch112_cu117, and I run it inside an overlay Docker network with:

docker run -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --network train-net --name fuyx-work -v /home/fuyx/big_disk_1000/DeepSpeedExamples/applications/DeepSpeed-Chat:/root/DeepSpeed-Chat b1d

After completing the previous two steps, I run the last step with:

python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type multi_node --step 3

My hostfile is:

jes-work slots=1
fuyx-work slots=1

and I get this error:

jes-work: Traceback (most recent call last):
jes-work:   File "main.py", line 522, in <module>
jes-work:     main()
jes-work:   File "main.py", line 390, in main
jes-work:     rlhf_engine = DeepSpeedRLHFEngine(
jes-work:   File "/root/DeepSpeed-Chat/training/step3_rlhf_finetuning/rlhf_engine.py", line 48, in __init__
jes-work:     self.actor = self._init_actor(
jes-work:   File "/root/DeepSpeed-Chat/training/step3_rlhf_finetuning/rlhf_engine.py", line 119, in _init_actor
jes-work:     actor_engine, *_ = deepspeed.initialize(model=actor_model,
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 153, in initialize
jes-work:     engine = DeepSpeedHybridEngine(args=args,
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 52, in __init__
jes-work:     self.create_inference_module()
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 359, in create_inference_module
jes-work:     self.create_inference_containers(self.module)
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 308, in create_inference_containers
jes-work:     self.create_inference_containers(child, layer_id=layer_id)
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 308, in create_inference_containers
jes-work:     self.create_inference_containers(child, layer_id=layer_id)
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 308, in create_inference_containers
jes-work:     self.create_inference_containers(child, layer_id=layer_id)
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 288, in create_inference_containers
jes-work:     self._inference_containers.append(self.inference_policies[child.__class__][0](
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 107, in new_inference_container
jes-work:     _container.set_tensor_parallel_config(self._config.hybrid_engine.inference_tp_size, self.mp_group)
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 461, in __getattr__
jes-work:     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
jes-work: AttributeError: 'DeepSpeedHybridEngine' object has no attribute 'mp_group'

The deepspeed command is below; I haven't changed anything except reducing some batch sizes to ease GPU memory pressure:

deepspeed --master_port 12346 \
    --hostfile=hostfile \
    main.py \
    --data_path Dahoas/rm-static \
    --data_split 2,4,4 \
    --actor_model_name_or_path $ACTOR_MODEL_PATH \
    --critic_model_name_or_path $CRITIC_MODEL_PATH \
    --num_padding_at_beginning 1 \
    --per_device_train_batch_size 1 \
    --per_device_mini_train_batch_size 1 \
    --generation_batch_numbers 1 \
    --ppo_epochs 1 \
    --max_answer_seq_len 256 \
    --max_prompt_seq_len 256 \
    --actor_learning_rate ${Actor_Lr} \
    --critic_learning_rate ${Critic_Lr} \
    --actor_weight_decay 0.1 \
    --critic_weight_decay 0.1 \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --gradient_accumulation_steps 1 \
    --num_warmup_steps 100 \
    --deepspeed --seed 1234 \
    --enable_hybrid_engine \
    --inference_tp_size 8 \
    --tp_gather_partition_size 4 \
    --actor_zero_stage $ACTOR_ZERO_STAGE \
    --critic_zero_stage $CRITIC_ZERO_STAGE \
    --actor_gradient_checkpointing \
    --disable_actor_dropout \
    --actor_lora_dim 128 \
    --actor_lora_module_name decoder.layers. \
    --output_dir $OUTPUT \
    &> $OUTPUT/training.log
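For context, a rough sketch of how the hybrid-engine flags above feed into the DeepSpeed config that main.py assembles (the actual mapping lives in DeepSpeed-Chat's ds_utils.py; the key names below are my assumption, not a verified copy of that code):

# Sketch only -- the real config is built by DeepSpeed-Chat; key names are assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # --per_device_train_batch_size
    "gradient_accumulation_steps": 1,      # --gradient_accumulation_steps
    "hybrid_engine": {
        "enabled": True,                   # --enable_hybrid_engine
        "inference_tp_size": 8,            # --inference_tp_size
        "tp_gather_partition_size": 4,     # --tp_gather_partition_size
    },
}

It is this inference_tp_size value that DeepSpeedHybridEngine passes to set_tensor_parallel_config in the traceback above.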
jomayeri commented 1 year ago

Hi @qingchu123, could you report which version of DeepSpeed you are running?

qingchu123 commented 1 year ago

@jomayeri I ran pip show deepspeed and it shows:

Name: deepspeed
Version: 0.9.3+5c6da1f0
Summary: DeepSpeed library
Home-page: http://deepspeed.ai
Author: DeepSpeed Team
Author-email: deepspeed-info@microsoft.com
License: Apache Software License 2.0
Location: /opt/conda/lib/python3.8/site-packages
Requires: hjson, ninja, numpy, packaging, psutil, py-cpuinfo, pydantic, torch, tqdm

I installed the latest DeepSpeed from git (commit from May 13, 2023, SHA 5c6da1f001f936234a31a238e71ca386e34eb51a).
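(The +5c6da1f0 suffix in the version string already reflects that commit. If it helps, the installed version can also be checked from Python, as a quick sanity check:)

import deepspeed
print(deepspeed.__version__)  # prints something like "0.9.3+5c6da1f0" for this install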

jomayeri commented 1 year ago

@qingchu123 try adjusting --inference_tp_size to a lower number; you may not have enough GPUs across your nodes.
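To make that concrete: the hostfile above provides only two slots in total (one per node), so a tensor-parallel group of size 8 cannot be formed. A minimal sketch of the sanity check, in plain Python with a simplified hostfile parser (the helper name is made up for illustration):

def total_slots(hostfile_path):
    # Sum "slots=<n>" over every non-empty, non-comment line of a
    # DeepSpeed-style hostfile ("<hostname> slots=<n>").
    total = 0
    with open(hostfile_path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                total += int(line.split("slots=")[1])
    return total

world_size = total_slots("hostfile")  # 1 + 1 = 2 for the hostfile in this issue
inference_tp_size = 8
if inference_tp_size > world_size or world_size % inference_tp_size != 0:
    print(f"inference_tp_size={inference_tp_size} does not fit {world_size} GPUs; "
          f"use a value that divides {world_size}.")

With two GPUs, --inference_tp_size 1 or 2 would be the workable choices.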

kkk935208447 commented 4 months ago

try adjusting the --inference_tp_size to a lower number, it may be you don't have enough GPUs across your nodes.

Thanks, it works.