microsoft / DeepSpeedExamples

Example models using DeepSpeed
Apache License 2.0
5.83k stars · 990 forks

enable_hybrid_engine issue #456

Open · llllooong opened this issue 1 year ago

llllooong commented 1 year ago

Error info:

    actor_loss, critic_loss = trainer.train_rlhf(exp_data)
    File "/data/rooter_use/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 173, in train_rlhf
      self.actor_model.step()
    File "/data/rooter_use/conda/envs/llama-env39/lib/python3.9/site-packages/deepspeed/runtime/hybrid_engine.py", line 398, in step
      if (self._inference_containers[0].module.attention.attn_qkvw is not None and \
  IndexError: list index out of range

llllooong commented 1 year ago

Hello, I'm running step 3 with two small llama models for both the actor and the reward part (about 2 GB for the actor and 1 GB for the reward model). I get this error when I set enable_hybrid_engine to True. Is it possible that enable_hybrid_engine does not support llama models, or that it is only available when the model is large?

hijkzzz commented 1 year ago

I also encountered the error with enable_hybrid_engine + bloomz-560m + TP=8

    self._inference_containers[layer_id].apply_tensor_parallelism(
    File "/home/jianh/.local/lib/python3.8/site-packages/deepspeed/module_inject/containers/features/meta_tensor.py", line 27, in apply_tensor_parallelism
      super().apply_tensor_parallelism(mp_replace, mp_group, tp_size)
    File "/home/jianh/.local/lib/python3.8/site-packages/deepspeed/module_inject/containers/base.py", line 206, in apply_tensor_parallelism
      self.attention_qkv_mp(mp_replace, reversed_dim=reversed_dim)
    File "/home/jianh/.local/lib/python3.8/site-packages/deepspeed/module_inject/containers/bloom.py", line 32, in attention_qkv_mp
      self.module.attention.attn_qkvw = mp_replace.copy(self.module.attention.attn_qkvw, self.qkvw)
    File "/home/jianh/.local/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 107, in copy
      dst.data.copy_(src[:, self.gpu_index * dst_shape[self.in_dim]: (self.gpu_index + 1) * dst_shape[self.in_dim]] if inner_dim == 1 else \
  RuntimeError: The size of tensor a (3072) must match the size of tensor b (0) at non-singleton dimension 0

Other ranks fail in the same copy with:

  RuntimeError: unsupported operation: some elements of the input tensor and the written-to tensor refer to a single memory location. Please clone() the tensor before performing the operation.
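The size mismatch comes from the tensor-parallel slicing in replace_module.py shown in the traceback: each rank copies its gpu_index-th column slice of the full weight into its local shard. A toy, list-based sketch of that indexing (not the real torch code; the function name and numbers are illustrative) shows how a zero-sized destination shard yields an empty slice, and hence a copy whose sizes cannot match:

```python
# Toy sketch of the TP sharding index used in mp_replace.copy:
# rank r takes columns [r*shard : (r+1)*shard] of the full weight.
# If the destination shard width is 0 (e.g. the container never
# materialized its weights), the slice is empty, so copying the
# 3072-wide source into it fails with a size mismatch.

def tp_slice(row, gpu_index, shard_width):
    """Columns of one weight row assigned to a tensor-parallel rank."""
    return row[gpu_index * shard_width:(gpu_index + 1) * shard_width]

full_row = list(range(3072))          # one row of a qkv weight
ok = tp_slice(full_row, gpu_index=1, shard_width=3072 // 8)
print(len(ok))                        # 384: a proper 1/8 shard for TP=8

broken = tp_slice(full_row, gpu_index=1, shard_width=0)
print(len(broken))                    # 0: nothing lines up with the dst tensor
```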
mcc311 commented 1 year ago

Same issue here 🙋‍♂️ 🙌

devinzhang91 commented 10 months ago

I met the same issue, with llama2 as the "Actor" and opt-350m as the "Critic".

Since the "Actor" and "Critic" are different models, the problem looks like it is caused by tokenizer encoding: the two tokenizers produce quite different ids for the same text.

I changed the "Actor" and "Critic" so that both models come from the same family. When both "Actor" and "Critic" are llama2, it works.
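The tokenizer mismatch can be illustrated with a toy sketch (the two vocabularies below are invented for illustration, not the real llama2 or opt vocabularies): the same prompt maps to different id sequences, so ids produced under one model's tokenizer index the wrong rows of the other model's embedding table.

```python
# Toy illustration of why pairing models with different tokenizers
# is fragile in step 3: identical text encodes to different ids.
# Both vocabularies are made up for this sketch.

def make_encoder(vocab):
    """Return a whitespace 'tokenizer' over a toy vocabulary."""
    def encode(text):
        return [vocab[tok] for tok in text.split()]
    return encode

actor_vocab = {"the": 5, "cat": 17, "sat": 42}    # one model's ids
critic_vocab = {"the": 2, "cat": 9, "sat": 31}    # another model's ids

actor_encode = make_encoder(actor_vocab)
critic_encode = make_encoder(critic_vocab)

prompt = "the cat sat"
actor_ids = actor_encode(prompt)
critic_ids = critic_encode(prompt)

# The sequences disagree, so actor-produced ids are meaningless
# when looked up against the critic's embedding table.
print(actor_ids)   # [5, 17, 42]
print(critic_ids)  # [2, 9, 31]
```

Using two models from the same family (same tokenizer) sidesteps this entirely.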

GasolSun36 commented 10 months ago

> I met the same issue, with llama2 as the "Actor" and opt-350m as the "Critic". Since the "Actor" and "Critic" are different models, the problem looks like it is caused by tokenizer encoding: the two tokenizers produce quite different ids for the same text. I changed the "Actor" and "Critic" so that both models come from the same family. When both are llama2, it works.

Hi, I used two llama2-7b models as the actor and critic, but it still didn't work. Can you share your run script? Thanks a lot!

devinzhang91 commented 10 months ago

> Hi, I used two llama2-7b models as the actor and critic, but it still didn't work. Can you share your run script? Thanks a lot!

Just the same as the stock scripts in applications/DeepSpeed-Chat/training.
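For reference, a step-3 launch along these lines would use the flags below. This is a hedged sketch based on the stock DeepSpeed-Chat step3_rlhf_finetuning/main.py arguments; the model paths and output directory are placeholders, and remaining hyperparameters are left at the script defaults:

```shell
# Sketch of a step-3 run with the hybrid engine enabled.
# Paths are placeholders; other flags follow the stock script's defaults.
deepspeed applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py \
   --actor_model_name_or_path /path/to/llama2-actor \
   --critic_model_name_or_path /path/to/llama2-critic \
   --actor_zero_stage 3 \
   --critic_zero_stage 3 \
   --enable_hybrid_engine \
   --output_dir ./step3_output
```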

xhwang22 commented 10 months ago

> I met the same issue, with llama2 as the "Actor" and opt-350m as the "Critic". Since the "Actor" and "Critic" are different models, the problem looks like it is caused by tokenizer encoding: the two tokenizers produce quite different ids for the same text. I changed the "Actor" and "Critic" so that both models come from the same family. When both are llama2, it works.
>
> Hi, I used two llama2-7b models as the actor and critic, but it still didn't work. Can you share your run script? Thanks a lot!

Hi, I met the same issue. Have you resolved it?

lucywang720 commented 10 months ago

> I met the same issue, with llama2 as the "Actor" and opt-350m as the "Critic". Since the "Actor" and "Critic" are different models, the problem looks like it is caused by tokenizer encoding: the two tokenizers produce quite different ids for the same text. I changed the "Actor" and "Critic" so that both models come from the same family. When both are llama2, it works.
>
> Hi, I used two llama2-7b models as the actor and critic, but it still didn't work. Can you share your run script? Thanks a lot!
>
> Hi, I met the same issue. Have you resolved it?

Me too... Have you resolved it?

GasolSun36 commented 10 months ago

Latest update: I have solved the issue; see https://github.com/microsoft/DeepSpeed/issues/4229#issuecomment-1704004959