microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

self.client_module.attn.q_proj.weight.shape[1] returns IndexError: tuple index out of range #3287

Open publicstaticvo opened 1 year ago

publicstaticvo commented 1 year ago

Describe the bug
I am getting the following error while attempting to run DeepSpeed-Chat step 3 with the actor model CarperAI/openai_summarize_tldr_sft (GPT-J 6B), the critic model CarperAI/openai_summarize_tldr_rm_checkpoint (GPT-J 6B), and ZeRO stage 3.

```
Traceback (most recent call last):
  File "main.py", line 523, in <module>
    main()
  File "main.py", line 394, in main
    rlhf_engine = DeepSpeedRLHFEngine(
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step3_rlhf_finetuning/rlhf_engine.py", line 49, in __init__
    self.actor = self._init_actor(actor_model_name_or_path=actor_model_name_or_path)
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step3_rlhf_finetuning/rlhf_engine.py", line 115, in _init_actor
    actor_engine, *_ = deepspeed.initialize(model=actor_model,
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/__init__.py", line 144, in initialize
    engine = DeepSpeedHybridEngine(args=args,
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/runtime/hybrid_engine.py", line 52, in __init__
    self.create_inference_module()
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/runtime/hybrid_engine.py", line 326, in create_inference_module
    self.create_inference_containers(self.module)
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/runtime/hybrid_engine.py", line 296, in create_inference_containers
    self.create_inference_containers(child, layer_id=layer_id)
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/runtime/hybrid_engine.py", line 296, in create_inference_containers
    self.create_inference_containers(child, layer_id=layer_id)
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/runtime/hybrid_engine.py", line 276, in create_inference_containers
    self._inference_containers.append(self.inference_policies[child.__class__][0](
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/runtime/hybrid_engine.py", line 99, in new_inference_container
    _container.create_ds_model_config()
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/module_inject/containers/base.py", line 79, in create_ds_model_config
    self.set_hidden_heads(*self.policy.get_hidden_heads())
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/module_inject/containers/gptj.py", line 73, in get_hidden_heads
    return self.client_module.attn.q_proj.weight.shape[1], \
IndexError: tuple index out of range
```

Adding print(self.client_module.attn.q_proj.weight) and print(self.client_module.attn.q_proj.weight.shape) right above the return self.client_module.attn.q_proj.weight.shape[1] line prints Parameter containing: tensor([], device='cuda:0', dtype=torch.float16, requires_grad=True) and torch.Size([0]). It seems that the model's parameters are missing during the initialization of the DeepSpeed engine; see the sketch below.
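The failure itself is easy to reproduce in isolation. A minimal sketch (not DeepSpeed code) just showing that indexing shape[1] on a parameter whose storage has been emptied raises exactly this IndexError:

```python
import torch

# Minimal illustration of the symptom (not DeepSpeed code): once a parameter's
# storage has been released to torch.empty(0), the tensor no longer carries the
# original 2-D shape, so shape[1] is out of range.
q_proj_weight = torch.nn.Parameter(torch.empty(0, dtype=torch.float16))
print(q_proj_weight)           # Parameter containing: tensor([], dtype=torch.float16, requires_grad=True)
print(q_proj_weight.shape)     # torch.Size([0])
print(q_proj_weight.shape[1])  # IndexError: tuple index out of range
```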

ds_report output


```
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/data/nt12_ssd_gluster/myself/miniconda3/lib/python3.8/site-packages/torch']
torch version .................... 1.10.0+cu113
deepspeed install path ........... ['/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed']
deepspeed info ................... 0.9.1+cc67f22f, cc67f22f, master
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 1.10, cuda 11.3
```

System info (please complete the following information):

Additional context
I checked the DeepSpeed source code and found two free_param(param) calls in deepspeed/runtime/zero/partition_parameters.py, at lines 1115 and 1186, where the parameters are replaced with torch.empty(0). It seems that the parameters are not restored after this operation and remain empty until the error above occurs.
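As far as I understand, this is how ZeRO stage 3 normally behaves: each rank keeps only its partition of every parameter and the module-level tensor is an empty placeholder, so any code that reads weight.shape directly would first have to gather the parameter. A minimal sketch of that idea using the public deepspeed.zero.GatheredParameters context manager (illustrative only, not what the Hybrid Engine containers currently do):

```python
import torch
import deepspeed

# Sketch: reading a weight's full shape when the model may be partitioned by ZeRO-3.
def read_hidden_size(q_proj: torch.nn.Linear) -> int:
    param = q_proj.weight
    if hasattr(param, "ds_id"):
        # The parameter is managed by ZeRO-3, so the local tensor may be torch.empty(0).
        # Gather the full weight just long enough to read its shape, then re-partition.
        with deepspeed.zero.GatheredParameters(param, modifier_rank=None):
            return param.shape[1]
    return param.shape[1]  # not partitioned: the local tensor already has the real shape
```

If I understand correctly, ZeRO-3 also records the original shape on the partitioned parameter (param.ds_shape), which would avoid the gather entirely, but I have not checked whether the containers can rely on that.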

publicstaticvo commented 1 year ago

@cmikeh2 this is another error with zero_stage=3

SAXSUN commented 1 year ago

@publicstaticvo Hello, I also have this error. May I ask if you have solved it?

publicstaticvo commented 1 year ago

I deleted free_param(param) at line 1115 of deepspeed/runtime/zero/partition_parameters.py and it seems to work, but I don't know whether it is the right solution. For example, I later hit a CUDA out-of-memory error, and I don't know whether this change was the cause.
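For context, a schematic of what the issue description says free_param does (not DeepSpeed's actual implementation), which is why I suspect skipping it costs memory:

```python
import torch

# Schematic of the behaviour described above (not DeepSpeed's code): after ZeRO-3
# copies a parameter's partition out, the module-level tensor is released by
# pointing it at an empty tensor.
def free_param_schematic(param: torch.nn.Parameter) -> None:
    param.data = torch.empty(0, dtype=param.dtype, device=param.data.device)

# Skipping this release keeps the full fp16 tensor alive on every rank, which
# could plausibly account for the extra GPU memory use I saw.
```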

@SAXSUN Which base model are you using? In another issue (#3284), cmikeh2 said the project currently only supports Meta's OPT models.

SAXSUN commented 1 year ago

actor_model: dolly7b, critic_model: opt350m

ciayomin commented 1 year ago

When will this bug be fixed?