publicstaticvo opened 1 year ago
@cmikeh2 this is another error with zero_stage=3
@publicstaticvo Hello, I also have this error. May I ask if you have solved it?
I deleted the `free_param(param)` call at line 1115 of `deepspeed/runtime/zero/partition_parameters.py` and it seems to work, but I don't know if it is the right solution. For example, I ran into a CUDA out-of-memory error later, and I don't know if that was the cause.
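A pure-Python mock (no DeepSpeed required; all names here are illustrative, not DeepSpeed's actual internals) of the ZeRO-3 partition/free/gather lifecycle may help explain both symptoms in this thread: `free_param` drops the full tensor and keeps only the local shard, so code that reads the full view afterwards sees an empty tensor, while removing the `free_param` call leaves every rank's full copy resident, which is consistent with hitting CUDA OOM later.

```python
class MockParam:
    """Stand-in for a ZeRO-3 partitioned parameter (lists instead of tensors)."""

    def __init__(self, full, rank, world_size):
        self.full = full                                   # full tensor view
        shard = len(full) // world_size
        self.local = full[rank * shard:(rank + 1) * shard] # this rank's shard

    def free(self):
        # What free_param does conceptually: drop the full tensor and keep
        # only the shard, so the full view becomes empty until re-gathered.
        self.full = []

    def gather(self, shards):
        # All-gather rebuilds the full tensor from every rank's shard.
        self.full = [x for s in shards for x in s]


p = MockParam([1, 2, 3, 4], rank=0, world_size=2)
p.free()
assert p.full == []             # the torch.Size([0]) symptom in this issue
p.gather([[1, 2], [3, 4]])
assert p.full == [1, 2, 3, 4]   # reading .shape is only safe after a gather
```

In this model, skipping `free()` avoids the empty view but means the full copy is never released, which is why deleting the call can trade one failure for an out-of-memory one.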
@SAXSUN which base model are you using? cmikeh2 said the project currently only supports Meta OPT models in another issue (#3284).
actor_model dolly7b, critic_model opt350m
when will this bug be solved?
Describe the bug
I am getting the following error while attempting to run DeepSpeed-Chat step 3 with the actor model CarperAI/openai_summarize_tldr_sft (gpt-j 6B), the critic model CarperAI/openai_summarize_tldr_rm_checkpoint (gpt-j 6B), and ZeRO stage 3.
Adding

```python
print(self.client_module.attn.q_proj.weight)
print(self.client_module.attn.q_proj.weight.shape)
```

right above `return self.client_module.attn.q_proj.weight.shape[1]` gives the output

```
Parameter containing: tensor([], device='cuda:0', dtype=torch.float16, requires_grad=True)
torch.Size([0])
```

It seems that the parameters of the model are missing during the initialization of the DeepSpeed engine.

ds_report output
```
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed.
Op compatibility means that your system meets the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/data/nt12_ssd_gluster/myself/miniconda3/lib/python3.8/site-packages/torch']
torch version .................... 1.10.0+cu113
deepspeed install path ........... ['/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed']
deepspeed info ................... 0.9.1+cc67f22f, cc67f22f, master
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 1.10, cuda 11.3
```
System info (please complete the following information):
Additional context
I checked the source code of DeepSpeed and found two `free_param(param)` calls in `deepspeed/runtime/zero/partition_parameters.py`, at lines 1115 and 1186, where the parameters are turned into `torch.empty(0)`. It seems that the params aren't restored after this operation, and they remain empty until the above error occurs.
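For what it's worth, DeepSpeed's documented way to read a ZeRO-3 partitioned parameter is to re-gather it with `deepspeed.zero.GatheredParameters` rather than to remove the `free_param` call. A minimal sketch (untested against this exact code path; the `client_module.attn.q_proj` attribute chain is taken from the debug prints above):

```python
import deepspeed

def q_proj_hidden_size(client_module):
    # Under ZeRO-3 the local view of the weight is torch.empty(0) after
    # free_param; re-gather it before reading its shape.
    param = client_module.attn.q_proj.weight
    with deepspeed.zero.GatheredParameters(param):
        return param.shape[1]  # full shape inside the context
```

On exit the context manager re-partitions the parameter, so memory usage stays the same as with the unmodified `free_param` behavior.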