Closed liuaiting closed 1 year ago
@cmikeh2 @jeffra @lekurile @awan-10
Hello @liuaiting. Thank you for reporting this issue to us. One of our recent fixes https://github.com/microsoft/DeepSpeed/pull/3462 may have already fixed this error. Could you update your deepspeed and give it another try?
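To confirm which DeepSpeed release is actually installed in the active environment before and after updating, a quick check like the one below may help. It is only a sketch: the helper name `installed_version` is made up for illustration, and it relies on nothing beyond the standard-library `importlib.metadata` module.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent.

    `installed_version` is a hypothetical helper, not part of DeepSpeed.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints None if deepspeed is not installed in this environment.
    print("deepspeed:", installed_version("deepspeed"))
```

Running it inside the same virtual environment that launches the training script avoids checking a different installation by mistake.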
After updating DeepSpeed, it runs successfully. Thank you very much for your reply.
@liuaiting Glad to hear the error is fixed. Closing the issue.
@HeyangQin I still encounter this with DeepSpeed version 0.10.3 when running step 3 with Llama 2 + LoRA + ZeRO-3 (v100*32G):
anaconda3.9/envs/dschat/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 52, in __setitem__
    raise RuntimeError(f"{param.ds_summary()} already in registry")
RuntimeError: {'id': 0, 'status': 'INFLIGHT', 'numel': 262144000, 'ds_numel': 262144000, 'shape': (64000, 4096), 'ds_shape': (64000, 4096), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536000])} already in registry
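For readers unfamiliar with the message, the traceback shows the error being raised from a `__setitem__` that refuses to register a parameter twice. The snippet below is a minimal sketch of that kind of guard, not DeepSpeed's actual implementation: the class name `InflightRegistry` and the string keys and handles are made up for illustration.

```python
class InflightRegistry(dict):
    """Illustrative dict that rejects a second registration of the same key."""

    def __setitem__(self, param, handle):
        # Registering a key that is still present is treated as an error,
        # mirroring the "already in registry" message in the traceback.
        if param in self:
            raise RuntimeError(f"{param} already in registry")
        super().__setitem__(param, handle)


registry = InflightRegistry()
registry["param-0"] = "handle-a"  # first registration succeeds

try:
    registry["param-0"] = "handle-b"  # second registration is rejected
except RuntimeError as err:
    message = str(err)
    print(message)
```

In other words, the error suggests that a fetch was started for a parameter whose previous fetch was still marked as in flight (note `'status': 'INFLIGHT'` in the summary).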
Even though my local copy of the repository is up to date, I am still encountering this error. The log is below; its last line shows the command I ran, with all options.
Invalidate trace cache @ step 55440: expected module 0, but got module 13
Traceback (most recent call last):
  File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 632, in <module>
    main()
  File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 492, in main
    out = trainer.generate_experience(batch_prompt['prompt'],
  File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 129, in generate_experience
    output = self.actor_model(seq, attention_mask=attention_mask)
  File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1801, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/user1/venv/ds/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 944, in forward
    outputs = self.model.decoder(
  File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/user1/venv/ds/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 653, in forward
    pos_embeds = self.embed_positions(attention_mask, past_key_values_length)
  File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    result = hook(self, args)
  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module
    self.__all_gather_params(params_to_fetch, forward)
  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params
    self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)
  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 453, in __all_gather_params_
    self.__inflight_param_registry[param] = handle
  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 52, in __setitem__
    raise RuntimeError(f"{param.ds_summary()} already in registry")
RuntimeError: {'id': 1, 'status': 'INFLIGHT', 'numel': 10496000, 'ds_numel': 10496000, 'shape': (2050, 5120), 'ds_shape': (2050, 5120), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([1312000])} already in registry
[2023-09-15 10:36:50,504] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907797
[2023-09-15 10:36:50,546] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907798
[2023-09-15 10:36:50,547] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907799
[2023-09-15 10:36:51,115] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907800
[2023-09-15 10:36:51,443] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907801
[2023-09-15 10:36:52,095] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907802
[2023-09-15 10:36:52,138] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907803
[2023-09-15 10:36:52,178] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907804
[2023-09-15 10:36:52,218] [ERROR] [launch.py:321:sigkill_handler] ['/home/user1/venv/ds/bin/python3', '-u', 'main.py', '--local_rank=7', '--data_path', 'Dahoas/rm-static', '--data_split', '2,4,4', '--actor_model_name_or_path', '/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/13b', '--critic_model_name_or_path', '/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m', '--num_padding_at_beginning', '1', '--per_device_generation_batch_size', '1', '--per_device_training_batch_size', '1', '--generation_batches', '1', '--ppo_epochs', '1', '--max_answer_seq_len', '256', '--max_prompt_seq_len', '256', '--actor_learning_rate', '5e-4', '--critic_learning_rate', '5e-6', '--num_train_epochs', '1', '--lr_scheduler_type', 'cosine', '--offload_reference_model', '--gradient_accumulation_steps', '1', '--actor_gradient_checkpointing', '--critic_gradient_checkpointing', '--num_warmup_steps', '100', '--deepspeed', '--seed', '1234', '--inference_tp_size', '2', '--actor_zero_stage', '3', '--critic_zero_stage', '3', '--disable_actor_dropout', '--actor_lora_dim', '128', '--actor_lora_module_name', 'decoder.layers.', '--output_dir', '/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/output/step3-models/13b'] exits with return code = 1
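As a sanity check on the parameter summary in the error, the numbers are consistent with the parameter being split evenly across the 8 launched processes: `shape` (2050, 5120) gives `numel` 10496000, and `ds_tensor.shape` 1312000 is one eighth of that. The sketch below just reproduces this arithmetic. The function name `shard_numel` is made up, and the even-split assumption is mine; it is not taken from DeepSpeed's partitioning code.

```python
from math import prod


def shard_numel(shape, world_size):
    """Elements per rank if a tensor of `shape` is split evenly across ranks.

    Hypothetical helper: assumes the element count divides evenly by world_size.
    """
    numel = prod(shape)
    if numel % world_size != 0:
        raise ValueError("element count does not divide evenly across ranks")
    return numel // world_size


# Values copied from the error message above (8 subprocesses were launched).
print(shard_numel((2050, 5120), 8))
# Values from the earlier Llama 2 report; a split across 4 ranks would match them.
print(shard_numel((64000, 4096), 4))
```

If the per-rank size in a summary does not match the expected number of ranks, that may hint at a different launch configuration than intended.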
Describe the bug
When running step 3 with ZeRO stage 3 and LoRA enabled for both the actor and critic models, an error is reported. It seems to indicate that BLOOMZ does not support ZeRO-3 + LoRA.
Log output
To Reproduce
The run.sh is:
The run_bloom_1b7.sh is:
Expected behavior
Use ZeRO-3 + LoRA for training step 3.
ds_report output
Screenshots
No. The error is in the log output.
System info (please complete the following information):
Docker context
No
Additional context
No