[BUG]deepspeed-chat training error on v100 * 8, raise assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary() after training of step3

iamsile commented 1 year ago

Describe the bug

Hi everybody, I'm training a LLaMA model in step 3 using DeepSpeed-Chat. In version 0.10.1, it raised the following error (see the logs below), so I switched to the branch HeyangQin/fix_issue_3156 (https://github.com/microsoft/DeepSpeed/issues/3156) and copied its code into master to fix it. After that, I found a new bug when training RL.

The full training command:

deepspeed --include localhost:7,6,5,4 /xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py --data_output_path /xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/rlhf_13b_data_output --actor_model_name_or_path /xxxxx/new_model_20230808/pytorch_model.bin --tokenizer_type LLaMATokenizer --llm_pretrained /xxxxx/new_model_20230808/pretrain --tokenizer_name_or_path /xxxxx/new_model_20230808/tokenizer --critic_model_name_or_path /xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step2_reward_model_finetuning/output_zh_13b_mplug_multi_modal_0815/pytorch_model.bin --data_path coco_zh/coco_zh_rm --actor_zero_stage 3 --critic_zero_stage 3 --num_padding_at_beginning 0 --per_device_train_batch_size 1 --per_device_mini_train_batch_size 1 --ppo_epochs 5 --actor_learning_rate 9.65e-6 --critic_learning_rate 5e-6 --gradient_accumulation_steps 1 --deepspeed --actor_lora_dim 1 --actor_lora_module_name q_proj.lora_A.default,q_proj.lora_B.default,v_proj.lora_A.default,v_proj.lora_B.default,k_proj --critic_lora_dim 1 --critic_lora_module_name q_proj.lora_A.default,q_proj.lora_B.default,v_proj.lora_A.default,v_proj.lora_B.default,k_proj --offload_reference_model --actor_learning_rate 5e-4 --critic_learning_rate 5e-6 --max_answer_seq_len 512 --max_prompt_seq_len 200 --actor_weight_decay 0.1 --critic_weight_decay 0.1 --actor_gradient_checkpointing --only_optimize_lora --output_dir /xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/rlhf_mplug_13b_model_output_20230815 --offload --print_answers

The training log:

  File "/xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/modeling_mplug_owl.py", line 1677, in forward
    text_embeds = self.get_input_embeddings()(texttokens) # Temporally Embedding
  File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl
    result = hook(self, input)
  File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=True)
  File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 328, in fetch_sub_module
    assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 1489, 'status': 'INFLIGHT', 'numel': 412180480, 'ds_numel': 412180480, 'shape': (80504, 5120), 'ds_shape': (80504, 5120), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': {543}, 'ds_tensor.shape': torch.Size([103045120])}


ds_report output

[2023-08-23 03:17:51,889] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1
deepspeed install path ........... ['/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed-0.10.2+8fb111c0-py3.10.egg/deepspeed']
deepspeed info ................... 0.10.2+8fb111c0, 8fb111c0, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 10.1
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
shared memory (/dev/shm) size .... 251.53 GB


System info:

OS: Ubuntu 20.04
GPU: 8 x V100 32G
PyTorch: 1.7.1
DeepSpeed: 0.10.1

GasolSun36 commented 1 year ago

Hi, I'm facing the same error. Any solutions?

xwjiang2010 commented 1 year ago

Same..

iamsile commented 1 year ago

Update: in my latest test, I found that copying HeyangQin/fix_issue_3156 into master doesn't work. In RL training, it only works at step 0; after that it always crashes. Here is the full report:

reward score --> step=0, rank=2, tensor([0.2354], device='cuda:2', dtype=torch.bfloat16)
reward score --> step=0, rank=0, tensor([0.4492], device='cuda:0', dtype=torch.bfloat16)
reward score --> step=0, rank=1, tensor([0.0601], device='cuda:1', dtype=torch.bfloat16)
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
Epoch: 0 | Step: 0 | PPO Epoch: 1 | Actor Loss: -0.55859375 | Critic Loss: 0.216796875 | Unsupervised Loss: 0.0
Average reward score: 0.2470703125

Traceback (most recent call last):
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 672, in <module>
    main()
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 541, in main
    out = trainer.generate_experience(batch_prompt['images'],
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 153, in generate_experience
    seq = self._generate_sequence(images, prompts, mask, step)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 95, in _generate_sequence
    seq = self.actor_model.module.forward(images,
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/peft/peft_model.py", line 296, in forward
    return self.get_base_model()(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/modeling_mplug_owl.py", line 1672, in forward
    outputs = self.language_model.generate(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 1538, in generate
    return self.greedy_search(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 2362, in greedy_search
    outputs = self(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 794, in forward
    outputs = self.model(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 677, in forward
    layer_outputs = decoder_layer(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 372, in forward
    hidden_states = self.input_layernorm(hidden_states)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl
    result = hook(self, input)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=True)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 328, in fetch_sub_module
    assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 918, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 5120, 'shape': (0,), 'ds_shape': (5120,), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': {578}, 'ds_tensor.shape': torch.Size([1707])}
[2023-08-28 07:34:58,470] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46301
[2023-08-28 07:34:58,470] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46302
[2023-08-28 07:34:58,476] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46303
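
A note on reading the `ds_summary()` dicts in the two reports: the `ds_tensor.shape` values are consistent with each rank holding an even slice of the full parameter, rounded up. Below is a minimal sketch of that arithmetic, assuming ZeRO-3 splits a parameter of `ds_numel` elements evenly across ranks and pads the last slice. The helper name `partition_numel` is made up for illustration and is not a DeepSpeed API.

```python
def partition_numel(ds_numel: int, world_size: int) -> int:
    """Per-rank slice size if a parameter is split evenly and rounded up (assumption)."""
    return (ds_numel + world_size - 1) // world_size  # integer ceiling division


# First report: 4 GPUs (--include localhost:7,6,5,4), embedding of shape (80504, 5120).
embedding_slice = partition_numel(80504 * 5120, 4)
print(embedding_slice)  # compare with 'ds_tensor.shape': torch.Size([103045120])

# Second report: ranks 0-2 appear in the log, layernorm weight of shape (5120,).
layernorm_slice = partition_numel(5120, 3)
print(layernorm_slice)  # compare with 'ds_tensor.shape': torch.Size([1707])
```

Both values line up with the summaries, so the partition sizes look as expected; what the assertion objects to is the `INFLIGHT` status, since it requires `ZeroParamStatus.AVAILABLE`.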

GasolSun36 commented 1 year ago

I think this may be related to these two parameters:

--inference_tp_size 1 \
--tp_gather_partition_size 1 \

I set both parameters to 1, and another error occurs; see #4226.

iamsile commented 1 year ago

I think this may be related to these two paras:

--inference_tp_size 1 --tp_gather_partition_size 1 \

I set both paras to 1, and another error occurs, see #4226

I didn't set these parameters, but I still hit this bug.

GasolSun36 commented 1 year ago

But the default value is 8 or 4, I think you need to manually set them to 1.

iamsile commented 1 year ago

But the default value is 8 or 4, I think you need to manually set them to 1.

I used --inference_tp_size 1 and --tp_gather_partition_size 1, but it doesn't work; it crashes with the same error.

awan-10 commented 1 year ago

@iamsile, @GasolSun36, @xwjiang2010 -- thank you for reporting the error.

We have two scripts for llama2 models that we have tested quite extensively and they have worked without errors for us. Can you please try either of the scripts in this folder first?

https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/llama2

Please let me know if these work on your system or not. We will work closely with you to resolve these issues :)

denizyuret commented 1 year ago

I have replicated the error on an 8xA100 (80G) setup with DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/llama2/run_llama2_7b.sh. The resulting training.log is here. Setting ZERO_STAGE=2 works without problems.

vinod-sarvam commented 1 year ago

That's correct. ZeRO=2 works, but it is not enough if we want to run 70B models.

GasolSun36 commented 1 year ago

@iamsile, @GasolSun36, @xwjiang2010 -- thank you for reporting the error.

We have two scripts for llama2 models that we have tested quite extensively and they have worked without errors for us. Can you please try either of the scripts in this folder first?

https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/llama2

Please let me know if these work on your system or not. We will work closely with you to resolve these issues :)

Hi awan-10, I tested the new sh file, which works for LLAMA2-7B (actor model) and OPT-350M (critic model) without any problems. However, when I test Baichuan-7B (actor model) with OPT-350M, it causes:

AssertionErrorAssertionError
assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
{'id': 227, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {389}, 'ds_tensor.shape': torch.Size([0])}
{'id': 227, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {389}, 'ds_tensor.shape': torch.Size([0])}

Does this framework not support Chinese LLMs such as Baichuan? I'm confused.

iamsile commented 1 year ago

I have replicated the error on a 8xA100(80G) setup with DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/llama2/run_llama2_7b.sh. The resulting training.log is here. Setting ZERO_STAGE=2 works without problems.

Hi, everyone. I tested run_llama2_7b.sh on step1 without any problems, and I need more time to test step2 and step3.

aksbaih commented 1 year ago

Same issue here. Trying to replicate https://github.com/ray-project/ray/tree/workspace_templates_2.6.1/doc/source/templates/04_finetuning_llms_with_deepspeed for llama 7B.

Setup: Cluster of 16 Nvidia L4 running docker anyscale/ray:2.6.1-py39-cu118

Deepspeed config:

{
    "fp16": {
        "enabled": "auto"
    },
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": 5e8,
        "stage3_prefetch_bucket_size": 5e8,
        "stage3_param_persistence_threshold": 1e6,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": false,
        "round_robin_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 10,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

Error stack trace:

  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 375, in train
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(AssertionError): ray::_RayTrainWorker__execute.get_next() (pid=1771, ip=10.128.0.28, actor_id=728100beae767cf33e536eb805000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7fbb764c3c70>)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 32, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/home/ray/ray/doc/source/templates/04_finetuning_llms_with_deepspeed/finetune_hf_llm.py", line 320, in training_function
    outputs = model(**batch)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1801, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 827, in forward
    logits = self.lm_head(hidden_states)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    result = hook(self, args)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 310, in fetch_sub_module
    assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 292, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {453}, 'ds_tensor.shape': torch.Size([0])}

Stage 2 works fine, only fails on stage 3.
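
Since stage 2 is the configuration reported to work here, one quick way to A/B the two is to derive a stage 2 config from the stage 3 one. This is a minimal sketch, assuming the config is loaded as a Python dict. `to_stage2` is a hypothetical helper, and dropping `offload_param` and the `stage3_*` keys reflects the assumption that those options only apply to stage 3.

```python
import copy


def to_stage2(ds_config: dict) -> dict:
    """Return a copy of a ZeRO config switched to stage 2 (illustrative helper)."""
    cfg = copy.deepcopy(ds_config)            # leave the caller's dict untouched
    zero = cfg.setdefault("zero_optimization", {})
    zero["stage"] = 2
    zero.pop("offload_param", None)           # assumed to be stage-3-only
    for key in [k for k in zero if k.startswith("stage3_")]:
        del zero[key]                         # stage3_* tuning knobs no longer apply
    return cfg


# Abbreviated version of the config posted above.
stage3 = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "stage3_prefetch_bucket_size": 5e8,
        "stage3_max_live_parameters": 1e9,
    },
    "train_micro_batch_size_per_gpu": "auto",
}
stage2 = to_stage2(stage3)
print(stage2["zero_optimization"])
```

As noted earlier in the thread, stage 2 may not be enough for very large models, so this is a diagnostic step rather than a fix.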

I tried a few things and none worked:

Minimized memory usage by using a batch size of 1.
Unpinned CPU memory.
Logged CPU memory just before the error (was only 20%) a few layers in.
Used AWS 16xNvidia A10's.
Reduced the size of all buffers in the deepspeed zero config by several orders of magnitude.

None of these worked. Same error.
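
For anyone who wants to repeat the CPU memory check from the list above, here is a rough standard-library sketch. It assumes a Linux-like system where `os.sysconf` exposes `SC_PHYS_PAGES` and `SC_AVPHYS_PAGES`; the function name is made up, and this is not necessarily how the 20% figure above was measured.

```python
import os
from typing import Optional


def cpu_mem_used_percent() -> Optional[float]:
    """Rough share of physical RAM that is not free, or None if unsupported."""
    try:
        total_pages = os.sysconf("SC_PHYS_PAGES")
        free_pages = os.sysconf("SC_AVPHYS_PAGES")
    except (ValueError, OSError, AttributeError):
        return None  # sysconf names are not available on this platform
    if total_pages <= 0 or free_pages < 0:
        return None
    # Free pages exclude the page cache, so this tends to overstate real usage.
    return 100.0 * (total_pages - free_pages) / total_pages


used = cpu_mem_used_percent()
print("cpu mem used %:", used)
```

Calling it right before the `outputs = model(**batch)` line in the training function gives a reading close to the point of failure.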

Thanks for the support, we are excited for this technology :)

denizyuret commented 1 year ago

I have discovered that the issue started with transformers-4.32.0; transformers-4.31.0 works fine.
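
A small guard can flag the affected version before a long job starts. This is a sketch under one stated assumption: anything at or above 4.32.0 is treated as possibly affected, because this thread only establishes that 4.32.0 fails and 4.31.0 works. The helper names are made up for illustration.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

FIRST_REPORTED_BAD = (4, 32, 0)  # first release reported as affected in this thread


def parse_version(text: str) -> Tuple[int, int, int]:
    """Turn '4.32.0' or '4.33.0.dev0' into a (major, minor, patch) tuple."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)       # keep leading digits only
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return (parts[0], parts[1], parts[2])


def possibly_affected(version: str) -> bool:
    """True if the version is at or above the first release reported as failing."""
    return parse_version(version) >= FIRST_REPORTED_BAD


def installed_transformers_version() -> Optional[str]:
    """Installed transformers version, or None if the package is absent."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


installed = installed_transformers_version()
if installed is not None and possibly_affected(installed):
    print(f"transformers {installed} is >= 4.32.0; 4.31.0 was reported to work")
```

Whether releases after 4.32.0 behave the same is not established here, so the check is deliberately conservative.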

iamsile commented 1 year ago

@iamsile, @GasolSun36, @xwjiang2010 -- thank you for reporting the error.

We have two scripts for llama2 models that we have tested quite extensively and they have worked without errors for us. Can you please try either of the scripts in this folder first?

https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/llama2

Please let me know if these work on your system or not. We will work closely with you to resolve these issues :)

@awan-10 Hi, I have finished all three steps. Step 1 and step 2 worked well, but step 3 still failed.

The step 3 script:

deepspeed --include localhost:7,6,5,4,3,2 --master_port 12346 main.py --data_path Dahoas/rm-static --data_split 2,4,4 --actor_model_name_or_path /xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/model_output --critic_model_name_or_path /xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/model_output --num_padding_at_beginning 1 --per_device_generation_batch_size 1 --per_device_training_batch_size 1 --generation_batches 1 --ppo_epochs 1 --max_answer_seq_len 256 --max_prompt_seq_len 256 --actor_learning_rate 9.65e-6 --critic_learning_rate 5e-6 --actor_weight_decay 0.1 --critic_weight_decay 0.1 --num_train_epochs 1 --lr_scheduler_type cosine --gradient_accumulation_steps 1 --actor_gradient_checkpointing --critic_gradient_checkpointing --offload_reference_model --disable_actor_dropout --num_warmup_steps 100 --deepspeed --seed 1234 --actor_zero_stage 3 --critic_zero_stage 3 --actor_lora_dim 64 --critic_lora_dim 64 --critic_lora_module_name "layers." --actor_lora_module_name "layers." --output_dir /xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/model_output --offload --offload_reference_model

The step 3 training log looks like this:

*[end] Initialized Actor Model [end] (duration: 118.33s)**
*****[start] Initializing Ref Model [start] **
[2023-09-04 03:51:48,694] [INFO] [partition_parameters.py:342:exit] finished initializing model - num_params = 582, num_elems = 13.48B
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/model_output and are newly initialized: ['model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/model_output and are newly initialized: ['model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/model_output and are newly initialized: ['model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/model_output and are newly initialized: ['model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. 
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/model_output and are newly initialized: ['model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. 
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/model_output and are newly initialized: ['model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. 
[2023-09-04 03:51:57,942] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.2, git-hash=unknown, git-branch=unknown [2023-09-04 03:51:57,953] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False [2023-09-04 03:51:57,954] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload [2023-09-04 03:51:58,181] [INFO] [utils.py:803:see_memory_usage] DeepSpeedZeRoOffload initialize [begin] [2023-09-04 03:51:58,182] [INFO] [utils.py:804:see_memory_usage] MA 1.06 GB Max_MA 1.85 GB CA 2.64 GB Max_CA 3 GB [2023-09-04 03:51:58,182] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 312.37 GB, percent = 62.1% Parameter Offload: Total persistent parameters: 266240 in 65 params [2023-09-04 03:51:58,414] [INFO] [utils.py:803:see_memory_usage] DeepSpeedZeRoOffload initialize [end] [2023-09-04 03:51:58,414] [INFO] [utils.py:804:see_memory_usage] MA 1.06 GB Max_MA 1.06 GB CA 2.64 GB Max_CA 3 GB [2023-09-04 03:51:58,414] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 312.37 GB, percent = 62.1% [2023-09-04 03:51:58,415] [INFO] [config.py:963:print] DeepSpeedEngine configuration: [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] amp_enabled .................. False [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] amp_params ................... False [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] autotuning_config ............ 
{ "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] bfloat16_enabled ............. False [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] checkpoint_parallel_write_pipeline False [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] checkpoint_tag_validation_enabled True [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] checkpoint_tag_validation_fail False [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f677c3cc700> [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] communication_data_type ...... None [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] curriculum_enabled_legacy .... False [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] curriculum_params_legacy ..... False [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] data_efficiency_enabled ...... False [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] dataloader_drop_last ......... False [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] disable_allgather ............ False [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] dump_state ................... 
False [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] dynamic_loss_scale_args ...... None [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] eigenvalue_enabled ........... False [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] eigenvalue_gas_boundary_resolution 1 [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] eigenvalue_layer_name ........ bert.encoder.layer [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] eigenvalue_layer_num ......... 0 [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] eigenvalue_max_iter .......... 100 [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] eigenvalue_stability ......... 1e-06 [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] eigenvalue_tol ............... 0.01 [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] eigenvalue_verbose ........... False [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] elasticity_enabled ........... False [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2023-09-04 03:51:58,416] [INFO] [config.py:967:print] fp16_auto_cast ............... False [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] fp16_enabled ................. True [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] fp16_master_weights_and_gradients False [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] global_rank .................. 0 [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] grad_accum_dtype ............. None [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] gradient_accumulation_steps .. 1 [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] gradient_clipping ............ 1.0 [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] gradient_predivide_factor .... 1.0 [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] hybrid_engine ................ 
enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] initial_dynamic_scale ........ 65536 [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] load_universal_checkpoint .... False [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] loss_scale ................... 0 [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] memory_breakdown ............. False [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] mics_hierarchial_params_gather False [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] mics_shard_size .............. -1 [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] optimizer_legacy_fusion ...... False [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] optimizer_name ............... None [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] optimizer_params ............. None [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] pld_enabled .................. False [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] pld_params ................... 
False [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] prescale_gradients ........... False [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] scheduler_name ............... None [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] scheduler_params ............. None [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] sparse_attention ............. None [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] sparse_gradients_enabled ..... False [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] steps_per_print .............. 10 [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] train_batch_size ............. 6 [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] train_micro_batch_size_per_gpu 1 [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] use_node_local_storage ....... False [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] wall_clock_breakdown ......... False [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] world_size ................... 6 [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] zero_allow_untested_optimizer False [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] zero_config .................. 
stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=False pipeline_loading_checkpoint=False override_module_apply=True [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] zero_enabled ................. True [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] zero_force_ds_cpu_optimizer .. True [2023-09-04 03:51:58,417] [INFO] [config.py:967:print] zero_optimization_stage ...... 
3 [2023-09-04 03:51:58,417] [INFO] [config.py:953:print_user_config] json = { "train_batch_size": 6, "train_micro_batch_size_per_gpu": 1, "steps_per_print": 10, "zero_optimization": { "stage": 3, "stage3_param_persistence_threshold": 1.000000e+04, "offload_param": { "device": "cpu" }, "memory_efficient_linear": false }, "fp16": { "enabled": true }, "gradient_clipping": 1.0, "prescale_gradients": false, "wall_clock_breakdown": false } [end] Initialized Ref Model [end] (duration: 23.50s) ****[start] Initializing Critic Model [start] **** [2023-09-04 03:51:59,800] [INFO] [partition_parameters.py:342:exit] finished initializing model - num_params = 872, num_elems = 20.08B
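
For readers skimming the dump, the user config printed by `print_user_config` for the Ref model can be written out as a plain Python dict. This is only a sketch: the values are copied from the log above, `world_size = 6` and `gradient_accumulation_steps = 1` are taken from the same engine configuration, and the helper name `expected_train_batch_size` is made up for illustration and is not part of DeepSpeed. It checks that the reported `train_batch_size` equals micro batch size per GPU × gradient accumulation steps × world size.

```python
import json

# Values copied from the "print_user_config" line in the Ref model log above.
ds_config = {
    "train_batch_size": 6,
    "train_micro_batch_size_per_gpu": 1,
    "steps_per_print": 10,
    "zero_optimization": {
        "stage": 3,
        "stage3_param_persistence_threshold": 1.0e4,
        "offload_param": {"device": "cpu"},
        "memory_efficient_linear": False,
    },
    "fp16": {"enabled": True},
    "gradient_clipping": 1.0,
    "prescale_gradients": False,
    "wall_clock_breakdown": False,
}

# Also read from the engine configuration in the same log.
world_size = 6
gradient_accumulation_steps = 1


def expected_train_batch_size(micro_batch_per_gpu, grad_accum_steps, num_ranks):
    """Hypothetical helper: global batch = micro batch * accumulation steps * ranks."""
    return micro_batch_per_gpu * grad_accum_steps * num_ranks


expected = expected_train_batch_size(
    ds_config["train_micro_batch_size_per_gpu"],
    gradient_accumulation_steps,
    world_size,
)
is_consistent = expected == ds_config["train_batch_size"]

print(json.dumps(ds_config, indent=2))
print("train_batch_size consistent with world size:", is_consistent)
```

The check passes for the values in this log, so the batch-size settings themselves look consistent with a 6-rank run.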

Creating model from_config took 1.394550085067749 seconds torch.load took 14.609045505523682 seconds Loading model state dict took 5.597141265869141 seconds [WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled! Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root... No modifications detected for re-loaded extension module cpu_adam, skipping build step... Loading extension module cpu_adam... Time to load cpu_adam op: 2.8365261554718018 seconds Adam Optimizer #1 is created with AVX512 arithmetic capability. Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.000000, adam_w=1 [WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled! Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root... No modifications detected for re-loaded extension module cpu_adam, skipping build step... Loading extension module cpu_adam... Time to load cpu_adam op: 2.3674826622009277 seconds Adam Optimizer #1 is created with AVX512 arithmetic capability. Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.000000, adam_w=1 [WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled! Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root... No modifications detected for re-loaded extension module cpu_adam, skipping build step... Loading extension module cpu_adam... Time to load cpu_adam op: 2.3644497394561768 seconds Adam Optimizer #1 is created with AVX512 arithmetic capability. Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.000000, adam_w=1 [WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled! Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root... 
No modifications detected for re-loaded extension module cpu_adam, skipping build step... Loading extension module cpu_adam... Time to load cpu_adam op: 2.3637731075286865 seconds Adam Optimizer #1 is created with AVX512 arithmetic capability. Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.000000, adam_w=1 [2023-09-04 03:53:17,190] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.2, git-hash=unknown, git-branch=unknown [WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled! Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root... No modifications detected for re-loaded extension module cpu_adam, skipping build step... Loading extension module cpu_adam... Time to load cpu_adam op: 2.3590688705444336 seconds Adam Optimizer #1 is created with AVX512 arithmetic capability. Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.000000, adam_w=1 [2023-09-04 03:53:17,239] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False [2023-09-04 03:53:17,241] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer [2023-09-04 03:53:17,241] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer [WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled! Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root... No modifications detected for re-loaded extension module cpu_adam, skipping build step... Loading extension module cpu_adam... Time to load cpu_adam op: 2.372196912765503 seconds Adam Optimizer #1 is created with AVX512 arithmetic capability. 
Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.000000, adam_w=1 [2023-09-04 03:53:17,281] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam [2023-09-04 03:53:17,281] [INFO] [utils.py:54:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'> [2023-09-04 03:53:17,282] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False [2023-09-04 03:53:17,282] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer [2023-09-04 03:53:17,584] [INFO] [utils.py:803:see_memory_usage] Stage 3 initialize beginning [2023-09-04 03:53:17,585] [INFO] [utils.py:804:see_memory_usage] MA 3.48 GB Max_MA 3.73 GB CA 14.28 GB Max_CA 14 GB [2023-09-04 03:53:17,585] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 312.73 GB, percent = 62.2% [2023-09-04 03:53:17,588] [INFO] [stage3.py:126:init] Reduce bucket size 500,000,000 [2023-09-04 03:53:17,588] [INFO] [stage3.py:127:init] Prefetch bucket size 30000000 [2023-09-04 03:53:17,826] [INFO] [utils.py:803:see_memory_usage] DeepSpeedZeRoOffload initialize [begin] [2023-09-04 03:53:17,826] [INFO] [utils.py:804:see_memory_usage] MA 3.48 GB Max_MA 3.48 GB CA 14.28 GB Max_CA 14 GB [2023-09-04 03:53:17,826] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 312.73 GB, percent = 62.2% Parameter Offload: Total persistent parameters: 270336 in 66 params [2023-09-04 03:53:18,200] [INFO] [utils.py:803:see_memory_usage] DeepSpeedZeRoOffload initialize [end] [2023-09-04 03:53:18,201] [INFO] [utils.py:804:see_memory_usage] MA 3.22 GB Max_MA 3.48 GB CA 14.28 GB Max_CA 14 GB [2023-09-04 03:53:18,201] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 312.73 GB, percent = 62.2% [2023-09-04 03:53:18,440] [INFO] [utils.py:803:see_memory_usage] Before creating fp16 partitions 
[2023-09-04 03:53:18,441] [INFO] [utils.py:804:see_memory_usage] MA 3.22 GB Max_MA 3.22 GB CA 14.28 GB Max_CA 14 GB [2023-09-04 03:53:18,441] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 312.74 GB, percent = 62.2% [2023-09-04 03:53:19,357] [INFO] [utils.py:803:see_memory_usage] After creating fp16 partitions: 2 [2023-09-04 03:53:19,358] [INFO] [utils.py:804:see_memory_usage] MA 3.13 GB Max_MA 3.22 GB CA 14.28 GB Max_CA 14 GB [2023-09-04 03:53:19,358] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 314.41 GB, percent = 62.5% [2023-09-04 03:53:19,585] [INFO] [utils.py:803:see_memory_usage] Before creating fp32 partitions [2023-09-04 03:53:19,586] [INFO] [utils.py:804:see_memory_usage] MA 3.13 GB Max_MA 3.13 GB CA 14.28 GB Max_CA 14 GB [2023-09-04 03:53:19,586] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 314.41 GB, percent = 62.5% [2023-09-04 03:53:19,839] [INFO] [utils.py:803:see_memory_usage] After creating fp32 partitions [2023-09-04 03:53:19,839] [INFO] [utils.py:804:see_memory_usage] MA 3.13 GB Max_MA 3.13 GB CA 14.28 GB Max_CA 14 GB [2023-09-04 03:53:19,839] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 314.59 GB, percent = 62.5% [2023-09-04 03:53:20,334] [INFO] [utils.py:803:see_memory_usage] Before initializing optimizer states [2023-09-04 03:53:20,335] [INFO] [utils.py:804:see_memory_usage] MA 3.13 GB Max_MA 3.13 GB CA 14.28 GB Max_CA 14 GB [2023-09-04 03:53:20,335] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 317.23 GB, percent = 63.1% [2023-09-04 03:53:21,077] [INFO] [utils.py:803:see_memory_usage] After initializing optimizer states [2023-09-04 03:53:21,078] [INFO] [utils.py:804:see_memory_usage] MA 3.13 GB Max_MA 3.13 GB CA 14.28 GB Max_CA 14 GB [2023-09-04 03:53:21,078] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 317.91 GB, percent = 63.2% [2023-09-04 03:53:21,078] [INFO] [stage3.py:445:_setup_for_real_optimizer] optimizer state 
initialized [2023-09-04 03:53:21,687] [INFO] [utils.py:803:see_memory_usage] After initializing ZeRO optimizer [2023-09-04 03:53:21,687] [INFO] [utils.py:804:see_memory_usage] MA 4.07 GB Max_MA 4.55 GB CA 15.21 GB Max_CA 15 GB [2023-09-04 03:53:21,687] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 318.47 GB, percent = 63.3% [2023-09-04 03:53:21,687] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedCPUAdam [2023-09-04 03:53:21,688] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler [2023-09-04 03:53:21,688] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f6771170c10> [2023-09-04 03:53:21,688] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-09-04 03:53:21,689] [INFO] [config.py:963:print] DeepSpeedEngine configuration: [2023-09-04 03:53:21,689] [INFO] [config.py:967:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2023-09-04 03:53:21,689] [INFO] [config.py:967:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2023-09-04 03:53:21,689] [INFO] [config.py:967:print] amp_enabled .................. False [2023-09-04 03:53:21,689] [INFO] [config.py:967:print] amp_params ................... False [2023-09-04 03:53:21,689] [INFO] [config.py:967:print] autotuning_config ............ 
{ "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2023-09-04 03:53:21,689] [INFO] [config.py:967:print] bfloat16_enabled ............. True [2023-09-04 03:53:21,689] [INFO] [config.py:967:print] checkpoint_parallel_write_pipeline False [2023-09-04 03:53:21,689] [INFO] [config.py:967:print] checkpoint_tag_validation_enabled True [2023-09-04 03:53:21,689] [INFO] [config.py:967:print] checkpoint_tag_validation_fail False [2023-09-04 03:53:21,689] [INFO] [config.py:967:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f677129e290> [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] communication_data_type ...... None [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] curriculum_enabled_legacy .... False [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] curriculum_params_legacy ..... False [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] data_efficiency_enabled ...... False [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] dataloader_drop_last ......... False [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] disable_allgather ............ False [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] dump_state ................... 
False [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] dynamic_loss_scale_args ...... None [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] eigenvalue_enabled ........... False [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] eigenvalue_gas_boundary_resolution 1 [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] eigenvalue_layer_name ........ bert.encoder.layer [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] eigenvalue_layer_num ......... 0 [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] eigenvalue_max_iter .......... 100 [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] eigenvalue_stability ......... 1e-06 [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] eigenvalue_tol ............... 0.01 [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] eigenvalue_verbose ........... False [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] elasticity_enabled ........... False [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] fp16_auto_cast ............... None [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] fp16_enabled ................. False [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] fp16_master_weights_and_gradients False [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] global_rank .................. 0 [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] grad_accum_dtype ............. None [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] gradient_accumulation_steps .. 1 [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] gradient_clipping ............ 1.0 [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] gradient_predivide_factor .... 1.0 [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] hybrid_engine ................ 
enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] initial_dynamic_scale ........ 1 [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] load_universal_checkpoint .... False [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] loss_scale ................... 1.0 [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] memory_breakdown ............. False [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] mics_hierarchial_params_gather False [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] mics_shard_size .............. -1 [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='step3_tensorboard/ds_tensorboard_logs/', job_name='step3_critic_tensorboard') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2023-09-04 03:53:21,690] [INFO] [config.py:967:print] optimizer_legacy_fusion ...... False [2023-09-04 03:53:21,691] [INFO] [config.py:967:print] optimizer_name ............... None [2023-09-04 03:53:21,691] [INFO] [config.py:967:print] optimizer_params ............. None [2023-09-04 03:53:21,691] [INFO] [config.py:967:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} [2023-09-04 03:53:21,691] [INFO] [config.py:967:print] pld_enabled .................. False [2023-09-04 03:53:21,691] [INFO] [config.py:967:print] pld_params ................... 
False [2023-09-04 03:53:21,691] [INFO] [config.py:967:print] prescale_gradients ........... False [2023-09-04 03:53:21,691] [INFO] [config.py:967:print] scheduler_name ............... None [2023-09-04 03:53:21,691] [INFO] [config.py:967:print] scheduler_params ............. None [2023-09-04 03:53:21,691] [INFO] [config.py:967:print] sparse_attention ............. None [2023-09-04 03:53:21,691] [INFO] [config.py:967:print] sparse_gradients_enabled ..... False [2023-09-04 03:53:21,691] [INFO] [config.py:967:print] steps_per_print .............. 10 [2023-09-04 03:53:21,691] [INFO] [config.py:967:print] train_batch_size ............. 6 [2023-09-04 03:53:21,691] [INFO] [config.py:967:print] train_micro_batch_size_per_gpu 1 [2023-09-04 03:53:21,691] [INFO] [config.py:967:print] use_node_local_storage ....... False [2023-09-04 03:53:21,691] [INFO] [config.py:967:print] wall_clock_breakdown ......... False [2023-09-04 03:53:21,691] [INFO] [config.py:967:print] world_size ................... 6 [2023-09-04 03:53:21,691] [INFO] [config.py:967:print] zero_allow_untested_optimizer False [2023-09-04 03:53:21,691] [INFO] [config.py:967:print] zero_config .................. 
stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=False pipeline_loading_checkpoint=False override_module_apply=True [2023-09-04 03:53:21,691] [INFO] [config.py:967:print] zero_enabled ................. True [2023-09-04 03:53:21,691] [INFO] [config.py:967:print] zero_force_ds_cpu_optimizer .. True [2023-09-04 03:53:21,691] [INFO] [config.py:967:print] zero_optimization_stage ...... 
3 [2023-09-04 03:53:21,691] [INFO] [config.py:953:print_user_config] json = { "train_batch_size": 6, "train_micro_batch_size_per_gpu": 1, "steps_per_print": 10, "zero_optimization": { "stage": 3, "offload_param": { "device": "cpu" }, "offload_optimizer": { "device": "cpu" }, "stage3_param_persistence_threshold": 1.000000e+04, "stage3_max_live_parameters": 3.000000e+07, "stage3_prefetch_bucket_size": 3.000000e+07, "memory_efficient_linear": false }, "bf16": { "enabled": true, "loss_scale_window": 100 }, "gradient_clipping": 1.0, "prescale_gradients": false, "wall_clock_breakdown": false, "hybrid_engine": { "enabled": false, "max_out_tokens": 512, "inference_tp_size": 1, "release_inference_cache": false, "pin_parameters": true, "tp_gather_partition_size": 8 }, "tensorboard": { "enabled": false, "output_path": "step3_tensorboard/ds_tensorboard_logs/", "job_name": "step3_critic_tensorboard" } } *[end] Initialized Critic Model [end] (duration: 83.27s)** ****[start] Initializing Reward Model [start] **** [2023-09-04 03:53:22,920] [INFO] [partition_parameters.py:342:exit] finished initializing model - num_params = 1162, num_elems = 26.69B Creating model from_config took 1.2404227256774902 seconds torch.load took 8.468313455581665 seconds Loading model state dict took 7.79913854598999 seconds [2023-09-04 03:53:41,075] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.2, git-hash=unknown, git-branch=unknown [2023-09-04 03:53:41,091] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False [2023-09-04 03:53:41,093] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload [2023-09-04 03:53:41,403] [INFO] [utils.py:803:see_memory_usage] DeepSpeedZeRoOffload initialize [begin] [2023-09-04 03:53:41,403] [INFO] [utils.py:804:see_memory_usage] MA 6.19 GB Max_MA 6.74 GB CA 17.56 GB Max_CA 18 GB [2023-09-04 03:53:41,403] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 318.4 GB, percent = 63.3% Parameter Offload: 
Total persistent parameters: 270336 in 66 params [2023-09-04 03:53:41,661] [INFO] [utils.py:803:see_memory_usage] DeepSpeedZeRoOffload initialize [end] [2023-09-04 03:53:41,662] [INFO] [utils.py:804:see_memory_usage] MA 6.19 GB Max_MA 6.19 GB CA 17.56 GB Max_CA 18 GB [2023-09-04 03:53:41,662] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 318.41 GB, percent = 63.3% [2023-09-04 03:53:41,663] [INFO] [config.py:963:print] DeepSpeedEngine configuration: [2023-09-04 03:53:41,663] [INFO] [config.py:967:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2023-09-04 03:53:41,663] [INFO] [config.py:967:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2023-09-04 03:53:41,663] [INFO] [config.py:967:print] amp_enabled .................. False [2023-09-04 03:53:41,663] [INFO] [config.py:967:print] amp_params ................... False [2023-09-04 03:53:41,663] [INFO] [config.py:967:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] bfloat16_enabled ............. 
False [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] checkpoint_parallel_write_pipeline False [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] checkpoint_tag_validation_enabled True [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] checkpoint_tag_validation_fail False [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f6770b9b940> [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] communication_data_type ...... None [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] curriculum_enabled_legacy .... False [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] curriculum_params_legacy ..... False [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] data_efficiency_config ....... 
{'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] data_efficiency_enabled ...... False [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] dataloader_drop_last ......... False [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] disable_allgather ............ False [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] dump_state ................... False [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] dynamic_loss_scale_args ...... None [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] eigenvalue_enabled ........... False [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] eigenvalue_gas_boundary_resolution 1 [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] eigenvalue_layer_name ........ bert.encoder.layer [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] eigenvalue_layer_num ......... 0 [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] eigenvalue_max_iter .......... 100 [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] eigenvalue_stability ......... 1e-06 [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] eigenvalue_tol ............... 0.01 [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] eigenvalue_verbose ........... False [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] elasticity_enabled ........... False [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] fp16_auto_cast ............... False [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] fp16_enabled ................. 
True [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] fp16_master_weights_and_gradients False [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] global_rank .................. 0 [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] grad_accum_dtype ............. None [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] gradient_accumulation_steps .. 1 [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] gradient_clipping ............ 1.0 [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] gradient_predivide_factor .... 1.0 [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] initial_dynamic_scale ........ 65536 [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] load_universal_checkpoint .... False [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] loss_scale ................... 0 [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] memory_breakdown ............. False [2023-09-04 03:53:41,664] [INFO] [config.py:967:print] mics_hierarchial_params_gather False [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] mics_shard_size .............. -1 [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] optimizer_legacy_fusion ...... 
False [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] optimizer_name ............... None [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] optimizer_params ............. None [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] pld_enabled .................. False [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] pld_params ................... False [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] prescale_gradients ........... False [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] scheduler_name ............... None [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] scheduler_params ............. None [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] sparse_attention ............. None [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] sparse_gradients_enabled ..... False [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] steps_per_print .............. 10 [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] train_batch_size ............. 6 [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] train_micro_batch_size_per_gpu 1 [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] use_node_local_storage ....... False [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] wall_clock_breakdown ......... False [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] world_size ................... 6 [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] zero_allow_untested_optimizer False [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] zero_config .................. 
stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=False pipeline_loading_checkpoint=False override_module_apply=True [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] zero_enabled ................. True [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] zero_force_ds_cpu_optimizer .. True [2023-09-04 03:53:41,665] [INFO] [config.py:967:print] zero_optimization_stage ...... 
3 [2023-09-04 03:53:41,665] [INFO] [config.py:953:print_user_config] json = { "train_batch_size": 6, "train_micro_batch_size_per_gpu": 1, "steps_per_print": 10, "zero_optimization": { "stage": 3, "stage3_param_persistence_threshold": 1.000000e+04, "offload_param": { "device": "cpu" }, "memory_efficient_linear": false }, "fp16": { "enabled": true }, "gradient_clipping": 1.0, "prescale_gradients": false, "wall_clock_breakdown": false } *[end] Initialized Reward Model [end] (duration: 19.97s)** Running training Beginning of Epoch 1/1, Total Generation Batches 5084 /xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None warnings.warn("None of the inputs have requires_grad=True. Gradients will be None") /xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None warnings.warn("None of the inputs have requires_grad=True. Gradients will be None") /xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None warnings.warn("None of the inputs have requires_grad=True. Gradients will be None") /xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None warnings.warn("None of the inputs have requires_grad=True. Gradients will be None") /xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None warnings.warn("None of the inputs have requires_grad=True. Gradients will be None") /xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. 
Gradients will be None warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
Traceback (most recent call last):
  File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 632, in <module>
    main()
  File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 492, in main
    out = trainer.generate_experience(batch_prompt['prompt'],
  File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 135, in generate_experience
    values = self.critic_model.forward_value(
  File "/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/model/reward_model.py", line 135, in forward_value
    transformer_outputs = self.rwtranrsformer(
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 685, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 681, in custom_forward
    return module(*inputs, output_attentions, None)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl

File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl hidden_states, self_attn_weights, present_key_value = self.self_attn( File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl hidden_states, self_attn_weights, present_key_value = self.self_attn( File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 305, in forward query_states = self.q_proj(hidden_states) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl result = forward_call(*input, *kwargs)result = forward_call(input, kwargs)result = forward_call(*input, *kwargs)result = forward_call(input, **kwargs)

File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 305, in forward result = forward_call(*input, **kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 305, in forward File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 305, in forward File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 305, in forward

File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 305, in forward query_states = self.q_proj(hidden_states) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl query_states = self.q_proj(hidden_states) query_states = self.q_proj(hidden_states) query_states = self.q_proj(hidden_states) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl query_states = self.q_proj(hidden_states) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl

  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl

File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl result = hook(self, input) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook result = hook(self, input)self.pre_sub_module_forward_function(module)

  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function result = hook(self, input)
result = hook(self, input) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

result = hook(self, input) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

ret_val = func(*args, *kwargs)ret_val = func(args, **kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook

ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook

File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)ret_val = func(*args, **kwargs)

File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context self.pre_sub_module_forward_function(module)self.pre_sub_module_forward_function(module)self.pre_sub_module_forward_function(module)self.pre_sub_module_forward_function(module)

  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function

File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function return func(*args, **kwargs)self.pre_sub_module_forward_function(module)

File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state) param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state) self.__all_gather_params(params_to_fetch, forward)

  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context ret_val = func(*args, *kwargs) ret_val = func(args, kwargs) ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context

File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context ret_val = func(*args, *kwargs) ret_val = func(args, **kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context

return func(*args, **kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params

File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module return func(*args, *kwargs) return func(args, kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module

File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module return func(*args, **kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module self.all_gather_params(params_to_fetch, forward) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn self.all_gatherparams(nonquantized_params, forward, quantize=self.zero_quantized_weights)
self.__all_gather_params(params_to_fetch, forward) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 446, in __all_gatherparams

self.all_gather_params(params_to_fetch, forward)self.all_gather_params(params_to_fetch, forward)ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn self.__all_gather_params(params_to_fetch, forward) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params

  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

ret_val = func(*args, kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params ret_val = func(*args, *kwargs)ret_val = func(args, kwargs)

File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params handle = partitioned_params[0].all_gather_coalesced(partitioned_params,ret_val = func(*args, **kwargs)

File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in all_gather_params self.all_gatherparams(nonquantized_params, forward, quantize=self.zero_quantized_weights)
ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 446, in __all_gatherparams

  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1132, in all_gather_coalesced

self.all_gatherparams(nonquantized_params, forward, quantize=self.zero_quantized_weights)
self.all_gatherparams(nonquantized_params, forward, quantize=self.zero_quantized_weights)self.__all_gatherparams(nonquantized_params, forward, quantize=self.zero_quantized_weights) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 446, in __all_gatherparams

File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 446, in all_gatherparams File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 446, in all_gatherparams self.__all_gatherparams(nonquantized_params, forward, quantize=self.zero_quantized_weights) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 446, in __all_gatherparams handle = partitioned_params[0].all_gather_coalesced(partitioned_params, File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn handle = partitioned_params[0].all_gather_coalesced(partitioned_params, ret_val = func(*args, **kwargs) handle = partitioned_params[0].all_gather_coalesced(partitioned_params,handle = partitioned_params[0].all_gather_coalesced(partitioned_params, File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1132, in all_gather_coalesced

File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn handle = partitioned_params[0].all_gather_coalesced(partitioned_params,
ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

  File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1132, in all_gather_coalesced
ret_val = func(*args, **kwargs)ret_val = func(*args, **kwargs)

File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1132, in all_gather_coalesced File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1132, in all_gather_coalesced ret_val = func(*args, **kwargs) File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1132, in all_gather_coalesced dtype=get_only_unique_item(p.ds_tensor.dtype dtype=get_only_unique_item(p.ds_tensor.dtype
dtype=get_only_unique_item(p.ds_tensor.dtypedtype=get_only_unique_item(p.ds_tensor.dtypedtype=get_only_unique_item(p.ds_tensor.dtype dtype=get_only_unique_item(p.ds_tensor.dtype File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/utils.py", line 842, in get_only_unique_item

File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/utils.py", line 842, in get_only_unique_item

File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/utils.py", line 842, in get_only_unique_item File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/utils.py", line 842, in get_only_unique_item File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/utils.py", line 842, in get_only_unique_item File "/xxxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/utils.py", line 842, in get_only_unique_item raise RuntimeError(f"expected there to be only one unique element in {items}") raise RuntimeError(f"expected there to be only one unique element in {items}")raise RuntimeError(f"expected there to be only one unique element in {items}")raise RuntimeError(f"expected there to be only one unique element in {items}") raise RuntimeError(f"expected there to be only one unique element in {items}")

RuntimeErrorRuntimeErrorraise RuntimeError(f"expected there to be only one unique element in {items}")RuntimeErrorRuntimeError: : RuntimeError : : expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param..all_gather_coalesced.. at 0x7fd38025bd80>expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param..all_gather_coalesced.. at 0x7f25da167d80>: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param..all_gather_coalesced.. at 0x7f671c161310>expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param..all_gather_coalesced.. at 0x7fdd842e7d80>RuntimeError

expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param..all_gather_coalesced.. at 0x7fbc541fbd80>

RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7f697870bd80>
[2023-09-04 03:59:24,656] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 53804
[2023-09-04 03:59:24,670] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 53805
[2023-09-04 03:59:25,169] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 53806
[2023-09-04 03:59:25,176] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 53807
[2023-09-04 03:59:25,176] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 53808
[2023-09-04 03:59:25,182] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 53809
[2023-09-04 03:59:25,186] [ERROR] [launch.py:321:sigkill_handler] ['/xxxxxx/env/mplug_owl/bin/python', '-u', 'main.py', '--local_rank=5', '--data_path', 'Dahoas/rm-static', '--data_split', '2,4,4', '--actor_model_name_or_path', '/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/model_output', '--critic_model_name_or_path', '/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/model_output', '--num_padding_at_beginning', '1', '--per_device_generation_batch_size', '1', '--per_device_training_batch_size', '1', '--generation_batches', '1', '--ppo_epochs', '1', '--max_answer_seq_len', '256', '--max_prompt_seq_len', '256', '--actor_learning_rate', '9.65e-6', '--critic_learning_rate', '5e-6', '--actor_weight_decay', '0.1', '--critic_weight_decay', '0.1', '--num_train_epochs', '1', '--lr_scheduler_type', 'cosine', '--gradient_accumulation_steps', '1', '--actor_gradient_checkpointing', '--critic_gradient_checkpointing', '--offload_reference_model', '--disable_actor_dropout', '--num_warmup_steps', '100', '--deepspeed', '--seed', '1234', '--actor_zero_stage', '3', '--critic_zero_stage', '3', '--actor_lora_dim', '64', '--critic_lora_dim', '64', '--critic_lora_module_name', 'layers.', '--actor_lora_module_name', 'layers.', '--output_dir', '/xxxxxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/model_output', '--offload', '--offload_reference_model'] exits with return code = 1
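
The final RuntimeError is raised by a helper that expects every parameter in the gathered group to report the same dtype. Below is a minimal sketch of that kind of check, written from the error message rather than copied from DeepSpeed: the body of `get_only_unique_item` here is an assumed re-implementation, `group_by_dtype` is a hypothetical helper, and the parameter names and dtypes in the usage lines are made up. The log prints a generator object instead of the offending dtypes, which suggests the generator itself was formatted into the message; the sketch materializes the items first so the values are visible.

```python
from collections import defaultdict


def get_only_unique_item(items):
    """Return the single distinct value in `items`, or raise RuntimeError.

    Assumed re-implementation for illustration; not DeepSpeed's source.
    """
    items = list(items)  # materialize so the error message shows real values
    unique = set(items)
    if len(unique) != 1:
        raise RuntimeError(f"expected there to be only one unique element in {items}")
    return unique.pop()


def group_by_dtype(named_dtypes):
    """Group parameter names by dtype to find the ones that differ.

    `named_dtypes` is an iterable of (name, dtype) pairs. With a real model
    it could be built as ((n, p.dtype) for n, p in model.named_parameters()).
    """
    groups = defaultdict(list)
    for name, dtype in named_dtypes:
        groups[str(dtype)].append(name)
    return dict(groups)


# Hypothetical parameter names and dtypes, for illustration only.
example = [("q_proj.weight", "float16"), ("q_proj.lora_A.weight", "float32")]
print(group_by_dtype(example))
```

If a grouping like this shows more than one dtype among the parameters fetched together, that would be consistent with the error above, although it does not by itself identify the cause.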

This is my pip list:
absl-py 1.4.0
accelerate 0.20.3
aiodns 3.0.0
aiofiles 23.1.0
aiohttp 3.8.4
aiosignal 1.3.1
altair 4.2.2
anyio 3.6.2
apex 0.1
appdirs 1.4.4
APScheduler 3.10.1
asttokens 2.2.1
async-timeout 4.0.2
attrs 23.1.0
backcall 0.2.0
blinker 1.6.2
brotlipy 0.7.0
cachetools 4.2.4
cchardet 2.1.7
certifi 2023.5.7
cffi 1.15.1
charset-normalizer 2.0.4
cityhash 0.2.4.post11
click 8.1.3
colorama 0.4.6
contourpy 1.0.7
cpm-kernels 1.0.11
cryptography 39.0.1
cycler 0.11.0
datasets 2.12.0
dbpool 1.2.1
decorator 5.1.1
deepspeed 0.10.2
dill 0.3.6
docker-pycreds 0.4.0
dotmap 1.3.30
dsc-auth 0.1.16
einops 0.6.1
entrypoints 0.4
executing 1.2.0
fastapi 0.95.1
ffmpy 0.3.0
filelock 3.12.0
fire 0.5.0
Flask 2.3.2
fonttools 4.39.3
frozenlist 1.3.3
fsspec 2023.5.0
gitdb 4.0.10
GitPython 3.1.31
google-api-core 2.11.0
google-auth 2.17.3
google-auth-oauthlib 0.4.6
google-cloud-speech 2.19.0
googleapis-common-protos 1.59.0
gradio 3.28.3
gradio_client 0.2.0
grpcio 1.54.2
grpcio-reflection 1.48.2
grpcio-status 1.54.2
h11 0.14.0
h5py 3.8.0
hiredis 2.2.3
hjson 3.1.0
httpcore 0.17.0
httpx 0.24.0
huggingface-hub 0.14.1
icecream 2.1.3
icetk 0.0.7
idna 3.4
infra-component 1.3.7
infra-framework 1.16.9
infra-kconf 1.0.6
infra-kess 1.0.6
infra-keycenter 1.0.1
infra-storage 1.2.3
ipython 8.13.2
itsdangerous 2.1.2
jedi 0.18.2
Jinja2 3.1.2
joblib 1.2.0
jsonschema 4.17.3
kazoo 2.9.0
kiwisolver 1.4.4
linkify-it-py 2.0.2
lxml 4.9.2
lz4 3.1.10
Markdown 3.4.3
markdown-it-py 2.2.0
markdown2 2.4.8
MarkupSafe 2.1.2
matplotlib 3.7.1
matplotlib-inline 0.1.6
mdit-py-plugins 0.3.3
mdurl 0.1.2
mkl-fft 1.3.1
mkl-random 1.2.2
mkl-service 2.4.0
multidict 6.0.4
multiprocess 0.70.14
munch 2.5.0
mysql-connector-python 8.0.31
ninja 1.11.1
numpy 1.24.3
oauthlib 3.2.2
openai 0.27.8
opencv-python 4.7.0.72
orjson 3.8.12
packaging 23.1
pandas 2.0.1
parso 0.8.3
pathtools 0.1.2
peft 0.3.0
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.4.0
pip 23.0.1
pkgconfig 1.5.5
prettytable 2.5.0
prompt-toolkit 3.0.38
proto-plus 1.22.2
protobuf 3.19.6
psutil 5.9.5
ptyprocess 0.7.0
pure-eval 0.2.2
py-cpuinfo 9.0.0
pyarrow 12.0.0
pyasn1 0.5.0
pyasn1-modules 0.3.0
pycares 4.3.0
pycparser 2.21
pycryptodome 3.18.0
pydantic 1.10.7
pydub 0.25.1
Pygments 2.15.1
pygtrans 1.5.2
pyOpenSSL 23.0.0
pyparsing 3.0.9
pyrsistent 0.19.3
pysmhasher 0.2.5
PySocks 1.7.1
python-dateutil 2.8.2
python-multipart 0.0.6
python-snappy 0.6.1
pytz 2021.3
pytz-deprecation-shim 0.1.0.post0
PyYAML 6.0
redis 4.5.5
regex 2023.5.5
requests 2.29.0
requests-oauthlib 1.3.1
responses 0.18.0
rsa 4.9
ruamel.yaml 0.17.24
ruamel.yaml.clib 0.2.7
safetensors 0.3.1
sconf 0.2.5
semantic-version 2.10.0
sentencepiece 0.1.99
sentry-sdk 1.24.0
setproctitle 1.3.2
setuptools 66.0.0
setuptools-scm 7.1.0
six 1.16.0
smmap 5.0.0
sniffio 1.3.0
sqlparse 0.4.4
stack-data 0.6.2
starlette 0.26.1
SwissArmyTransformer 0.3.6
tabulate 0.9.0
tensorboard 2.7.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorboardX 2.6
termcolor 2.3.0
timm 0.9.2
tokenizers 0.13.3
tomli 2.0.1
toolz 0.12.0
torch 1.13.1
torchaudio 0.13.1
torchvision 0.14.1
tornado 6.3.2
tqdm 4.65.0
traitlets 5.9.0
transformers 4.31.0
typing_extensions 4.5.0
tzdata 2023.3
tzlocal 4.3
uc-micro-py 1.0.2
unpaddedbase64 2.1.0
urllib3 1.26.15
uvicorn 0.22.0
wandb 0.15.3
wcwidth 0.2.6
websockets 11.0.3
Werkzeug 2.3.3
wheel 0.38.4
xmltodict 0.12.0
xxhash 3.2.0
yarl 1.9.2

This is my ds_report:
[2023-09-04 06:02:21,219] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/xxxxx/env/mplug_owl/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1
deepspeed install path ........... ['/xxxxx/env/mplug_owl/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.10.2, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 10.1
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
shared memory (/dev/shm) size .... 251.53 GB

zhouchuanCN commented 1 year ago

I have discovered that the issue started at transformers-4.32.0 and transformers-4.31.0 works fine.

I tried transformers-4.31.0 and it solved my problem. Many thanks to @denizyuret

awan-10 commented 1 year ago

Hi all, thank you for trying out various things. We are aware of this new problem with ZeRO Stage 3 and HF transformers > 4.31.0. Closing this issue, as the downgrade resolves the problem. Please reopen if you hit a new error. We also strongly encourage you to open a new issue if you still see errors after downgrading to transformers 4.31.0.
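
For anyone who wants to check their environment against the versions mentioned above, here is a minimal sketch. It only encodes the threshold reported in this thread (transformers 4.31.0 as the last version reported to work); the function names and the constant are hypothetical, and the check says nothing about other causes of the same error.

```python
import re
from importlib import metadata

# Last transformers release reported in this thread to work with ZeRO Stage 3.
LAST_REPORTED_GOOD = (4, 31, 0)


def release_tuple(version):
    """Extract the leading numeric release, e.g. '4.32.0.dev0' -> (4, 32, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(group or 0) for group in match.groups())


def newer_than_reported_good(version):
    """True if `version` is later than the last release reported to work."""
    return release_tuple(version) > LAST_REPORTED_GOOD


def installed_transformers_version():
    """Return the installed transformers version string, or None if absent."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


installed = installed_transformers_version()
if installed is None:
    print("transformers is not installed")
elif newer_than_reported_good(installed):
    print(f"transformers {installed} is newer than 4.31.0; the downgrade may be worth trying")
else:
    print(f"transformers {installed} is at or below 4.31.0")
```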

andrew-zm-ml commented 1 year ago

I am seeing this issue even with transformers==4.31.0.

moshe-mishan commented 1 year ago

Also seeing this issue when using either deepspeed 0.10.1 or 0.10.3 with transformers==4.31.0. I run my model using pytorch-lightning==2.0.9, torch==2.0.1+cu118, and cuda==11.8.

Wang-Xiaodong1899 commented 1 year ago

Same here. I also hit this issue even with transformers==4.31.0, using deepspeed==0.10.3 and torch==2.0.1. Any advice?

alex-athanassakos commented 11 months ago

Also seeing this with transformers==4.31.0

microsoft / DeepSpeed

[BUG]deepspeed-chat training error on v100 * 8, raise assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary() after training of step3 #4194