microsoft / DeepSpeedExamples

Example models using DeepSpeed

[BUG] deepspeed-chat training error on v100 * 8: raises assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary() after step 3 training #704

Closed iamsile closed 1 year ago

iamsile commented 1 year ago

Describe the bug

Hi everybody, I'm training a LLaMA model in step 3 of deepspeed-chat. With version 0.10.1 it raised an error, so I switched to the branch HeyangQin/fix_issue_3156 (https://github.com/microsoft/DeepSpeed/issues/3156) and copied that code onto master to fix it. After that, I hit a new bug when training RL (see the logs below).

The full training command:

deepspeed --include localhost:7,6,5,4 /xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py \
  --data_output_path /xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/rlhf_13b_data_output \
  --actor_model_name_or_path /xxxxx/new_model_20230808/pytorch_model.bin \
  --tokenizer_type LLaMATokenizer \
  --llm_pretrained /xxxxx/new_model_20230808/pretrain \
  --tokenizer_name_or_path /xxxxx/new_model_20230808/tokenizer \
  --critic_model_name_or_path /xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step2_reward_model_finetuning/output_zh_13b_mplug_multi_modal_0815/pytorch_model.bin \
  --data_path coco_zh/coco_zh_rm \
  --actor_zero_stage 3 --critic_zero_stage 3 \
  --num_padding_at_beginning 0 \
  --per_device_train_batch_size 1 --per_device_mini_train_batch_size 1 \
  --ppo_epochs 5 \
  --actor_learning_rate 9.65e-6 --critic_learning_rate 5e-6 \
  --gradient_accumulation_steps 1 \
  --deepspeed \
  --actor_lora_dim 1 \
  --actor_lora_module_name q_proj.lora_A.default,q_proj.lora_B.default,v_proj.lora_A.default,v_proj.lora_B.default,k_proj \
  --critic_lora_dim 1 \
  --critic_lora_module_name q_proj.lora_A.default,q_proj.lora_B.default,v_proj.lora_A.default,v_proj.lora_B.default,k_proj \
  --offload_reference_model \
  --actor_learning_rate 5e-4 --critic_learning_rate 5e-6 \
  --max_answer_seq_len 512 --max_prompt_seq_len 200 \
  --actor_weight_decay 0.1 --critic_weight_decay 0.1 \
  --actor_gradient_checkpointing \
  --only_optimize_lora \
  --output_dir /xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/rlhf_mplug_13b_model_output_20230815 \
  --offload --print_answers

The training log:

File "/xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/modeling_mplug_owl.py", line 1677, in forward
    text_embeds = self.get_input_embeddings()(texttokens)  # Temporally Embedding
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl
    result = hook(self, input)
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=True)
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 328, in fetch_sub_module
    assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 1489, 'status': 'INFLIGHT', 'numel': 412180480, 'ds_numel': 412180480, 'shape': (80504, 5120), 'ds_shape': (80504, 5120), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': {543}, 'ds_tensor.shape': torch.Size([103045120])}
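For context, the failing assert is the invariant the ZeRO-3 prefetch coordinator checks before running a sub-module's forward: every parameter of that sub-module must already be all-gathered (ds_status == AVAILABLE). Here the 80504 x 5120 embedding weight (numel 412,180,480; the local shard ds_tensor holds 412,180,480 / 4 = 103,045,120 elements, matching the 4 ranks in this run) is still INFLIGHT, i.e. its all-gather was launched but never completed. Below is a minimal diagnostic sketch, not DeepSpeed source; the helper name check_submodule_params is mine, added only to show how one could dump the offending parameters from a debugger or a temporary hook.

import torch
from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus


def check_submodule_params(sub_module: torch.nn.Module) -> None:
    """Print ds_summary() for every ZeRO-3 partitioned parameter of this
    sub-module that the coordinator would reject, i.e. whose all-gather has
    not completed (INFLIGHT) or has not been issued at all (NOT_AVAILABLE)."""
    for name, param in sub_module.named_parameters(recurse=False):
        if hasattr(param, "ds_status") and param.ds_status != ZeroParamStatus.AVAILABLE:
            # ds_summary() returns the same dict shown in the AssertionError above
            print(name, param.ds_summary())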


ds_report output

[2023-08-23 03:17:51,889] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/vq_ssd/taowei03/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1
deepspeed install path ........... ['/vq_ssd/taowei03/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed-0.10.2+8fb111c0-py3.10.egg/deepspeed']
deepspeed info ................... 0.10.2+8fb111c0, 8fb111c0, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 10.1
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
shared memory (/dev/shm) size .... 251.53 GB


System info:
- OS: Ubuntu 20.04
- GPU: 8 * V100 32G
- PyTorch: 1.7.1
- DeepSpeed: 0.10.1

iamsile commented 1 year ago

I have fixed it; closing this issue.

jiahuanluo commented 1 year ago

@iamsile Could you tell us how you fixed this, please?

oglee815 commented 12 months ago

It still happens.