microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] error: `past_key, past_value = layer_past`, how to solve this? #6522

Open lovychen opened 2 months ago

lovychen commented 2 months ago

Describe the bug
When I run training for RLHF step 3:

Actor_Lr=9.65e-6
Critic_Lr=5e-6

#--data_path Dahoas/rm-static \
#--offload_reference_model \
deepspeed --master_port 12346 main_step3.py \
   --data_path ${data_path}/beyond/rlhf-reward-single-round-trans_chinese_step3 \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --data_output_path ${data_path}/train_data_file_step3 \
   --num_padding_at_beginning 1 \
   --per_device_generation_batch_size 1 \
   --per_device_training_batch_size 1 \
   --generation_batches 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 256 \
   --max_prompt_seq_len 256 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --actor_weight_decay 0.1 \
   --critic_weight_decay 0.1 \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --actor_gradient_checkpointing \
   --critic_gradient_checkpointing \
   --actor_dropout 0.0 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --enable_ema \
   --output_dir $output_path
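For context, `--enable_hybrid_engine` switches the actor between DeepSpeed's inference kernels during generation and its training engine during PPO updates. A minimal sketch of the corresponding `hybrid_engine` block in a DeepSpeed config dict (field names as documented in the DeepSpeed config docs; the values here are illustrative defaults, not this run's exact settings):

```python
# Sketch of the DeepSpeed config block that --enable_hybrid_engine maps to.
# Values are illustrative, not the exact settings of this run.
ds_config = {
    "hybrid_engine": {
        "enabled": True,                 # toggled by --enable_hybrid_engine
        "max_out_tokens": 512,           # max tokens kept per generation pass
        "inference_tp_size": 1,          # tensor-parallel size used for generation
        "release_inference_cache": False,
        "pin_parameters": True,
    }
}
```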

Log output
I got this error:

[rank3]: ValueError: not enough values to unpack (expected 2, got 0)
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/deepspeed/DeepSpeedExamples/applications/DeepSpeed-Chat/main_step3.py", line 673, in <module>
[rank1]:     main()
[rank1]:   File "/home/deepspeed/DeepSpeedExamples/applications/DeepSpeed-Chat/main_step3.py", line 527, in main
[rank1]:     out = trainer.generate_experience(batch_prompt['prompt'],
[rank1]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/deepspeed/DeepSpeedExamples/applications/DeepSpeed-Chat/dschat/rlhf/ppo_trainer.py", line 140, in generate_experience
[rank1]:     seq = self._generate_sequence(prompts, mask, step)
[rank1]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/deepspeed/DeepSpeedExamples/applications/DeepSpeed-Chat/dschat/rlhf/ppo_trainer.py", line 87, in _generate_sequence
[rank1]:     seq = self.actor_model.module.generate(
[rank1]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/tools/anaconda3/envs/deepspeed/lib/python3.12/site-packages/deepspeed/runtime/hybrid_engine.py", line 253, in generate
[rank1]:     generate_ret_vals = self._generate(*inputs, **kwargs)
[rank1]:                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/tools/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/tools/anaconda3/envs/deepspeed/lib/python3.12/site-packages/transformers/generation/utils.py", line 2024, in generate
[rank1]:     result = self._sample(
[rank1]:              ^^^^^^^^^^^^^
[rank1]:   File "/home/tools/anaconda3/envs/deepspeed/lib/python3.12/site-packages/transformers/generation/utils.py", line 2982, in _sample
[rank1]:     outputs = self(**model_inputs, return_dict=True)
[rank1]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/tools/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/tools/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1609, in _call_impl
[rank1]:     result = forward_call(*args, **kwargs)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/tools/anaconda3/envs/deepspeed/lib/python3.12/site-packages/transformers/models/bloom/modeling_bloom.py", line 955, in forward
[rank1]:     transformer_outputs = self.transformer(
[rank1]:                           ^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/tools/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/tools/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1609, in _call_impl
[rank1]:     result = forward_call(*args, **kwargs)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/tools/anaconda3/envs/deepspeed/lib/python3.12/site-packages/transformers/models/bloom/modeling_bloom.py", line 744, in forward
[rank1]:     outputs = block(
[rank1]:               ^^^^^^
[rank1]:   File "/home/tools/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/tools/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1609, in _call_impl
[rank1]:     result = forward_call(*args, **kwargs)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/tools/anaconda3/envs/deepspeed/lib/python3.12/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 171, in forward
[rank1]:     self.attention(input,
[rank1]:   File "/home/tools/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/tools/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/tools/anaconda3/envs/deepspeed/lib/python3.12/site-packages/deepspeed/ops/transformer/inference/ds_attention.py", line 160, in forward
[rank1]:     context_layer, key_layer, value_layer = self.compute_attention(qkv_out=qkv_out,
[rank1]:                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/tools/anaconda3/envs/deepspeed/lib/python3.12/site-packages/deepspeed/ops/transformer/inference/ds_attention.py", line 239, in compute_attention
[rank1]:     past_key, past_value = layer_past
[rank1]:     ^^^^^^^^^^^^^^^^^^^^
[rank1]: ValueError: not enough values to unpack (expected 2, got 0)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
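The failure itself is plain tuple unpacking: `compute_attention` expects `layer_past` to be a `(key, value)` pair, but under the hybrid engine it apparently arrives as an empty tuple. A minimal standalone repro of the same ValueError (the names mirror the traceback, not the actual DeepSpeed code):

```python
# Minimal repro of the error at ds_attention.py line 239:
# unpacking two values out of an empty tuple raises the same ValueError.
layer_past = ()  # what the attention op seems to receive here

try:
    past_key, past_value = layer_past
except ValueError as e:
    print(e)  # not enough values to unpack (expected 2, got 0)
```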

To Reproduce
Run the deepspeed command above (step 3 of the DeepSpeed-Chat RLHF pipeline) with --enable_hybrid_engine; required packages and versions are listed under System info below.

Expected behavior
Step 3 generation (trainer.generate_experience) should run without errors.

ds_report output

ds_report

[2024-09-11 19:27:52,618] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fp_quantizer ........... [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
gds .................... [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
 [WARNING]  using untested triton version (3.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/tools/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch']
torch version .................... 2.4.0+cu121
deepspeed install path ........... ['/home/tools/anaconda3/envs/deepspeed/lib/python3.12/site-packages/deepspeed']
deepspeed info ................... 0.15.1, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.4, cuda 12.1
shared memory (/dev/shm) size .... 503.77 GB


System info:

 - OS: Ubuntu 20.04.6 LTS
 - GPU: 4x NVIDIA L20 (46 GB)
 - Python: 3.12.0
 - CUDA: 12.1
 - torch: 2.4.0
 - deepspeed: 0.15.1
 - transformers: 4.44.2
 - accelerate: 0.33.0
 - [DeepSpeed-MII](https://github.com/microsoft/deepspeed-mii): 0.15.1


Additional context

/home/deepspeed/DeepSpeedExamples/applications/DeepSpeed-Chat/dschat/utils/model/model_utils.py:155: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  model_ckpt_state_dict = torch.load(model_ckpt_path, map_location='cpu')
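As a side note, this FutureWarning can be silenced by passing weights_only=True to torch.load in model_utils.py, as the warning itself recommends (a sketch; the checkpoint path below is a placeholder):

```python
import torch

model_ckpt_path = "pytorch_model.bin"  # placeholder for the real checkpoint path

# weights_only=True restricts unpickling to tensors and basic containers,
# which is what the FutureWarning recommends for untrusted checkpoints.
model_ckpt_state_dict = torch.load(model_ckpt_path, map_location='cpu', weights_only=True)
```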
lovychen commented 2 months ago

When this line is commented out, it works for me: `# --enable_hybrid_engine \`
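If the hybrid engine is actually needed, a possible local workaround (an untested sketch, not an official fix) is to guard the unpack in deepspeed/ops/transformer/inference/ds_attention.py so an empty layer_past is treated as "no KV cache yet":

```python
# Untested sketch of a defensive guard around ds_attention.py line 239:
# treat None or an empty tuple the same as "no cached key/value".
if layer_past:
    past_key, past_value = layer_past
else:
    past_key, past_value = None, None
```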

loadams commented 1 month ago

@lovychen - do you need the hybrid engine enabled? Is this resolved? If not, could you share the repro script?