microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] `assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()` when training deepspeed-chat step3 with ZeRO3 and a larger `generation_batches` #4533

Open GoSz opened 11 months ago

GoSz commented 11 months ago

Describe the bug
When training DeepSpeed-Chat Step 3 with ZeRO-3 (without the hybrid engine), if we set `generation_batches >= 3`, or `generation_batches >= 2` together with `ppo_epochs >= 2`, DeepSpeed raises `assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()` during `generate_experience` in the second training step.
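
For context: the assertion that fires (visible in the traceback below) comes from ZeRO-3's partitioned-parameter coordinator, which requires every partitioned parameter of a submodule to be fully gathered (AVAILABLE) before that submodule's forward runs. The following is a minimal, self-contained sketch of that check, not DeepSpeed's actual implementation; only the ZeroParamStatus names mirror the real enum in deepspeed.runtime.zero.partition_parameters.

from enum import Enum

# Illustrative sketch only: the status names mirror DeepSpeed's ZeroParamStatus,
# but the classes and logic below are simplified stand-ins, not the real code.
class ZeroParamStatus(Enum):
    NOT_AVAILABLE = 1   # parameter is partitioned across ranks
    INFLIGHT = 2        # an all-gather has been launched but has not completed yet
    AVAILABLE = 3       # the full parameter is materialized on this rank

class FakeParam:
    # Hypothetical stand-in for a ZeRO-3 partitioned parameter.
    def __init__(self, param_id):
        self.param_id = param_id
        self.ds_status = ZeroParamStatus.NOT_AVAILABLE

    def ds_summary(self):
        return {'id': self.param_id, 'status': self.ds_status.name}

def fetch_sub_module(params):
    # Launch all-gathers for parameters that are still partitioned.
    for p in params:
        if p.ds_status == ZeroParamStatus.NOT_AVAILABLE:
            p.ds_status = ZeroParamStatus.INFLIGHT
    # The coordinator is supposed to wait on the gather handles here, moving each
    # parameter from INFLIGHT to AVAILABLE; if that bookkeeping gets out of sync
    # (e.g. after the prefetch trace cache is invalidated), the check below fails.
    for p in params:
        assert p.ds_status == ZeroParamStatus.AVAILABLE, p.ds_summary()

fetch_sub_module([FakeParam(366)])   # raises AssertionError with status 'INFLIGHT', as in the log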

Log output

***** Running training *****
Beginning of Epoch 1/1, Total Generation Batches 954
[2023-10-18 06:02:18,170] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2023-10-18 06:02:18,977] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
Invalidate trace cache @ step 271: expected module 2, but got module 271
[2023-10-18 06:02:19,948] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
Invalidate trace cache @ step 271: expected module 815, but got module 814
[2023-10-18 06:02:21,398] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
Epoch: 0 | Step: 2 | PPO Epoch: 1 | Actor Loss: 0.032704671223958336 | Critic Loss: 0.0035425821940104165 | Unsupervised Loss: 0.0
End-to-End => Latency: 100.65s, TFLOPs: 0.79, Samples/sec: 0.95, Time/seq 1.05s, Batch Size: 96, Total Seq. Length: 512
Generation => Latency: 30.45s, Per-token Latency 118.94 ms, TFLOPs: 0.18, BW: 22.12 GB/sec, Answer Seq. Length: 256
Training   => Latency: 9.30s, TFLOPs: 6.74
Actor Model Parameters => 1.316 B, Critic Model Parameters => 0.331 B
Average reward score: -0.4375
-------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/test/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 660, in <module>
    main()
  File "/root/test/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 520, in main
    out = trainer.generate_experience(batch_prompt['prompt'],
  File "/root/test/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 125, in generate_experience
    seq = self._generate_sequence(prompts, mask, step)
  File "/root/test/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 87, in _generate_sequence
    seq = self.actor_model.module.generate(
  File "/root/miniconda3/envs/test/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/test/lib/python3.9/site-packages/transformers/generation/utils.py", line 1538, in generate
    return self.greedy_search(
  File "/root/miniconda3/envs/test/lib/python3.9/site-packages/transformers/generation/utils.py", line 2362, in greedy_search
    outputs = self(
  File "/root/miniconda3/envs/test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/test/lib/python3.9/site-packages/transformers/models/opt/modeling_opt.py", line 944, in forward
    outputs = self.model.decoder(
  File "/root/miniconda3/envs/test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/test/lib/python3.9/site-packages/transformers/models/opt/modeling_opt.py", line 710, in forward
    layer_outputs = decoder_layer(
  File "/root/miniconda3/envs/test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/test/lib/python3.9/site-packages/transformers/models/opt/modeling_opt.py", line 353, in forward
    hidden_states = self.fc1(hidden_states)
  File "/root/miniconda3/envs/test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    result = hook(self, args)
  File "/root/miniconda3/envs/test/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/test/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/root/miniconda3/envs/test/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
  File "/root/miniconda3/envs/test/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/test/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/test/lib/python3.9/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 310, in fetch_sub_module
    assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 366, 'status': 'INFLIGHT', 'numel': 16777216, 'ds_numel': 16777216, 'shape': (8192, 2048), 'ds_shape': (8192, 2048), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': {257}, 'ds_tensor.shape': torch.Size([2097152])}

To Reproduce

  1. Training script: DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/opt/single_node/run_1.3b.sh
  2. Set ACTOR_ZERO_STAGE=3 and generation_batches=3, and disable the hybrid engine (otherwise it raises a different error).
  3. Use the actor/critic models provided in the script.

ds_report output

[2023-10-18 03:39:04,020] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/root/miniconda3/envs/test/lib/python3.9/site-packages/torch']
torch version .................... 2.0.1
deepspeed install path ........... ['/root/miniconda3/envs/test/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.11.1, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
shared memory (/dev/shm) size .... 964.44 GB

System info:

gfgkmn commented 8 months ago

I encountered the same issue. After some debugging, I began to suspect that it might be caused by the hybrid engine, or perhaps by switching between eval() and train() modes too frequently, although I'm not entirely sure why.

Here is my temporary solution:

For main.py

         for step, (batch_prompt, batch_unsupervised) in enumerate(
                 zip(prompt_train_dataloader, unsupervised_train_dataloader)):

@@ -491,12 +492,13 @@ def main():
                 prompts = prompts[:, length - args.max_prompt_seq_len:]
                 raise ValueError("Prompt length is too long")

-            # alreay assert prompt.shape = base_prompt.shape
+            # alreay assert prompt.shape = reward_prompt.shape

+            trainer.should_switch_mode('eval')
             out = trainer.generate_experience(batch_prompt['prompt'],
                                               batch_prompt['prompt_att_mask'],
-                                              batch_prompt['base_prompt'],
-                                              batch_prompt['base_prompt_att_mask'],
+                                              batch_prompt['reward_prompt'],
+                                              batch_prompt['reward_prompt_att_mask'],
                                               step)

             training_start = time.time()
@@ -514,6 +516,7 @@ def main():
                 args.global_rank)

             if exp_dataset is not None:
+                trainer.should_switch_mode('train')
                 inner_iter = 0
                 actor_loss_sum, critic_loss_sum, unsup_loss_sum = 0, 0, 0
                 mean_kl_sum, mean_entropy_sum = 0, 0
@@ -592,6 +595,10 @@ def main():
                                       global_step=step)
                     writer.flush()

+            else:
+                continue

For ppo_trainer.py

@@ -71,11 +71,12 @@ class DeepSpeedPPOTrainer():
         self.gamma = 1.0
         self.lam = 0.95
         self.generate_time = 0.0
+        self.curren_mode = 'train'

+    def should_switch_mode(self, target_mode='eval'):
+        if target_mode == 'eval':
+            if self.curren_mode == 'train':
+                self.eval()
+                self.curren_mode = 'eval'
+        else:
+            self.train()
+            self.curren_mode = 'train'
+
+        
     def generate_experience(self, prompts, mask, reward_prompts, reward_mask, step):
-        self.eval()
         generate_start = time.time()
         seq, reward_seq = self._generate_sequence(prompts, mask, reward_prompts, reward_mask, step)

         generate_end = time.time()
-        self.train()
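
For readability, the two diffs above boil down to the stand-alone sketch below. The should_switch_mode helper and the curren_mode attribute name are taken from the patch; the eval()/train() bodies are stubbed out here (in the real trainer they put the underlying models into eval or train mode), so this is an illustration rather than a drop-in replacement.

class DeepSpeedPPOTrainer:
    # Condensed sketch of the patched trainer above, not the full class.

    def __init__(self):
        self.curren_mode = 'train'               # attribute name as used in the patch

    def eval(self):
        print('putting models into eval mode')   # stub; the real method switches the models

    def train(self):
        print('putting models into train mode')  # stub; the real method switches the models

    def should_switch_mode(self, target_mode='eval'):
        # Mirrors the patch: call eval() only when leaving 'train' mode;
        # asking for 'train' always calls train() and records the mode.
        if target_mode == 'eval':
            if self.curren_mode == 'train':
                self.eval()
                self.curren_mode = 'eval'
        else:
            self.train()
            self.curren_mode = 'train'

trainer = DeepSpeedPPOTrainer()
trainer.should_switch_mode('eval')    # before generate_experience(): switches once
trainer.should_switch_mode('eval')    # repeated request is a no-op
trainer.should_switch_mode('train')   # before the PPO update loop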

So far, this solution has been working effectively, but it still hits an NCCL ECC error after 74 steps of training. I am actively working to resolve that issue, although I currently have no leads on a fix.

I'm sharing this in hopes of finding a better solution or gaining a deeper understanding of the problem.

mayiran1999 commented 8 months ago

I encountered the same issue and solved it by pinning transformers==4.31.0. However, I still do not understand what caused this error. Could anyone explain it? Thanks!
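
A quick sanity check of which transformers version an environment actually imports (4.31.0 is the version reported above as working; the pin itself is done with pip or a requirements file):

# Prints and verifies the active transformers version before launching step 3.
import transformers

print(transformers.__version__)
assert transformers.__version__ == "4.31.0", (
    "expected transformers 4.31.0 (the version reported to work in this thread), got "
    + transformers.__version__
)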