GoSz opened 11 months ago
I encountered the same issue, and after some debugging I began to suspect that it might be caused by the hybrid engine, or perhaps by switching between `eval()` and `train()` modes too frequently, although I'm not entirely sure why.
Here is my temporary solution:
For `main.py`:
```diff
 for step, (batch_prompt, batch_unsupervised) in enumerate(
         zip(prompt_train_dataloader, unsupervised_train_dataloader)):
@@ -491,12 +492,13 @@ def main():
             prompts = prompts[:, length - args.max_prompt_seq_len:]
             raise ValueError("Prompt length is too long")
-        # alreay assert prompt.shape = base_prompt.shape
+        # alreay assert prompt.shape = reward_prompt.shape
+        trainer.should_switch_mode('eval')
         out = trainer.generate_experience(batch_prompt['prompt'],
                                           batch_prompt['prompt_att_mask'],
-                                          batch_prompt['base_prompt'],
-                                          batch_prompt['base_prompt_att_mask'],
+                                          batch_prompt['reward_prompt'],
+                                          batch_prompt['reward_prompt_att_mask'],
                                           step)
         training_start = time.time()
@@ -514,6 +516,7 @@ def main():
                                  args.global_rank)
         if exp_dataset is not None:
+            trainer.should_switch_mode('train')
             inner_iter = 0
             actor_loss_sum, critic_loss_sum, unsup_loss_sum = 0, 0, 0
             mean_kl_sum, mean_entropy_sum = 0, 0
@@ -592,6 +595,10 @@ def main():
                                   global_step=step)
                 writer.flush()
+        else:
+            continue
```
For `ppo_trainer.py`:
```diff
@@ -71,11 +71,12 @@ class DeepSpeedPPOTrainer():
         self.gamma = 1.0
         self.lam = 0.95
         self.generate_time = 0.0
+        self.current_mode = 'train'
+
+    def should_switch_mode(self, target_mode='eval'):
+        if target_mode == 'eval':
+            if self.current_mode == 'train':
+                self.eval()
+                self.current_mode = 'eval'
+        else:
+            self.train()
+            self.current_mode = 'train'
+
     def generate_experience(self, prompts, mask, reward_prompts, reward_mask, step):
-        self.eval()
         generate_start = time.time()
         seq, reward_seq = self._generate_sequence(prompts, mask, reward_prompts, reward_mask, step)
         generate_end = time.time()
-        self.train()
```
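The intent of the guard can be reproduced as a minimal, self-contained sketch. The class name `ModeGuard` and the call counters are illustrative only (not from DeepSpeed-Chat), and a symmetric check is added on the `train` branch; the point is that repeated requests for the same mode trigger only one actual `eval()`/`train()` call:

```python
class ModeGuard:
    """Sketch of the mode-switching guard from the patch above.

    Tracks the current mode and only calls eval()/train() when the mode
    actually changes, avoiding redundant switches between generation
    and PPO updates.
    """

    def __init__(self):
        self.current_mode = 'train'
        self.eval_calls = 0   # instrumentation for this sketch only
        self.train_calls = 0

    def eval(self):
        self.eval_calls += 1

    def train(self):
        self.train_calls += 1

    def should_switch_mode(self, target_mode='eval'):
        if target_mode == 'eval':
            if self.current_mode == 'train':
                self.eval()
                self.current_mode = 'eval'
        else:
            if self.current_mode == 'eval':
                self.train()
                self.current_mode = 'train'


guard = ModeGuard()
guard.should_switch_mode('eval')   # train -> eval: eval() fires
guard.should_switch_mode('eval')   # already in eval: no call
guard.should_switch_mode('train')  # eval -> train: train() fires
print(guard.eval_calls, guard.train_calls)
```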
So far this workaround has been working, but training still hits an NCCL ECC error after 74 steps. I am actively investigating, though I currently have no leads on a fix.
I'm sharing this in the hope of finding a better solution or gaining a deeper understanding of the problem.
I encountered the same issue and solved it by pinning `transformers==4.31.0`. However, I still do not understand what causes this error. Could anyone explain it? Thanks!
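If pinning the library is the workaround, a small runtime check can fail fast when an incompatible `transformers` version is installed. The helper below is hypothetical (not part of DeepSpeed-Chat or transformers), shown only to illustrate verifying the pin at startup:

```python
def parse_version(v: str) -> tuple:
    """Parse a dotted version string like '4.31.0' into an int tuple."""
    return tuple(int(part) for part in v.split(".")[:3])


def check_transformers_version(installed: str, required: str = "4.31.0") -> bool:
    """Return True only if the installed version exactly matches the pin."""
    return parse_version(installed) == parse_version(required)


# Example usage at startup (transformers.__version__ would supply `installed`):
print(check_transformers_version("4.31.0"))  # True
print(check_transformers_version("4.32.1"))  # False
```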
**Describe the bug**
When training DeepSpeed-Chat Step 3 with ZeRO-3 (without the hybrid engine), if we set `generation_batches >= 3`, or `generation_batches >= 2` and `ppo_epochs >= 2`, DeepSpeed will raise `assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()` during `generate_experience` in the second step.

**Log output**

**To Reproduce**
Run `DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/opt/single_node/run_1.3b.sh` with `ACTOR_ZERO_STAGE=3` and `generation_batches=3`, and disable the hybrid engine (or it will raise another error).

**ds_report output**

**System info:**
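For readers unfamiliar with the failing assertion: under ZeRO-3, each parameter is partitioned across ranks and carries a `ds_status` flag, and generation expects every parameter to already be in the all-gathered `AVAILABLE` state. The following is a simplified, standalone illustration of what the check enforces; the enum values and `FakeParam` class are stand-ins, not DeepSpeed's actual implementation:

```python
from enum import Enum


class ZeroParamStatus(Enum):
    # Simplified stand-ins for DeepSpeed's parameter-partitioning states.
    NOT_AVAILABLE = 1   # partitioned across ranks, not usable locally
    INFLIGHT = 2        # all-gather in progress
    AVAILABLE = 3       # fully gathered, safe to use in forward/generate


class FakeParam:
    """Illustrative stand-in for a ZeRO-3 partitioned parameter."""

    def __init__(self, status):
        self.ds_status = status

    def ds_summary(self):
        return {"ds_status": self.ds_status.name}


def use_in_generate(param):
    # Mirrors the check reported in the issue: generation assumes the
    # parameter has already been gathered; otherwise it raises with the
    # parameter's ds_summary() as the assertion message.
    assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
    return True
```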