microsoft / DeepSpeedExamples

Example models using DeepSpeed
Apache License 2.0

In step 3, I met an error when executing self.actor_model.eval() #593

Open ZJXNEFU opened 1 year ago

ZJXNEFU commented 1 year ago

Here is the error I hit. It seems like self._total_batch_size is None, but I don't know the reason:

  File "/path/model_training/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 434, in main
    out = trainer.generate_experience(batch_prompt['prompt'],
  File "/path/model_training/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 97, in generate_experience
    self.eval()
  File "/path/model_training/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 237, in eval
    self.actor_model.eval()
  File "/root/miniconda3/envs/dschat/lib/python3.9/site-packages/deepspeed/runtime/hybrid_engine.py", line 379, in eval
    f'|CurSamplesPerSec={(1 / latency * self._total_batch_size):.2f} ' + \
TypeError: unsupported operand type(s) for *: 'float' and 'NoneType'
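The traceback boils down to multiplying a float by None. A minimal, self-contained reproduction (the latency value is made up for illustration):

```python
# Minimal reproduction of the TypeError above: self._total_batch_size is
# None, so the throughput expression multiplies a float by None.
latency = 0.5            # made-up value; stands in for the measured latency
total_batch_size = None  # never set, as in the report

try:
    msg = f'|CurSamplesPerSec={(1 / latency * total_batch_size):.2f} '
except TypeError as e:
    print(e)  # unsupported operand type(s) for *: 'float' and 'NoneType'
```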
lekurile commented 1 year ago

Hi @ZJXNEFU,

Can you please provide the reproduction command you ran for Step 3 training (training script, actor/critic models, num GPUs, zero stage, etc)?

I ran the following Step 3 training command on the latest versions of DeepSpeed and DeepSpeedExamples:

dse/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning$ bash training_scripts/single_node/run_1.3b.sh AdamG012/chat-opt-1.3b-sft-deepspeed AdamG012/chat-opt-350m-reward-deepspeed 2 2

Unfortunately, I was unable to reproduce the issue.

One thing to check is whether self._total_batch_size is being set correctly in the generate() function, using pdb or another debugger: https://github.com/microsoft/DeepSpeed/blob/d24629f4fdaaa92df068de24f926d341f129112c/deepspeed/runtime/hybrid_engine.py#L178C65-L178C65

Another thing is to make sure you're not entering this condition in the eval() function before self._total_batch_size is set: https://github.com/microsoft/DeepSpeed/blob/d24629f4fdaaa92df068de24f926d341f129112c/deepspeed/runtime/hybrid_engine.py#L383

Every time we enter the generate_experience() function in ppo_trainer.py, we first call self.eval() followed by self._generate_sequence(). In the first pass of eval(), self._t_start should be None so we don't attempt to print metrics, which is where your error is happening.
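The first-pass behavior described above can be sketched like this (simplified stand-in names, not the real hybrid_engine.py code):

```python
# Simplified sketch of the eval() flow: on the first call, _t_start is None,
# so no metrics are printed; subsequent calls print throughput and reset the
# timer. This assumes _total_batch_size was already set by generate().
import time

class Trainer:
    def __init__(self):
        self._t_start = None
        self._total_batch_size = 8  # assume generate() already set it

    def eval(self):
        if self._t_start is not None:
            latency = time.perf_counter() - self._t_start
            print(f'|CurSamplesPerSec={(1 / latency * self._total_batch_size):.2f}')
        self._t_start = time.perf_counter()

t = Trainer()
t.eval()  # first pass: _t_start is None, so nothing is printed
t.eval()  # second pass: prints a throughput line
```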

Thanks, Lev

senthilps8 commented 1 year ago

@lekurile I faced the same issue using codegen models. I have narrowed down the problem. This if condition is always False for codegen models: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/hybrid_engine.py#L357-L359 Since there is no predefined inference policy for CodeGen models, self._inference_containers is always empty. Is there a workaround for this?
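The gating described here can be sketched like this (the class and names are simplified stand-ins for the hybrid_engine.py internals, not the real API):

```python
# Simplified stand-in for the hybrid_engine.py logic: generate() is only
# swapped out (and _total_batch_size only set) when at least one inference
# container was built, which requires a predefined policy for the model.
class FakeHybridEngine:
    def __init__(self, has_policy):
        # stays empty when no inference policy matches the model (e.g. CodeGen)
        self._inference_containers = ['container'] if has_policy else []
        self._total_batch_size = None

    def generate(self, batch_size):
        if len(self._inference_containers) > 0:
            self._total_batch_size = batch_size  # set only on the swapped path
        # otherwise the original generate() runs and the field stays None

opt = FakeHybridEngine(has_policy=True)
opt.generate(8)
print(opt._total_batch_size)       # 8

codegen = FakeHybridEngine(has_policy=False)
codegen.generate(8)
print(codegen._total_batch_size)   # None -> later float * None TypeError
```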

lekurile commented 1 year ago

Hi @senthilps8,

Thank you for providing feedback about codegen models having issues.

I'd like to reproduce this on my end. Can you please provide an example of a Step 3 run you did with the actor/critic models and any other arguments passed to the command?

Is there a workaround for this?

I think you can try disabling the Hybrid Engine in the meantime to get around this. Please let me know if this works.
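In the DeepSpeed-Chat step 3 run scripts, the Hybrid Engine is opt-in via an --enable_hybrid_engine flag, so disabling it should amount to dropping that flag from the deepspeed invocation. A sketch of the edit (the other argument names below are placeholders, not the script's real flags):

```shell
# Illustrative only: strip --enable_hybrid_engine from a step 3 command line.
cmd="deepspeed main.py --actor_model a --critic_model b --enable_hybrid_engine"
patched=$(echo "$cmd" | sed 's/ --enable_hybrid_engine//')
echo "$patched"   # same command without the Hybrid Engine flag
```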

Thanks, Lev

robotsp commented 1 year ago

Same issue as https://github.com/microsoft/DeepSpeedExamples/issues/544

@lekurile @senthilps8 could you please have a look at this? Thanks!

devinzhang91 commented 1 year ago

Maybe self._total_batch_size has not been initialized in def generate(self, *inputs, **kwargs):. So I just disabled the printing code that references self._total_batch_size, and it works: https://github.com/microsoft/DeepSpeed/blob/v0.10.1/deepspeed/runtime/hybrid_engine.py#L394

lekurile commented 9 months ago

Hi @ZJXNEFU,

I believe this is due to the generate function not being replaced when there isn't a corresponding inference policy for the model you're using: https://github.com/microsoft/DeepSpeed/blob/93a81b5362a83bacd7b40c838295909f347e37af/deepspeed/runtime/hybrid_engine.py#L359

The generate() function is what sets self._total_batch_size if it doesn't exist: https://github.com/microsoft/DeepSpeed/blob/93a81b5362a83bacd7b40c838295909f347e37af/deepspeed/runtime/hybrid_engine.py#L175

I created a PR that conditionally constructs the metrics: https://github.com/microsoft/DeepSpeed/pull/4789
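The direction of that fix can be sketched as building each metric string only when its inputs exist (a hedged illustration of the approach, not the PR's exact code):

```python
def build_eval_metrics(latency, total_batch_size):
    """Build metric strings, skipping any whose inputs are missing."""
    parts = []
    if latency is not None:
        parts.append(f'|E2E latency={latency:.2f}s')
        if total_batch_size is not None:
            parts.append(f'|CurSamplesPerSec={(1 / latency * total_batch_size):.2f}')
    return ' '.join(parts)

print(build_eval_metrics(0.5, None))  # only the latency metric, no crash
print(build_eval_metrics(0.5, 8))     # both metrics
```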

Thanks, Lev