ZJXNEFU opened this issue 1 year ago
Hi @ZJXNEFU,
Can you please provide the reproduction command you ran for Step 3 training (training script, actor/critic models, num GPUs, zero stage, etc)?
I ran the following Step 3 training command on the latest versions of DeepSpeed and DeepSpeedExamples:
dse/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning$ bash training_scripts/single_node/run_1.3b.sh AdamG012/chat-opt-1.3b-sft-deepspeed AdamG012/chat-opt-350m-reward-deepspeed 2 2
Unfortunately, I was unable to reproduce the issue.
One thing to potentially check is that `self._total_batch_size` is being set correctly in the `generate()` function, using `pdb` or another debugger:
https://github.com/microsoft/DeepSpeed/blob/d24629f4fdaaa92df068de24f926d341f129112c/deepspeed/runtime/hybrid_engine.py#L178C65-L178C65
Another thing is to make sure you're not entering this condition in the `eval()` function before `self._total_batch_size` is set:
https://github.com/microsoft/DeepSpeed/blob/d24629f4fdaaa92df068de24f926d341f129112c/deepspeed/runtime/hybrid_engine.py#L383
Every time we enter the `generate_experience()` function in `ppo_trainer.py`, we first call `self.eval()` followed by `self._generate_sequence()`. In the first pass of `eval()`, `self._t_start` should be `None`, so we don't attempt to print metrics, which is where your error is happening.
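The flow described above can be modeled with a minimal toy sketch (illustrative only, not DeepSpeed's actual `HybridEngine` implementation; names mirror the attributes discussed in this thread):

```python
import time

class ToyHybridEngine:
    """Toy model of the eval()/generate() timing guard described above."""

    def __init__(self):
        self._t_start = None          # unset until the first generate() call
        self._total_batch_size = None

    def generate(self, batch_size):
        # generate() records the batch size and a start timestamp
        self._total_batch_size = batch_size
        self._t_start = time.time()

    def eval(self):
        # First pass: _t_start is still None, so metrics printing is skipped.
        if self._t_start is not None:
            elapsed = time.time() - self._t_start
            return f"generate time: {elapsed:.2f}s, batch size: {self._total_batch_size}"
        return None
```

If `generate()` is never invoked (or never swapped in for the model), every later `eval()` that assumes the counters exist will fail, which matches the error reported here.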
Thanks, Lev
@lekurile I faced the same issue with CodeGen models.
I have narrowed down the problem: this `if` condition is always `False` for CodeGen models: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/hybrid_engine.py#L357-L359
Since there is no predefined inference policy for CodeGen models, `self._inference_containers` is always empty. Is there a workaround for this?
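For context, a minimal sketch of the skipped branch (hypothetical function name; the real check lives at the lines linked above):

```python
def should_install_hybrid_generate(inference_containers):
    # With no inference policy registered for the architecture (the CodeGen
    # case), the container list stays empty and this returns False, so the
    # hybrid generate() is never swapped in and _total_batch_size is never set.
    return len(inference_containers) > 0
```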
Hi @senthilps8,
Thank you for providing feedback about codegen models having issues.
I'd like to reproduce this on my end. Can you please provide an example of a Step 3 run you did with the actor/critic models and any other arguments passed to the command?
> Is there a workaround for this?
I think you can try disabling the Hybrid Engine in the meantime to get around this. Please let me know if this works.
Thanks, Lev
This is the same issue as https://github.com/microsoft/DeepSpeedExamples/issues/544.
@lekurile @senthilps8, could you please take a look? Thanks!
Maybe `self._total_batch_size` has not been initialized in the `def generate(self, *inputs, **kwargs):` function.
So I just disabled the printing code that references `self._total_batch_size`, and it works:
https://github.com/microsoft/DeepSpeed/blob/v0.10.1/deepspeed/runtime/hybrid_engine.py#L394
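A defensive version of that workaround could look like this (a sketch only; attribute names follow the linked source, but this helper is hypothetical):

```python
def format_generate_latency(total_batch_size, generate_time):
    # Guard against the crash described above: if generate() never ran (e.g.
    # no inference policy exists for the model), the counters are still None,
    # so skip the metrics line instead of raising.
    if total_batch_size is None or generate_time is None:
        return None
    return f"generation batch size: {total_batch_size}, time: {generate_time:.2f}s"
```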
Hi @ZJXNEFU,
I believe this is due to the `generate()` function not being replaced when there isn't a corresponding inference policy for the model you're using: https://github.com/microsoft/DeepSpeed/blob/93a81b5362a83bacd7b40c838295909f347e37af/deepspeed/runtime/hybrid_engine.py#L359
The `generate()` function is what sets `self._total_batch_size` if it doesn't exist:
https://github.com/microsoft/DeepSpeed/blob/93a81b5362a83bacd7b40c838295909f347e37af/deepspeed/runtime/hybrid_engine.py#L175
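Illustratively, the lazy-initialization pattern at the linked line looks roughly like this (toy code, not the actual source):

```python
class Engine:
    def generate(self, *inputs, **kwargs):
        # Lazily create the counter on first call. If this generate() is never
        # installed (because no inference policy matched the model), the
        # attribute never exists, and metrics code that later reads it fails.
        if not hasattr(self, "_total_batch_size"):
            self._total_batch_size = 0
        self._total_batch_size += len(inputs)
```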
Created a PR that conditionally constructs the metrics: https://github.com/microsoft/DeepSpeed/pull/4789
Thanks, Lev
Here is the error I hit. It seems like `self._total_batch_size` is `None`, but I don't know why.