Open haolin-nju opened 1 year ago
Hi, have you solved this problem?
I'm sorry to say I have abandoned DeepSpeed-Chat for RLHF until they solve this issue. inference_tp_size > 1 is a must if I want to try a generation model with a larger parameter count.
The problem went away for me after switching DeepSpeed to 0.8.3+b528f50e.
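In case it helps others, pinning DeepSpeed to a specific commit like that can be done with pip's VCS install syntax (a sketch only; I'm assuming the commit hash from the version string above refers to the upstream microsoft/DeepSpeed repo):

```shell
# Uninstall the current build, then install DeepSpeed at the pinned commit
pip uninstall -y deepspeed
pip install git+https://github.com/microsoft/DeepSpeed.git@b528f50e
```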
I encountered the same problem. Any suggestions on alternative of DeepSpeed-Chat for RLHF training?
I have found that there is a candidate pull request to address this issue. Perhaps you could give it a try?
After cherry-picking this commit, another error occurred: #4469.
BTW, --unpin_actor_parameters can also work around the current issue.
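For anyone else hitting this, a sketch of how that workaround flag would be added to the existing step-3 launch command (assuming the rest of the script is unchanged; model paths here are placeholders):

```shell
# Same step-3 launch as before, with --unpin_actor_parameters appended.
# Per this thread, leaving actor parameters unpinned avoids the tensor
# size mismatch, possibly at some cost to generation speed.
deepspeed main.py \
   --actor_model_name_or_path facebook/opt-13b \
   --critic_model_name_or_path facebook/opt-350m \
   --actor_zero_stage 3 \
   --enable_hybrid_engine \
   --inference_tp_size 2 \
   --unpin_actor_parameters
```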
Description: In DeepSpeed-Chat step 3, a runtime error `The size of tensor a (4) must match the size of tensor b (8) at non-singleton dimension 0` is thrown when inference_tp_size > 1 and the hybrid engine is enabled. I encountered this bug with the provided 13b training scripts but not with the 1.3b ones. The main differences between the two are that the 13b scripts set the zero stage to 3 and inference_tp_size to a value larger than 1, while the 1.3b scripts set the zero stage to 2 and leave inference_tp_size at its default.
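To make the triggering configuration concrete, here is a minimal sketch of the step-3 launch flags involved (flag names follow DeepSpeed-Chat's step-3 arguments; model names and paths are placeholders, not my exact script):

```shell
# Hypothetical step-3 launch. The combination of ZeRO stage 3,
# the hybrid engine, and inference_tp_size > 1 triggers the
# tensor size mismatch; the 1.3b scripts use stage 2 and tp 1.
deepspeed main.py \
   --actor_model_name_or_path facebook/opt-13b \
   --critic_model_name_or_path facebook/opt-350m \
   --actor_zero_stage 3 \
   --critic_zero_stage 3 \
   --enable_hybrid_engine \
   --inference_tp_size 2 \
   --output_dir ./output
```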
The exception looks like:
Environment: torch 1.12, deepspeed 0.10.0+d6f62217, DeepSpeedExamples f9c3ae05, transformers 4.30.0
Minimum bug reproduction for me: opt-1.3b + opt-350m; GPU: 8*40G A100
script:
It would be of great help if anyone could help me check and fix this bug.