Open EikeKohl opened 1 year ago
@EikeKohl have you solved this problem? I am getting the same error.
Hey @XiaoLaoDi not yet, but here is what I tried so far:
As for the CUDA setup, I tried
However, the error remains...
@XiaoLaoDi What does your setup look like? Maybe we can identify similarities and possible problem areas.
Maybe see this issue: https://github.com/microsoft/DeepSpeedExamples/issues/335#issuecomment-1521105300
@ruihan0495 thank you for the info. Not using DeepSpeed-HE does indeed make training possible 🙂. I run into another exception a little later in the code, but that is probably due to poor model quality. I am currently working on fixing that issue as well.
@EikeKohl I got the same error when I trained GPT2, and I also used an AWS EC2 instance. Did you figure out what the problem is?
@DehongXu tbh I haven't used DeepSpeed RLHF in a while, but I remember that there was a known issue with the hybrid engine that was supposed to be fixed in upcoming updates. Disabling the hybrid engine made it work for me at that time.
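For anyone who wants to try the same workaround, here is a minimal sketch of dropping the hybrid engine switch from a launch command. It assumes the training script is started with an `--enable_hybrid_engine` flag, as I recall from the DeepSpeed-Chat step 3 example scripts; please check your own script for the exact name. The helper `without_hybrid_engine` and the launch command are hypothetical placeholders, not part of DeepSpeed.

```python
import shlex

# Assumption: this is the flag name your launch script uses to turn on
# the hybrid engine. Verify it against your own training script.
HYBRID_ENGINE_FLAG = "--enable_hybrid_engine"


def without_hybrid_engine(args):
    """Return a copy of the argument list with the hybrid engine flag removed."""
    return [arg for arg in args if arg != HYBRID_ENGINE_FLAG]


# Hypothetical launch command; the script name and options are placeholders.
command = "deepspeed main.py --enable_hybrid_engine --output_dir ./output"
args = without_hybrid_engine(shlex.split(command))

# Print the command that would be run without the hybrid engine.
print(shlex.join(args))
```

In a plain bash script the equivalent is simply deleting that flag from the launch line.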
I am trying to run RLHF with my previously trained actor and reward models. However, I encounter the following exception:
I then set `TORCH_USE_CUDA_DSA=1` and `CUDA_LAUNCH_BLOCKING=1` for better debugging. This results in the following, more elaborate error message:

This is my bash script: