Closed nomadlx closed 1 month ago
Hi @nomadlx, DeepSpeed reads .deepspeed_env when it launches processes. As long as you see outputs in the log, the processes should have been launched successfully. It is unlikely that DeepSpeed's handling of .deepspeed_env caused the error.
I suggest checking how your application code progresses. The change in PYTORCH_CUDA_ALLOC_CONF might affect it. A simple reproduction would be helpful for us if you still see suspicious behavior in DeepSpeed.
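One way to check whether the variable actually reaches each training process is to log it at the top of the training script, before any CUDA memory is allocated. The following is a minimal sketch, not part of DeepSpeed: the helper name describe_alloc_conf is hypothetical, and it assumes the launcher exposes RANK or LOCAL_RANK in each worker's environment.

```python
import os


def describe_alloc_conf(env=None):
    """Return a one-line summary of PYTORCH_CUDA_ALLOC_CONF for this process.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    # Assumption: the launcher sets RANK or LOCAL_RANK for each worker process.
    rank = env.get("RANK", env.get("LOCAL_RANK", "?"))
    value = env.get("PYTORCH_CUDA_ALLOC_CONF")
    if value is None:
        return f"[rank {rank}] PYTORCH_CUDA_ALLOC_CONF is not set"
    return f"[rank {rank}] PYTORCH_CUDA_ALLOC_CONF={value}"


if __name__ == "__main__":
    # Print early in the training script so the value shows up in each worker's log.
    print(describe_alloc_conf(), flush=True)
```

If every rank prints the expected value, the setting was propagated and the failure is more likely to come from the allocator setting itself or from the application code.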
Since we haven’t received any additional information, we’re closing this issue for now. Please feel free to reopen it if you have more details to share.
Describe the bug
At first I ran into a GPU OOM error, which reported:
So I tried setting PYTORCH_CUDA_ALLOC_CONF when running the deepspeed command, but it did not seem to take effect. I then added it to the .deepspeed_env file, which raised this error:
Expected behavior
I want to do full training of a 70B-parameter LLM on 8 A800 (80 GB) GPUs. In theory this should fit with bf16 (70 x 2 x 4 = 560 < 80 x 8 = 640).
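The back-of-the-envelope estimate above can be written out as a short calculation. This is only a sketch of the poster's arithmetic: the function name and the reading of the factor 4 as "four model-sized bf16 tensors" are assumptions, and the estimate leaves out activations, communication buffers, and allocator fragmentation.

```python
def estimate_memory_gb(params_billion, bytes_per_value, copies):
    """Rough memory need in GB: 1e9 parameters at 1 byte each is about 1 GB."""
    return params_billion * bytes_per_value * copies


# Numbers taken from the issue text: 70B parameters, bf16 (2 bytes), factor 4.
# Assumption: the factor 4 stands for four model-sized tensors kept in bf16.
needed_gb = estimate_memory_gb(70, 2, 4)

# 8 GPUs with 80 GB each.
available_gb = 8 * 80

print(f"needed ~{needed_gb} GB, available {available_gb} GB, "
      f"headroom {available_gb - needed_gb} GB")
```

The estimate leaves about 80 GB of total headroom, roughly 10 GB per GPU, so anything not counted in the formula has to fit in that margin.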
ds_report output