Open ethansmith2000 opened 7 months ago
I also encountered this bug when using deepspeed==0.13.1 in Huggingface Transformers trainer. However, after I upgraded deepspeed to 0.13.5, the bug disappeared. So, maybe you can try
pip3 install deepspeed==0.13.5
I still encountered this bug with 0.13.5 and 0.14.0, even after switching to other machines.
I also encountered the error "assert all_groups_norm > 0", does anybody know how to solve it?
I think the error suggests vanishing gradient, but it's strange that I don't see it when using fp16 or full precision
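One way to test the vanishing-gradient theory is to log the global gradient norm yourself right after `backward()`, before DeepSpeed's optimizer step runs its check. A minimal sketch in plain PyTorch (the toy `Linear` model stands in for the real one):

```python
import torch

def global_grad_norm(model):
    """Global L2 norm over all parameter gradients, accumulated in fp32.

    DeepSpeed's bf16 optimizer asserts this norm is > 0 before stepping,
    so logging it each step pinpoints where it collapses to 0.0 (or NaN).
    """
    total_sq = 0.0
    for p in model.parameters():
        if p.grad is not None:
            # Cast to fp32 so tiny bf16 gradients don't round the sum to zero.
            total_sq += p.grad.detach().float().pow(2).sum().item()
    return total_sq ** 0.5

# Toy usage: in a real loop, call this right after loss.backward().
model = torch.nn.Linear(4, 2)
model(torch.randn(3, 4)).sum().backward()
print(global_grad_norm(model))  # a healthy step gives a value > 0
```

If the logged value trends toward zero over hundreds of steps in bf16 but not in fp16/fp32, that would support the vanishing-gradient reading rather than a data or checkpoint problem.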
I encountered the same error "assert all_groups_norm > 0", does anyone have a solution?
any solution?
I encountered the same issue. Is there any solution?
.../deepspeed/runtime/bf16_optimizer.py", line 312, in step
rank0: assert all_groups_norm > 0.
deepspeed 0.15.0 transformers 4.44.2
I resolved this issue. In my case, the cause was not related to the versions of Deepspeed, Transformers, or other dependencies. The problem was the model checkpoint for "clip-vit-large-patch14", which seemed to be corrupted, though I'm not sure why. After re-downloading it from Huggingface, the issue was resolved.
The above issue occurred when I used Deepspeed with zero0.json. With zero1 or zero2, the loss became zero and grad_norm was NaN instead.
Now, all issues are resolved after re-downloading the CLIP model.
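If checkpoint corruption is the suspect, one quick sanity check before training is to scan the loaded state dict for NaN/Inf or all-zero tensors. A sketch (the filename in the comment and the toy dict are illustrative, not from the thread):

```python
import torch

def find_bad_tensors(state_dict):
    """Return names of float tensors containing NaN/Inf or that are all zero.

    A corrupted checkpoint can silently produce degenerate weights, zero
    activations, and hence a zero gradient norm at optimizer time.
    """
    bad = []
    for name, t in state_dict.items():
        if not torch.is_tensor(t) or not t.is_floating_point():
            continue  # skip non-tensor entries and integer buffers
        if torch.isnan(t).any() or torch.isinf(t).any() or not t.any():
            bad.append(name)
    return bad

# Toy demo; for a real checkpoint use something like:
#   find_bad_tensors(torch.load("pytorch_model.bin", map_location="cpu"))
sd = {"ok": torch.randn(3), "zeros": torch.zeros(3),
      "nan": torch.tensor([float("nan")])}
print(find_bad_tensors(sd))  # ['zeros', 'nan']
```

An empty result doesn't prove the checkpoint is healthy, but a non-empty one is a strong hint to re-download before blaming DeepSpeed versions.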
What's your weight_decay setting? 1e-2 can be too large for certain tasks, especially in highly unbalanced classification tasks.
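For a sense of scale: with decoupled (AdamW-style) weight decay, each step multiplies the weights by roughly (1 - lr * wd), and the shrinkage compounds over a run. A small illustration using the lr and weight_decay values from the config posted in this thread (whether this matters is, of course, task-dependent):

```python
# Decoupled weight decay shrinks weights by ~(1 - lr * wd) per step.
# lr and wd taken from the config in this thread as an illustration.
lr, wd = 1.0e-4, 0.05
for steps in (1_000, 10_000, 100_000):
    factor = (1 - lr * wd) ** steps
    print(f"after {steps:>7} steps, weight scale ~ {factor:.4f}")
```

Over 100k steps this works out to roughly a 40% reduction in weight magnitude from decay alone, which can push small gradients further toward zero on tasks with weak learning signal.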
Describe the bug
Many of my training runs with bf16=True in the config crash after a few hundred steps with the assertion error:
"assert all_groups_norm > 0"
To Reproduce
Steps to reproduce the behavior: I am not entirely sure what causes it, aside from having bf16 enabled. Here is the config I have been using:

deepspeed:
  gradient_accumulation_steps: 1
  steps_per_print: 2000
  optimizer:
    type: "Adam"
    params:
      lr: 1.0e-4
      betas: [0.9, 0.985]
      eps: 1.0e-8
      weight_decay: 0.05
Expected behavior
Not a crash, I think, though I'm sure the assertion is there for a reason.
ds_report output
Please run ds_report to give us details about your setup.

Screenshots
If applicable, add screenshots to help explain your problem.
System info (please complete the following information):
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else? No, using srun torchrun train.py --deepspeed
Docker context Are you using a specific docker image that you can share?
N/a
Additional context
Only happens with some of the models I've trained.