microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] assert all_groups_norm > 0 | Error related to Bf16 optimizer it seems #5223

Open ethansmith2000 opened 7 months ago

ethansmith2000 commented 7 months ago

Describe the bug

Many of my training runs, when the config has bf16 enabled, crash after around a few hundred steps with the assertion error:

"assert all_groups_norm > 0"

To Reproduce: I am not entirely sure what causes it aside from having bf16 enabled sometimes. Here is the config I have been using:

deepspeed:
    gradient_accumulation_steps: 1
    steps_per_print: 2000

    optimizer:
        type: "Adam"
        params:
            lr: 1.0e-4
            betas: [0.9, 0.985]
            eps: 1.0e-8
            weight_decay: 0.05

    scheduler:
        type: "WarmupDecayLR"
        params:
            warmup_min_lr: 0
            warmup_max_lr: ${deepspeed.optimizer.params.lr}
            warmup_num_steps: 250
            warmup_type: "linear"
            total_num_steps: 20000

    gradient_clipping: 1.0
    prescale_gradients: False

    bf16:
        enabled: True

    wall_clock_breakdown: True

    zero_optimization:
        stage: 0
        allgather_partitions: True
        allgather_bucket_size: 2e8
        overlap_comm: True
        reduce_scatter: True
        reduce_bucket_size: 2e8
        contiguous_gradients: True
        zero_quantized_nontrainable_weights: False

    flops_profiler:
        enabled: False
        profile_step: 1
        module_depth: -1
        top_modules: 1
        detailed: True
        output_file: null

    activation_checkpointing:
        partition_activation: False  # Enables partition activation when used with model parallelism
        cpu_checkpointing: False
        contiguous_memory_optimization: False
        number_checkpoints: None
        synchronize_checkpoint_boundary: False
        profile: False

    comms_logger:
        enabled: True
        verbose: False
        prof_all: True
        debug: False

Expected behavior

Not to crash, I think, though I'm sure the assertion is there for a reason.

ds_report output

Screenshot 2024-03-04 at 8 52 44 PM

Screenshots

Screenshot 2024-03-04 at 8 48 05 PM

System info:

Launcher context

Not the deepspeed launcher; I'm launching with srun torchrun train.py --deepspeed.
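For reference, here is a rough sketch of how a train.py launched this way might hand the deepspeed section of the YAML above to DeepSpeed. The config path, the use of OmegaConf (suggested by the ${...} interpolation in the posted config), and the build_engine function are assumptions, not the actual training script.

# Hypothetical sketch of how a train.py launched with torchrun might pass the
# deepspeed section of the YAML config to DeepSpeed.
import deepspeed
import torch
from omegaconf import OmegaConf

def build_engine(model: torch.nn.Module, config_path: str = "config.yaml"):
    # The ${deepspeed.optimizer.params.lr} interpolation in the posted config
    # suggests OmegaConf/Hydra, so resolve interpolations before handing the
    # section to DeepSpeed as a plain dict.
    full_cfg = OmegaConf.load(config_path)
    ds_cfg = OmegaConf.to_container(full_cfg.deepspeed, resolve=True)

    # deepspeed.initialize accepts a dict (or a path to a JSON file) as config.
    engine, optimizer, _, scheduler = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_cfg,
    )
    return engine, optimizer, scheduler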

Docker context

N/A

Additional context

It only happens with some of the models I've trained.

zhenyuhe00 commented 7 months ago

I also encountered this bug when using deepspeed==0.13.1 with the Hugging Face Transformers Trainer. However, after I upgraded DeepSpeed to 0.13.5, the bug disappeared, so maybe you can try:

pip3 install deepspeed==0.13.5
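If you try pinning a version, it may also be worth confirming which DeepSpeed the launched processes actually import (clusters often have several environments); something like:

# Sanity check: print the DeepSpeed version the running process actually imports.
import deepspeed
print(deepspeed.__version__)  # expecting e.g. 0.13.5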
zhenyuhe00 commented 6 months ago


I still encountered this bug with 0.13.5 and 0.14.0 after switching to other machines.

vividfree commented 5 months ago

I also encountered the error "assert all_groups_norm > 0". Does anybody know how to solve it?

ethansmith2000 commented 5 months ago

I think the error suggests vanishing gradients, but it's strange that I don't see it when using fp16 or full precision.
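One way to test the vanishing-gradient theory is to log the global gradient norm yourself each step, right before the optimizer step, and see whether it hits exactly 0.0 just before the crash. A rough sketch, assuming the gradients are still attached to the parameters when it is called (with the BF16 optimizer they may instead live in DeepSpeed's internal fp32 buffers):

import torch

def log_global_grad_norm(model: torch.nn.Module, step: int) -> float:
    # Accumulate the squared L2 norm of every gradient in fp32 on the CPU,
    # regardless of the bf16 compute dtype.
    total_sq = torch.zeros((), dtype=torch.float32)
    for p in model.parameters():
        if p.grad is not None:
            total_sq += p.grad.detach().float().norm(2).cpu() ** 2
    norm = total_sq.sqrt().item()
    if norm == 0.0 or norm != norm:  # exactly zero, or NaN
        print(f"step {step}: suspicious global grad norm = {norm}")
    return norm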

Zth9730 commented 3 months ago

I encountered the same error "assert all_groups_norm > 0". Does anyone have a solution?

liuxingbin commented 2 months ago

Any solution?

shachoi commented 1 month ago

I encountered the same issue. Is there any solution?

... /deepspeed/runtime/bf16_optimizer.py", line 312, in step
    rank0: assert all_groups_norm > 0.

deepspeed 0.15.0, transformers 4.44.2

shachoi commented 1 month ago


I resolved this issue. In my case, the cause was not related to the versions of DeepSpeed, Transformers, or other dependencies. The problem was the model checkpoint for "clip-vit-large-patch14", which seemed to be corrupted, though I'm not sure why. After re-downloading it from Hugging Face, the issue was clearly resolved.

The issue above occurred when I used DeepSpeed with zero0.json. When I used zero1 or zero2, the loss was zero and grad_norm was NaN instead.

Now, all issues are resolved after re-downloading the CLIP model.
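For anyone else hitting the corrupted-checkpoint variant of this: one way to force a clean copy is to re-download the snapshot while bypassing the local cache, for example with huggingface_hub (the openai/clip-vit-large-patch14 repo id below is assumed from the comment above):

# Force a fresh download of the CLIP checkpoint, bypassing any possibly
# corrupted files in the local cache.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="openai/clip-vit-large-patch14",  # repo id assumed from the comment above
    force_download=True,                      # ignore the cache and re-fetch everything
)
print("re-downloaded to", local_path)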

ZhiyuanChen commented 1 month ago

What's your weight_decay setting? 1e-2 can be too large for certain tasks, especially in highly unbalanced classification tasks.
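If weight decay turns out to be the culprit, it can be lowered directly in the optimizer section of the DeepSpeed config. A hypothetical example mirroring the config from the report, with only the weight_decay value changed:

# Hypothetical optimizer section with a smaller weight decay; everything else
# from the config in the report stays unchanged. The 1.0e-3 value is only an
# example, not a recommendation for any particular model.
ds_optimizer_cfg = {
    "type": "Adam",
    "params": {
        "lr": 1.0e-4,
        "betas": [0.9, 0.985],
        "eps": 1.0e-8,
        "weight_decay": 1.0e-3,  # the report uses 0.05
    },
}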