microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] assert all_groups_norm > 0 | Error related to Bf16 optimizer it seems #5223

Open ethansmith2000 opened 7 months ago

ethansmith2000 commented 7 months ago

Describe the bug

Many of my training runs, when the config has bf16 enabled, crash after around a few hundred steps with the assertion error:

"assert all_groups_norm > 0"

To Reproduce: I am not entirely sure what causes it aside from having bf16 enabled sometimes. Here is the config I have been using:

deepspeed:
    gradient_accumulation_steps: 1
    steps_per_print: 2000

    optimizer:
        type: "Adam"
        params:
            lr: 1.0e-4
            betas: [0.9, 0.985]
            eps: 1.0e-8
            weight_decay: 0.05

    scheduler:
        type: "WarmupDecayLR"
        params:
            warmup_min_lr: 0
            warmup_max_lr: ${deepspeed.optimizer.params.lr}
            warmup_num_steps: 250
            warmup_type: "linear"
            total_num_steps: 20000

    gradient_clipping: 1.0
    prescale_gradients: False

    bf16:
        enabled: True

    wall_clock_breakdown: True

    zero_optimization:
        stage: 0
        allgather_partitions: True
        allgather_bucket_size: 2e8
        overlap_comm: True
        reduce_scatter: True
        reduce_bucket_size: 2e8
        contiguous_gradients: True
        zero_quantized_nontrainable_weights: False

    flops_profiler:
        enabled: False
        profile_step: 1
        module_depth: -1
        top_modules: 1
        detailed: True
        output_file: null

    activation_checkpointing:
        partition_activation: False  # Enables partition activation when used with model parallelism
        cpu_checkpointing: False
        contiguous_memory_optimization: False
        number_checkpoints: None
        synchronize_checkpoint_boundary: False
        profile: False

    comms_logger:
        enabled: True
        verbose: False
        prof_all: True
        debug: False

Expected behavior

Not to crash, I think, though I'm sure the assertion is there for a reason.

ds_report output

Screenshot 2024-03-04 at 8 52 44 PM

Screenshots

Screenshot 2024-03-04 at 8 48 05 PM

System info:

Launcher context

Not the deepspeed launcher; I'm launching with srun torchrun train.py --deepspeed.
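For reference, here is a rough sketch of how a train.py launched this way might hand the deepspeed section of the YAML above to DeepSpeed. The config path, the use of OmegaConf (suggested by the ${...} interpolation in the posted config), and the build_engine function are assumptions, not the actual training script.

# Hypothetical sketch of how a train.py launched with torchrun might pass the
# deepspeed section of the YAML config to DeepSpeed.
import deepspeed
import torch
from omegaconf import OmegaConf

def build_engine(model: torch.nn.Module, config_path: str = "config.yaml"):
    # The ${deepspeed.optimizer.params.lr} interpolation in the posted config
    # suggests OmegaConf/Hydra, so resolve interpolations before handing the
    # section to DeepSpeed as a plain dict.
    full_cfg = OmegaConf.load(config_path)
    ds_cfg = OmegaConf.to_container(full_cfg.deepspeed, resolve=True)

    # deepspeed.initialize accepts a dict (or a path to a JSON file) as config.
    engine, optimizer, _, scheduler = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_cfg,
    )
    return engine, optimizer, scheduler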

Docker context

N/A

Additional context

It only happens with some of the models I've trained.

zhenyuhe00 commented 7 months ago

I also encountered this bug when using deepspeed==0.13.1 with the Hugging Face Transformers Trainer. However, after I upgraded DeepSpeed to 0.13.5, the bug disappeared, so maybe you can try:

pip3 install deepspeed==0.13.5
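If you try pinning a version, it may also be worth confirming which DeepSpeed the launched processes actually import (clusters often have several environments); something like:

# Sanity check: print the DeepSpeed version the running process actually imports.
import deepspeed
print(deepspeed.__version__)  # expecting e.g. 0.13.5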
zhenyuhe00 commented 6 months ago


I still encountered this bug with 0.13.5 and 0.14.0 after switching to other machines.

vividfree commented 5 months ago

I also encountered the error "assert all_groups_norm > 0". Does anybody know how to solve it?

ethansmith2000 commented 5 months ago

I think the error suggests vanishing gradients, but it's strange that I don't see it when using fp16 or full precision.
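One way to test the vanishing-gradient theory is to log the global gradient norm yourself each step, right before the optimizer step, and see whether it hits exactly 0.0 just before the crash. A rough sketch, assuming the gradients are still attached to the parameters when it is called (with the BF16 optimizer they may instead live in DeepSpeed's internal fp32 buffers):

import torch

def log_global_grad_norm(model: torch.nn.Module, step: int) -> float:
    # Accumulate the squared L2 norm of every gradient in fp32 on the CPU,
    # regardless of the bf16 compute dtype.
    total_sq = torch.zeros((), dtype=torch.float32)
    for p in model.parameters():
        if p.grad is not None:
            total_sq += p.grad.detach().float().norm(2).cpu() ** 2
    norm = total_sq.sqrt().item()
    if norm == 0.0 or norm != norm:  # exactly zero, or NaN
        print(f"step {step}: suspicious global grad norm = {norm}")
    return norm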

Zth9730 commented 3 months ago

I encountered the same error "assert all_groups_norm > 0". Does anyone have a solution?

liuxingbin commented 2 months ago

Any solution?

shachoi commented 1 month ago

I encountered the same issue. Is there any solution?

... /deepspeed/runtime/bf16_optimizer.py", line 312, in step
    rank0: assert all_groups_norm > 0.

deepspeed 0.15.0, transformers 4.44.2

shachoi commented 1 month ago


I resolved this issue. In my case, the cause was not related to the versions of DeepSpeed, Transformers, or other dependencies. The problem was the model checkpoint for "clip-vit-large-patch14", which seemed to be corrupted, though I'm not sure why. After re-downloading it from Hugging Face, the issue was clearly resolved.

The issue above occurred when I used DeepSpeed with zero0.json. When I used zero1 or zero2, the loss was zero and grad_norm was NaN instead.

Now, all issues are resolved after re-downloading the CLIP model.
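For anyone else hitting the corrupted-checkpoint variant of this: one way to force a clean copy is to re-download the snapshot while bypassing the local cache, for example with huggingface_hub (the openai/clip-vit-large-patch14 repo id below is assumed from the comment above):

# Force a fresh download of the CLIP checkpoint, bypassing any possibly
# corrupted files in the local cache.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="openai/clip-vit-large-patch14",  # repo id assumed from the comment above
    force_download=True,                      # ignore the cache and re-fetch everything
)
print("re-downloaded to", local_path)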

ZhiyuanChen commented 1 month ago

What's your weight_decay setting? 1e-2 can be too large for certain tasks, especially in highly unbalanced classification tasks.
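If weight decay turns out to be the culprit, it can be lowered directly in the optimizer section of the DeepSpeed config. A hypothetical example mirroring the config from the report, with only the weight_decay value changed:

# Hypothetical optimizer section with a smaller weight decay; everything else
# from the config in the report stays unchanged. The 1.0e-3 value is only an
# example, not a recommendation for any particular model.
ds_optimizer_cfg = {
    "type": "Adam",
    "params": {
        "lr": 1.0e-4,
        "betas": [0.9, 0.985],
        "eps": 1.0e-8,
        "weight_decay": 1.0e-3,  # the report uses 0.05
    },
}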