microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

BF16 Support #974

Open sdtblck opened 3 years ago

sdtblck commented 3 years ago

Are there any plans to support bf16 training in DeepSpeed in the near future? If not, could someone guide me toward what I would need to change in order to implement it? It seems like a fair number of things in DeepSpeed depend heavily on fp16.

samyam commented 3 years ago

@sdtblck This has been on our radar, but we haven't had the bandwidth to tackle it. I would be happy to guide you if you want to take a pass at it. And we would really appreciate the help.

There are a few pieces involved in doing this:

  1. DeepSpeed config changes to allow the fp16 config to take a type which will default to float16, but can be changed to bfloat16.
  2. Change all the places in engine.py, stage1.py, stage2.py, stage3.py, partitioned_parameters.py, fused_optimizer.py and unfused_optimizer.py that hard-code .half() calls so they call either .half() or .bfloat16() based on the config passed.
  3. Perform gradient accumulation in fp32 instead of fp16
  4. Ensure all the communication happens in fp32 or fp16. Currently NCCL does not support bfloat16 communication.
  5. Hard-code the loss_scale to 1.0 if bfloat16 is enabled.

My suggestion would be to start with ZeRO Stage 2 first, which would require making changes primarily just to engine.py and stage2.py, and then move from there to supporting the other options. Let me know if you have any questions.
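
As an illustration of items 1 and 2, a minimal sketch of what a config-driven dtype switch could look like (the helper names and the "type" config field here are hypothetical, not DeepSpeed's actual API):

    import torch

    def get_reduced_dtype(config: dict) -> torch.dtype:
        # Hypothetical config layout: the fp16 section gains an optional "type"
        # field that defaults to "float16" but may be set to "bfloat16".
        fp16_section = config.get("fp16", {})
        if not fp16_section.get("enabled", False):
            return torch.float32
        if fp16_section.get("type", "float16") == "bfloat16":
            return torch.bfloat16
        return torch.float16

    # Instead of hard-coded module.half() calls, cast based on the config:
    def cast_module(module: torch.nn.Module, config: dict) -> torch.nn.Module:
        return module.to(dtype=get_reduced_dtype(config))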

raa2463 commented 3 years ago

@sdtblck @cli99 Did you make any progress on supporting bfloat16 in DeepSpeed? @samyam If not, could you guide me further if I run into issues while working on this task? As a starter:

  1. Can you point me to where the DeepSpeed config changes need to happen? Is there an example config file in the repo?
  2. Do the gradient accumulation changes happen in allreduce_gradients() function and the functions it calls?
  3. To ensure comm happens in FP32, should I basically just convert the tensors to FP32 before all_reduce (and other comm primitives), or is there more to it than that? Thank you!
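
For reference, up-casting a gradient to FP32 just for the collective could look roughly like this (a sketch using plain torch.distributed with an already-initialized process group, not DeepSpeed's actual reduction path):

    import torch
    import torch.distributed as dist

    def all_reduce_in_fp32(grad: torch.Tensor) -> None:
        # Communicate in FP32 even if the gradient is stored in fp16/bf16.
        grad32 = grad.detach().to(torch.float32, copy=True)
        dist.all_reduce(grad32, op=dist.ReduceOp.SUM)
        grad32.div_(dist.get_world_size())
        # Copy the averaged result back into the original lower-precision buffer.
        grad.copy_(grad32.to(grad.dtype))
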
gahdritz commented 3 years ago

Any update on this?

lhatsk commented 2 years ago

What's the status of this?

"bfloat16": { "enabled": true },

Doesn't seem to have any effect. Model / data remains in fp32.

rohitgr7 commented 2 years ago

I think it's:

"bf16": { "enabled": true }
stas00 commented 2 years ago

The config file doesn't get validated, so if you make a typo - it just gets silently ignored and the defaults are used.

I reported this more than a year ago: https://github.com/microsoft/DeepSpeed/issues/653

Perhaps now is a good time to add a validator option and probably enable it by default.

@jeffra

raa2463 commented 2 years ago

I think it's:

"bf16": { "enabled": true }

https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/constants.py#L122 I think in newer versions of DeepSpeed you can use either key.
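
For what it's worth, a minimal config that enables bf16 when passed as a dict to deepspeed.initialize might look like this (batch sizes and the model are placeholder values, and the "bfloat16" alias is per the constants.py link above):

    import torch
    import deepspeed

    ds_config = {
        "train_batch_size": 16,
        "train_micro_batch_size_per_gpu": 16,
        "bf16": {"enabled": True},  # newer releases also accept "bfloat16" as the key
        "zero_optimization": {"stage": 2},
    }

    model = torch.nn.Linear(8, 8)  # placeholder model
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )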

mynewstart commented 1 year ago

I think it's:

"bf16": { "enabled": true }

Do we need to revise the optimizer? I found that no parameters are updated.

tjruwase commented 1 year ago

@mynewstart, can you please clarify your question?

In general, bf16 is now fully supported in DeepSpeed. This issue can now be closed.

mynewstart commented 1 year ago

Hi @tjruwase, I am using DeepSpeed ZeRO-2 + FP16 to fine-tune a 13B model on an A100 80GB instance. I hit the exception "Current loss scale already at minimum", which could be due to some parameters in the network overflowing or underflowing to zero, resulting in NaN values during gradient computation. So I decided to change the ds_config and use bf16, but I found the perplexity on the evaluation dataset didn't change. After some deeper digging, I found the model parameters were not being updated.

So I'm wondering if I need to modify any other code when switching to bf16. My fine-tuning code is based on DeepSpeed Chat, and the parameters in ds_config are as follows:

    zero_opt_dict = {
        "stage": stage,
        "offload_param": {"device": device},
        "offload_optimizer": {"device": device},
        "stage3_param_persistence_threshold": 1e4,
        "stage3_max_live_parameters": 3e7,
        "stage3_prefetch_bucket_size": 3e7,
        "memory_efficient_linear": False,
    }

    "train_batch_size": GLOBAL_BATCH_SIZE,
    "train_micro_batch_size_per_gpu": MICRO_BATCH_SIZE,
    "steps_per_print": 10,
    "zero_optimization": zero_opt_dict,
    "fp16": {"enabled": False},
    "bfloat16": {"enabled": True},
    "gradient_clipping": 1.0,
    "prescale_gradients": False,
    "wall_clock_breakdown": False,
    "hybrid_engine": {
        "enabled": enable_hybrid_engine,
        "inference_tp_size": inference_tp_size,
        "release_inference_cache": release_inference_cache,
        "pin_parameters": pin_parameters,
        "tp_gather_partition_size": tp_gather_partition_size,
    }

mynewstart commented 1 year ago

OK, I found it will overflow even if I use bf16/fp32; it may be some other reason causing this problem.

gz-d commented 1 year ago

OK, I found it will overflow even if I use bf16/fp32; it may be some other reason causing this problem.

@mynewstart Did you fix it? Thx

ethansmith2000 commented 8 months ago

I am finding that enabling bf16 puts the actual trainable weights in bf16 precision. Does this seem like it may be a problem?

stas00 commented 8 months ago

That's how DeepSpeed does it. It's different from how torch's AMP implements it, but it works.
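
For contrast, torch's native AMP approach keeps the master weights in fp32 and only runs compute in bf16 via autocast, roughly like this (a minimal sketch of the PyTorch side, not DeepSpeed's mechanism):

    import torch

    model = torch.nn.Linear(16, 16).cuda()  # weights stay in fp32
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    x = torch.randn(4, 16, device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        # Matmuls run in bf16 inside the autocast region; the weights remain fp32.
        loss = model(x).float().pow(2).mean()
    loss.backward()
    optimizer.step()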

ethansmith2000 commented 8 months ago

BF16 seemed to show a loss of performance when tested here: https://github.com/microsoft/DeepSpeed/issues/4904