sdtblck opened this issue 3 years ago

Are there any plans to support bf16 training in DeepSpeed in the near future? If not, could someone guide me toward what I would need to change in order to implement it? It seems like a fair few things in DeepSpeed depend on fp16 to a large degree.
@sdtblck This has been on our radar, but we haven't had the bandwidth to tackle it. I would be happy to guide you if you want to take a pass at it. And we would really appreciate the help.
There are a few pieces involved in doing this. My suggestion would be to start with ZeRO Stage 2, which would require changes primarily to engine.py and stage2.py, and then move from there to supporting the other options. Let me know if you have any questions.
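For orientation, here is a rough sketch (not actual DeepSpeed code) of the kind of dtype plumbing involved — those files currently assume fp16, and bf16 support means selecting the low-precision dtype from the config instead of hardcoding it:

```python
import torch

# Hypothetical helper: pick the engine's low-precision dtype from the
# ds_config sections instead of hardcoding torch.half everywhere.
def get_compute_dtype(config: dict) -> torch.dtype:
    if config.get("bf16", {}).get("enabled", False):
        return torch.bfloat16
    if config.get("fp16", {}).get("enabled", False):
        return torch.float16
    return torch.float32
```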
@sdtblck @cli99 Did you make any progress on supporting bfloat16 in DeepSpeed? @samyam If not, could you guide me further if I run into issues while working on this task?
Any update on this?
What's the status of this?
"bfloat16": { "enabled": true },
Doesn't seem to have any effect. Model / data remains in fp32.
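A quick way to confirm what precision actually got applied (a sketch; `engine` stands for whatever `deepspeed.initialize` returned):

```python
# Print the dtypes of the wrapped model's parameters to confirm
# whether the bf16 setting actually took effect.
for name, param in engine.module.named_parameters():
    print(name, param.dtype)  # expect torch.bfloat16 if bf16 is on
```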
I think it's:

```json
"bf16": {
  "enabled": true
}
```
The config file doesn't get validated, so if you make a typo it just gets silently ignored and the defaults are used.
I reported this more than a year ago: https://github.com/microsoft/DeepSpeed/issues/653
Perhaps now is a good time to add a validator option, and probably to enable it by default.
@jeffra
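Something along these lines would be enough (a hypothetical sketch, not DeepSpeed code; a real version would derive the key set from the config schema):

```python
# Hypothetical validator: reject unknown top-level config keys
# instead of silently falling back to defaults.
KNOWN_TOP_LEVEL_KEYS = {
    "train_batch_size", "train_micro_batch_size_per_gpu",
    "fp16", "bf16", "zero_optimization", "optimizer",
    "gradient_clipping", "steps_per_print",
}

def validate_ds_config(config: dict) -> None:
    unknown = set(config) - KNOWN_TOP_LEVEL_KEYS
    if unknown:
        raise ValueError(
            f"Unrecognized DeepSpeed config keys: {sorted(unknown)}. "
            "Check for typos (e.g. 'bfloat16' vs 'bf16')."
        )
```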
> I think it's:
> `"bf16": { "enabled": true }`
https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/constants.py#L122 — I think in newer versions of DeepSpeed you can use either key.
> I think it's:
> `"bf16": { "enabled": true }`
Do we need to revise the optimizer? I found that no parameters are updated.
@mynewstart, can you please clarify your question?
In general, bf16 is now fully supported in DeepSpeed. This issue can now be closed.
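For reference, a minimal sketch of a working bf16 setup (the model, batch size, and learning rate below are placeholders):

```python
import torch
import deepspeed

# Placeholder model; any torch.nn.Module works the same way.
model = torch.nn.Linear(1024, 1024)

ds_config = {
    "train_batch_size": 8,
    "bf16": {"enabled": True},           # note the key: "bf16"
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# engine.module's parameters should now be torch.bfloat16
```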
Hi @tjruwase, I am using DeepSpeed ZeRO-2 + fp16 to fine-tune a 13B model on an A100 80G instance. I hit the exception "Current loss scale already at minimum"; this could be due to some values in the network overflowing, resulting in NaN values during gradient computation. So I decided to change the ds_config to use bf16, but the perplexity on the evaluation dataset didn't change. After some deep diving, I found the model parameters were not being updated.
So I'm wondering if I need to modify any other code when switching to bf16. My fine-tuning code is based on DeepSpeed Chat, and the parameters in ds_config are as follows:
```python
zero_opt_dict = {
    "stage": stage,
    "offload_param": {"device": device},
    "offload_optimizer": {"device": device},
    "stage3_param_persistence_threshold": 1e4,
    "stage3_max_live_parameters": 3e7,
    "stage3_prefetch_bucket_size": 3e7,
    "memory_efficient_linear": False,
}

ds_config = {
    "train_batch_size": GLOBAL_BATCH_SIZE,
    "train_micro_batch_size_per_gpu": MICRO_BATCH_SIZE,
    "steps_per_print": 10,
    "zero_optimization": zero_opt_dict,
    "fp16": {"enabled": False},
    "bfloat16": {"enabled": True},
    "gradient_clipping": 1.0,
    "prescale_gradients": False,
    "wall_clock_breakdown": False,
    "hybrid_engine": {
        "enabled": enable_hybrid_engine,
        "inference_tp_size": inference_tp_size,
        "release_inference_cache": release_inference_cache,
        "pin_parameters": pin_parameters,
        "tp_gather_partition_size": tp_gather_partition_size,
    },
}
```
Ok, I found it will overflow even if I use bf16/fp32, so some other cause may be behind this problem.
@mynewstart Did you fix it? Thx
I am finding that enabling bf16 puts the actual trainable weights at bf16 precision; does this seem like it could be a problem?
That's how DeepSpeed does it. It's different from how torch's AMP implements it, but it works.
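For contrast, a minimal sketch of the torch AMP approach, where the stored weights stay fp32 and only ops inside the autocast region run in bf16 (assumes a CUDA device is available):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()  # weights remain torch.float32
x = torch.randn(8, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)  # the matmul runs in bf16; stored weights keep their dtype

# DeepSpeed's bf16 mode instead casts the module weights themselves to bf16,
# keeping fp32 master copies inside the optimizer for the update step.
```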
BF16 seemed to show a loss of performance when tested here: https://github.com/microsoft/DeepSpeed/issues/4904