microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Add explicit gradient_accumulation_dtype config #1835

Open jeffra opened 2 years ago

jeffra commented 2 years ago

DeepSpeed now supports several dtypes (i.e., fp32, fp16, bf16). However, it's becoming less clear which parts of training use which dtypes, and when. For example, in #1801 we added support for BF16 training + FP32 gradient accumulation and optimizer state sharding (ZeRO stage 1) when pipeline parallelism is enabled. This is only triggered if your config matches the following scenario:

Note: PP is enabled on the client side and not in the ds_config, but whether or not it is used also determines which code paths are supported.

# pipeline-parallelism: enabled
"bf16": {
   "enabled": true
},
"zero_optimization": {
    "stage": 0
}

--> BF16 training + FP32 gradient accumulation + ZeRO stage 1 optimizer sharding via deepspeed/runtime/bf16_optimizer.py

# pipeline-parallelism: enabled
"bf16": {
   "enabled": true
},
"zero_optimization": {
    "stage": 1
}

--> BF16 training + BF16 gradient accumulation + ZeRO stage 1 optimizer sharding via deepspeed/runtime/zero/stage_1_and_2.py

The proposal is to introduce a config like the following:

"bf16": {
   "enabled": true
},
"gradient_accumulation_dtype": "fp32",
"zero_optimization": {
    "stage": 1
}

--> the selected dtype explicitly determines which ZeRO implementation is used.

In other words, the proposal is to add a new option to the ds_config: gradient_accumulation_dtype. We would then dispatch to the right version of ZeRO depending on the mode selected by the user, making it more explicit what is happening.

I've started a table to try to express all of these possible cases and which ones would be supported and which would not. It feels a bit overly complicated in some ways, however. This also doesn't consider cases where ZeRO is disabled ("stage": 0).

| bf16 | fp16 | grad-accum-dtype | PP | ZeRO (1, 2, 3) | Result | ZeRO implementation |
| --- | --- | --- | --- | --- | --- | --- |
| T | T | * | * | * | Error | |
| T | F | fp16 | * | * | NotSupported | |
| T | F | bf16 | * | * | OKAY | stage_1_and_2.py |
| T | F | fp32 | T | 1 | OKAY | bf16_optimizer.py |
| T | F | fp32 | F | 1 | NotSupported | |
| T | F | fp32 | * | 2 or 3 | NotSupported | |
| F | T | fp16 | * | * | OKAY | stage_1_and_2.py |
| F | T | bf16 or fp32 | * | * | NotSupported | |
| F | F | fp32 | * | * | OKAY | stage_1_and_2.py |
| F | F | bf16 or fp16 | * | * | NotSupported | |
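
For illustration, the table can be read as a single dispatch/validation step along the following lines (a rough sketch only - the function name and structure are made up here, not the actual DeepSpeed code, and the ZeRO-disabled "stage": 0 case is left out, as in the table):

def select_zero_impl(bf16, fp16, grad_accum_dtype, pipeline_parallel, zero_stage):
    """Sketch of the dispatch implied by the table above: return the ZeRO
    implementation file to use, or raise for unsupported combinations."""
    if bf16 and fp16:
        raise ValueError("bf16 and fp16 cannot both be enabled")

    if bf16:
        if grad_accum_dtype == "bf16":
            return "stage_1_and_2.py"      # BF16 grads, accumulated in BF16
        if grad_accum_dtype == "fp32":
            if pipeline_parallel and zero_stage == 1:
                return "bf16_optimizer.py" # BF16 + FP32 accumulation + ZeRO-1
            raise NotImplementedError("bf16 + fp32 accumulation needs PP and ZeRO stage 1")
        raise NotImplementedError("bf16 + fp16 accumulation is not supported")

    if fp16:
        if grad_accum_dtype == "fp16":
            return "stage_1_and_2.py"      # FP16 grads, accumulated in FP16
        raise NotImplementedError("fp16 training only supports fp16 accumulation")

    # fp32 training
    if grad_accum_dtype == "fp32":
        return "stage_1_and_2.py"
    raise NotImplementedError("fp32 training only supports fp32 accumulation")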

Note: this is a WIP but I don't want to lose our progress on this discussion.

jeffra commented 2 years ago

I should also note that I think we have similar complexities with communication_data_type, which we only support in some configs - and I'm not sure we error out explicitly in the cases we don't support.

https://github.com/microsoft/DeepSpeed/blob/41d90830e2d78c154d560f36c8273ca2f889bbfb/deepspeed/runtime/engine.py#L723-L733
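
For illustration, an explicit check could look something like the sketch below (purely hypothetical - the helper name and the allowed combinations are placeholders, not what engine.py actually does):

# Hypothetical early validation: fail loudly on an unsupported
# communication_data_type instead of silently falling back.
SUPPORTED_COMM_DTYPES = {
    # training dtype -> allowed communication_data_type values (assumed for this sketch)
    "fp16": {"fp16", "fp32"},
    "bf16": {"bf16", "fp32"},
    "fp32": {"fp32"},
}

def check_communication_data_type(train_dtype, comm_dtype):
    if comm_dtype is None:
        return  # default: communicate in the training dtype
    allowed = SUPPORTED_COMM_DTYPES.get(train_dtype, set())
    if comm_dtype not in allowed:
        raise ValueError(
            f"communication_data_type={comm_dtype!r} is not supported with "
            f"{train_dtype} training; allowed values: {sorted(allowed)}"
        )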

stas00 commented 2 years ago

There is one more dimension to this design discussion - and it's whether the additional accumulator is sharded or not.

e.g. currently the bf16 optimizer allocates a local fp32 accumulator on each gpu, which is quite expensive: 4 bytes per parameter held on that gpu. It happens to work great on the massive 80GB A100, but it could be too expensive on 40GB cards.

So the other dimension that needs to be configurable is whether it is sharded or not. Of course, sharding will come at the overhead of additional comms.

So you probably want:

"gradient_accumulation_dtype": {
    "dtype": ["fp32"|"fp16"|"bf16"],
    "sharded": [true|false],
},

with the special case of grad_accum_dtype == dtype, where it's automatically sharded and it shouldn't be possible to override that, at least in the current code base. But perhaps sharding could be turned off in the future to speed things up if someone has extra memory.

e.g. currently at BigScience we are at about 60GB out of 80GB - surely it'd have been great if we could have localized/unsharded some more tensors, lessened the comms, and gained a bit higher throughput.
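
For a rough sense of the numbers involved, a back-of-the-envelope sketch (the per-GPU parameter count and data-parallel size below are made-up example values):

def fp32_accum_bytes(params_on_gpu, sharded, dp_world_size):
    """FP32 gradient accumulation buffer: 4 bytes per local parameter,
    either replicated on every rank or sharded across the data-parallel group."""
    total = 4 * params_on_gpu
    return total / dp_world_size if sharded else total

# Example: 10B parameters held on a GPU, data-parallel group of 8
params = 10e9
print(fp32_accum_bytes(params, sharded=False, dp_world_size=8) / 2**30)  # ~37 GiB per rank
print(fp32_accum_bytes(params, sharded=True, dp_world_size=8) / 2**30)   # ~4.7 GiB per rank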

assij commented 2 years ago

Hi, using https://github.com/microsoft/Megatron-DeepSpeed, DeepSpeed doesn't work for pipeline + ZeRO-1 + bfloat16.

When using the script in examples/run_deepspeed_example.sh with ZeRO-1 and bfloat16 (the script works with fp16), I get the following error:

File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 768, in _exec_backward_pass
    self.optimizer.clear_lp_grads()
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'clear_lp_grads'

The run_deepspeed_example.sh is attached:

#!/bin/bash
set -ex

BASE_PATH=/vc_data/Megatron-LM/data
DATA_PATH=${BASE_PATH}/indexed_datasets/megatron
DS_CONFIG=ds_config.json

TP=2
PP=2
NLAYERS=24
HIDDEN=512

GLOBAL_BATCH=64
MICRO_BATCH=4

ZERO_STAGE=1

OUTPUT_DIR=ds_z${ZERO_STAGE}_nl${NLAYERS}_hs${HIDDEN}_g${GLOBAL_BATCH}_mb${MICRO_BATCH}
#OUTPUT_DIR=baseline_nl${NLAYERS}_hs${HIDDEN}_g${GLOBAL_BATCH}_mb${MICRO_BATCH}
mkdir -p $OUTPUT_DIR

cat <<EOT > $DS_CONFIG
{
  "train_batch_size" : $GLOBAL_BATCH,
  "train_micro_batch_size_per_gpu": $MICRO_BATCH,
  "steps_per_print": 1,

  "zero_optimization": {
    "stage": $ZERO_STAGE
  },

  "bf16": {"enabled": true},

  "wall_clock_breakdown" : true
}
EOT

export NCCL_DEBUG=warn

ds_args=""
ds_args=" --deepspeed ${ds_args}"
#ds_args=" --no-pipeline-parallel ${ds_args}"
ds_args=" --deepspeed_config=$DS_CONFIG ${ds_args}"
ds_args=" --zero-stage=$ZERO_STAGE ${ds_args}"
ds_args=" --deepspeed-activation-checkpointing ${ds_args}"

deepspeed pretrain_gpt.py --tensor-model-parallel-size $TP --pipeline-model-parallel-size $PP --num-layers $NLAYERS --hidden-size $HIDDEN --num-attention-heads 16 --seq-length 256 --loss-scale 12 --max-position-embeddings 1024 --micro-batch-size 4 --global-batch-size 1024 --train-iters 1000 --lr 6.0e-5 --min-lr 6.0e-6 --lr-decay-style cosine --log-interval 1 --eval-iters 40 --eval-interval 1000 --data-path $DATA_PATH --vocab-file $BASE_PATH/gpt2-vocab.json --merge-file $BASE_PATH/gpt2-merges.txt --save-interval 1000 --split 98,2,0 --clip-grad 1.0 --weight-decay 0.1 --adam-beta1 0.9 --adam-beta2 0.95 --init-method-std 0.006 --bf16 --checkpoint-activations --tensorboard-dir $OUTPUT_DIR $ds_args --exit-interval 5000 | tee ${OUTPUT_DIR}/output.log

tjruwase commented 2 years ago

@assij, try setting zero_stage to 0 instead of 1
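
Concretely, with ZERO_STAGE=0 the attached script generates a ds_config that keeps bf16 but disables ZeRO sharding; sketched here as a Python dict for clarity (values taken from the script above), this is the combination that routes to bf16_optimizer.py when pipeline parallelism is enabled:

# ds_config produced by run_deepspeed_example.sh with ZERO_STAGE=0
ds_config = {
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 4,
    "steps_per_print": 1,
    "zero_optimization": {"stage": 0},
    "bf16": {"enabled": True},
    "wall_clock_breakdown": True,
}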

assij commented 2 years ago

The code works with ZeRO stage 0, however I would like to use ZeRO stage 1 in order to shard the optimizer states/calculations. When will pipeline + ZeRO stage 1 + bfloat16 be supported?

stas00 commented 2 years ago

BF16Optimizer is effectively ZeRO stage 1, but currently it's a bit of a hack and thus uses stage=0. It's implemented differently, so it can't be enabled as normal stage 1 - that slot is already taken by the existing bf16/stage-1 path, which is a different beast: it accumulates gradients in bf16, which you don't want since it's less precise and the training won't be as smooth.

Of course, it'd be great to find an intuitive solution here. But do not worry and use stage=0 here for now.

tjruwase commented 1 year ago

@stas00, @assij we have added initial support covering the combinations in the table. We would appreciate help testing the combinations that matter to you.

ys950902 commented 1 year ago

Hi, may I ask a question: using https://github.com/microsoft/DeepSpeed, pipeline + ZeRO-1 + bfloat16 still doesn't work. It does work with pipeline + ZeRO-0 + bfloat16, and with pipeline + ZeRO-1 + fp16. Is this expected behavior for BF16Optimizer in DeepSpeed, or do I need some extra settings for pipeline + ZeRO-1 + bfloat16? I would be grateful for any suggestions.

Desein-Yang commented 1 year ago

Hello, I am using PP + ZeRO-0/1 + bf16 but it still doesn't work. With deepspeed==0.9.2 and the bf16 setting, it reports that the optimizer doesn't have the attribute clear_lp_grads.