microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

GPT-2 with pipeline parallel and bfloat16 doesn't work #58

Open assij opened 2 years ago

assij commented 2 years ago

Hi, when using the script in examples/run_deepspeed_example.sh with ZeRO stage 1 and bfloat16 (the script works with fp16), I get the following error:

```
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 768, in _exec_backward_pass
    self.optimizer.clear_lp_grads()
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'clear_lp_grads'
```

The run_deepspeed_example.sh is attached:

```bash
#!/bin/bash
set -ex

BASE_PATH=/vc_data/Megatron-LM/data
DATA_PATH=${BASE_PATH}/indexed_datasets/megatron
DS_CONFIG=ds_config.json

TP=2
PP=2
NLAYERS=24
HIDDEN=512

GLOBAL_BATCH=64
MICRO_BATCH=4

ZERO_STAGE=1

OUTPUT_DIR=ds_z${ZERO_STAGE}_nl${NLAYERS}_hs${HIDDEN}_gb${GLOBAL_BATCH}_mb${MICRO_BATCH}
#OUTPUT_DIR=baseline_nl${NLAYERS}_hs${HIDDEN}_gb${GLOBAL_BATCH}_mb${MICRO_BATCH}
mkdir -p $OUTPUT_DIR

cat <<EOT > $DS_CONFIG
{
  "train_batch_size" : $GLOBAL_BATCH,
  "train_micro_batch_size_per_gpu": $MICRO_BATCH,
  "steps_per_print": 1,

  "zero_optimization": {
    "stage": $ZERO_STAGE
  },

  "bf16": {"enabled": true},

  "wall_clock_breakdown" : true
}
EOT

export NCCL_DEBUG=warn

ds_args=""
ds_args=" --deepspeed ${ds_args}"
ds_args=" --no-pipeline-parallel ${ds_args}"
ds_args=" --deepspeed_config=$DS_CONFIG ${ds_args}"
ds_args=" --zero-stage=$ZERO_STAGE ${ds_args}"
ds_args=" --deepspeed-activation-checkpointing ${ds_args}"

deepspeed pretrain_gpt.py \
    --tensor-model-parallel-size $TP \
    --pipeline-model-parallel-size $PP \
    --num-layers $NLAYERS \
    --hidden-size $HIDDEN \
    --num-attention-heads 16 \
    --seq-length 256 \
    --loss-scale 12 \
    --max-position-embeddings 1024 \
    --micro-batch-size 4 \
    --global-batch-size 1024 \
    --train-iters 1000 \
    --lr 6.0e-5 \
    --min-lr 6.0e-6 \
    --lr-decay-style cosine \
    --log-interval 1 \
    --eval-iters 40 \
    --eval-interval 1000 \
    --data-path $DATA_PATH \
    --vocab-file $BASE_PATH/gpt2-vocab.json \
    --merge-file $BASE_PATH/gpt2-merges.txt \
    --save-interval 1000 \
    --split 98,2,0 \
    --clip-grad 1.0 \
    --weight-decay 0.1 \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --init-method-std 0.006 \
    --bf16 \
    --checkpoint-activations \
    --tensorboard-dir $OUTPUT_DIR \
    $ds_args \
    --exit-interval 5000 | tee ${OUTPUT_DIR}/output.log
```
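The report above notes the same script runs fine with fp16. For comparison, a minimal sketch of the fp16 variant, assuming only the precision settings change and everything else stays as above:

```bash
# fp16 variant that reportedly works (sketch, not verified here):
# swap the "bf16" block in the generated config for an "fp16" block...
cat <<EOT > $DS_CONFIG
{
  "train_batch_size": $GLOBAL_BATCH,
  "train_micro_batch_size_per_gpu": $MICRO_BATCH,
  "steps_per_print": 1,
  "zero_optimization": { "stage": $ZERO_STAGE },
  "fp16": { "enabled": true },
  "wall_clock_breakdown": true
}
EOT
# ...and pass --fp16 instead of --bf16 to pretrain_gpt.py
```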

bestbzw commented 1 year ago

I also hit this problem. Have you solved it?

SingL3 commented 1 year ago

Same problem here. Is there any solution?

SingL3 commented 1 year ago

I found a solution here: https://github.com/microsoft/DeepSpeed/issues/3693
Hope this helps!

SefaZeng commented 10 months ago

> I found a solution here: microsoft/DeepSpeed#3693
> Hope this helps!

It seems "data_types": { "grad_accum_dtype": "fp32" } will consume much more GPU memory. It raises a CUDA OOM error to me.