microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Illegal Memory Access Error when running a llama2 pretraining script #202

Open imgaojun opened 1 year ago

imgaojun commented 1 year ago

While running a llama2 pretraining script with specific configurations, I encountered an illegal memory access error. The detailed error message is as follows:

[2023-08-09 07:56:18,503] [INFO] [engine.py:83:__init__] CONFIG: micro_batches=32 micro_batch_size=2                                             
Traceback (most recent call last):                                                                                                                
  File "/mnt/cephfs/nlp_group/gaojun/projects/CodeLLM/llm-sft/Megatron-DeepSpeed/pretrain_gpt.py", line 342, in <module>                         
    pretrain(train_valid_test_datasets_provider,                                                                                                 
  File "/mnt/cephfs/nlp_group/gaojun/projects/CodeLLM/llm-sft/Megatron-DeepSpeed/megatron/training.py", line 135, in pretrain                    
    model, optimizer, opt_param_scheduler = setup_model_and_optimizer(                            
  File "/mnt/cephfs/nlp_group/gaojun/projects/CodeLLM/llm-sft/Megatron-DeepSpeed/megatron/training.py", line 579, in setup_model_and_optimizer
    model, optimizer, _, opt_param_scheduler = deepspeed.initialize(                          
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 186, in initialize                                                   
    engine = PipelineEngine(args=args,                                                                                                           
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/pipe/engine.py", line 132, in __init__
    params_tensor = torch.LongTensor(data=[num_params, unique_params]).to(self.device)                                                           
RuntimeError: CUDA error: an illegal memory access was encountered                                                                                
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.                           
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.                                                                                            
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.                                                                               

terminate called after throwing an instance of 'c10::Error'                                                                                       
  what():  CUDA error: an illegal memory access was encountered                                                                                   
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.                           
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.                                                                                            
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
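Since the error is reported asynchronously, the stack trace above may not point at the real faulting call. A minimal debugging sketch for forcing synchronous launches (assuming it runs before torch creates a CUDA context, e.g. at the very top of pretrain_gpt.py, or equivalently exported in the launch script below):

# Debugging sketch only: make CUDA launches synchronous so the reported
# stack trace points at the actual faulting kernel. This must be set before
# any CUDA context is created, otherwise it has no effect.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"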

Configuration Details

export CUDA_VISIBLE_DEVICES=2,3,6,7
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_IB_DISABLE=1

TP=2
PP=2
ZERO_STAGE=1

GPUS_PER_NODE=4
NNODES=1
JOB_ID=8899
ENDPOINT=localhost:6000
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

######################################
# Change the below configurations here
HIDDEN_SIZE=5120 # e.g. llama-13b: 5120
FFN_HIDDEN_SIZE=13824 # e.g. llama-13b: 13824
NUM_LAYERS=40 # e.g. llama-13b: 40
NUM_HEADS=40 # e.g. llama-13b: 40
SEQ_LENGTH=4096
NUM_KV_HEADS=40 # llama2 70B uses GQA

TRAIN_ITERS=250000
SAVE_INTERVAL=1000
EVAL_INTERVAL=${SAVE_INTERVAL}
LOG_INTERVAL=1

MICRO_BATCH_SIZE=2
GLOBAL_BATCH_SIZE=64 # e.g. llama: 4M tokens
LR=2e-5
MIN_LR=2e-6
LR_WARMUP_STEPS=2000
WEIGHT_DECAY=0.0
GRAD_CLIP=1
DTYPE="bf16"

cat <<EOT > $DS_CONFIG
{
  "train_batch_size" : $GLOBAL_BATCH_SIZE,
  "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
  "steps_per_print": 1,
  "zero_optimization": {
    "stage": $ZERO_STAGE
  },
  "bf16": {
    "enabled": true
  }
}
EOT

DEEPSPEED_ARGS="
    --deepspeed \
    --deepspeed_config $DS_CONFIG \
    --zero-stage $ZERO_STAGE \
    --deepspeed-activation-checkpointing
"

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --rdzv-id $JOB_ID \
    --rdzv-backend c10d \
    --rdzv-endpoint ${ENDPOINT}
"

GPT_ARGS="
    --tensor-model-parallel-size $TP \
    --pipeline-model-parallel-size $PP \
    --num-layers $NUM_LAYERS \
    --hidden-size $HIDDEN_SIZE \
    --ffn-hidden-size $FFN_HIDDEN_SIZE \
    --num-attention-heads $NUM_HEADS \
    --micro-batch-size $MICRO_BATCH_SIZE \
    --global-batch-size $GLOBAL_BATCH_SIZE \
    --seq-length $SEQ_LENGTH \
    --max-position-embeddings $SEQ_LENGTH \
    --train-iters $TRAIN_ITERS \
    --lr $LR \
    --lr-decay-style cosine \
    --min-lr $MIN_LR \
    --weight-decay $WEIGHT_DECAY \
    --clip-grad $GRAD_CLIP \
    --lr-warmup-iters $LR_WARMUP_STEPS \
    --optimizer adam \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --${DTYPE} \
    --no-query-key-layer-scaling \
    --attention-dropout 0 \
    --hidden-dropout 0 \
    --use-rotary-position-embeddings \
    --untie-embeddings-and-output-weights \
    --swiglu \
    --normalization rmsnorm \
    --disable-bias-linear \
    --num-key-value-heads $NUM_KV_HEADS
"

DATA_ARGS="
    --data-path ${DATASET} \
    --tokenizer-type $TOKENIZER_TYPE \
    --tokenizer-model $TOKENIZER_PATH \
    --data-impl mmap \
    --split 100,0,0 \
"

OUTPUT_ARGS="
    --log-interval $LOG_INTERVAL \
    --save-interval $SAVE_INTERVAL \
    --eval-interval $EVAL_INTERVAL \
    --log-batch-size-to-tensorboard \
    --log-validation-ppl-to-tensorboard \
    --log-memory-to-tensorboard \
    --log-world-size-to-tensorboard \
    --tensorboard-dir $LOG_DIR
"

torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    $DEEPSPEED_ARGS \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --distributed-backend nccl \
    --save $SAVE_PATH \
    --load $LOAD_PATH

KenwayZZZ commented 1 year ago

Same error. I ran into this problem when I set TP=1 with bf16. By the way, do you also hit the assertion assert all_groups_norm > 0 when using bf16?

liguodongiot commented 1 year ago

I'm hitting the same problem, too.

Godricly commented 1 year ago

A quick but ugly workaround is to comment out the Apex Adam and use the nightly-build PyTorch Adam with the fused option set to true in this file: https://github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/optimizer/__init__.py.
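A rough sketch of what that swap might look like (assumptions: a recent PyTorch whose AdamW accepts fused=True; the usual Megatron args fields lr, weight_decay, adam_beta1, adam_beta2, adam_eps; and a made-up helper name that stands in for the Apex FusedAdam construction in get_megatron_optimizer, not the upstream code verbatim):

# Workaround sketch, not the upstream megatron/optimizer/__init__.py verbatim.
# The Apex optimizer is imported there roughly as
#     from apex.optimizers import FusedAdam as Adam
# and built from param_groups; the idea is to build torch's fused AdamW instead.
from torch.optim import AdamW


def build_torch_fused_adamw(param_groups, args):
    # Hypothetical drop-in for the Apex FusedAdam construction.
    return AdamW(
        param_groups,
        lr=args.lr,
        weight_decay=args.weight_decay,
        betas=(args.adam_beta1, args.adam_beta2),
        eps=args.adam_eps,
        fused=True,  # fused CUDA kernel available in recent torch builds
    )

AdamW here mirrors the torch.AdamW(fused=True) mentioned later in this thread; plain torch.optim.Adam also accepts fused=True if the non-decoupled variant is wanted.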

butsugiri commented 1 year ago

I tried this workaround: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/249

heroes999 commented 1 year ago

> Same error. I ran into this problem when I set TP=1 with bf16. By the way, do you also hit the assertion assert all_groups_norm > 0 when using bf16?

Yes, I hit the all_groups_norm assertion when training Llama 7B. @KenwayZZZ

au-revoir commented 10 months ago

> Same error. I ran into this problem when I set TP=1 with bf16. By the way, do you also hit the assertion assert all_groups_norm > 0 when using bf16?

> Yes, I hit the all_groups_norm assertion when training Llama 7B. @KenwayZZZ

Have you found a workaround for this? I get the same error when using GPT-J-6B.

Godricly commented 10 months ago

I'd suggest replacing the Adam implementation with the PyTorch one, which uses 64-bit int indexing.
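If it helps, here is a tiny smoke test (illustrative only: it just confirms the installed torch exposes the fused AdamW path on a CUDA device; it does not exercise the large-tensor indexing case the 64-bit suggestion is about):

# Fused-AdamW smoke test; requires a CUDA device and a torch build with fused support.
import torch

p = torch.nn.Parameter(torch.randn(1024, 1024, device="cuda"))
opt = torch.optim.AdamW([p], lr=1e-3, fused=True)
p.grad = torch.randn_like(p)
opt.step()
print(torch.__version__, "fused AdamW step OK")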

heroes999 commented 10 months ago

> Same error. I ran into this problem when I set TP=1 with bf16. By the way, do you also hit the assertion assert all_groups_norm > 0 when using bf16?

> Yes, I hit the all_groups_norm assertion when training Llama 7B. @KenwayZZZ

> Have you found a workaround for this? I get the same error when using GPT-J-6B.

Not yet. I've tried torch.AdamW(fused=True), but without luck; the all_groups_norm assertion still shows up after a few dozen steps. @au-revoir @Godricly

Godricly commented 9 months ago

Which version of torch were you using? I tried with a nightly 2.2.0 build before and it worked.