microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Grad_norm is nan and Loss is 0 #5347

Open xxtars opened 5 months ago

xxtars commented 5 months ago

**Describe the bug**
When training LLaMA-VID (stage 2, full fine-tuning of LLaMA) with deepspeed==0.14.0 and the transformers Trainer, grad_norm becomes NaN (or 1.414 with a smaller learning rate, pink line) and the loss drops to 0 after a few steps. This is the same issue as described in #5242, but on AMD GPUs. Following #5242, deepspeed==0.12.3 works normally. However, neither DeepSpeed version gives a significant training speedup when using multiple nodes.

grad_norm (circled points in the chart are NaN): (chart image)

| Step | Stage2_1node_ds0.14.0 |
| --- | --- |
| 37 | 1.6166185140609741 |
| 38 | 1.5178347826004028 |
| 39 | NaN |
| 40 | 1.434411883354187 |
| 41 | NaN |
| 42 | NaN |

| Step | Stage2_4node_ds0.14.0 |
| --- | --- |
| 0 | 23068.85156 |
| 1 | 24.89443588256836 |
| 2 | 22.727699279785156 |
| 3 | 23.45322036743164 |
| 4 | NaN |
| 5 | NaN |

| Step | Stage2_1node_ds0.14.0_smaller_lr |
| --- | --- |
| 113 | 1.500418203016511 |
| 114 | 1.0031307797956182 |
| 115 | 1.4142135623730951 |
| 116 | 1.4142135623730951 |
| 117 | 1.4142135623730951 |
| 118 | 1.4142135623730951 |

loss: (chart image)

training speed:

# one node, ds 0.14.0
[default0]:  0%|          | 11/5964 [05:15<46:24:59, 28.07s/it]
[default0]:  0%|          | 12/5964 [05:40<45:14:48, 27.37s/it]
[default0]:  0%|          | 13/5964 [06:05<43:52:20, 26.54s/it]
[default0]:  0%|          | 14/5964 [06:50<53:08:24, 32.15s/it]
[default0]:  0%|          | 15/5964 [07:16<50:11:33, 30.37s/it]
# four nodes, ds 0.14.0
[default0]:  0%|          | 11/5964 [04:11<36:46:44, 22.24s/it]
[default0]:  0%|          | 12/5964 [04:32<36:07:52, 21.85s/it]
[default0]:  0%|          | 13/5964 [04:49<33:38:57, 20.36s/it]
[default0]:  0%|          | 14/5964 [05:20<38:58:16, 23.58s/it]
[default0]:  0%|          | 15/5964 [05:44<39:29:56, 23.90s/it]

(Stage 1, which trains only the connector, works normally with either DeepSpeed version, including the expected speedup when training on multiple nodes.)

training speed:

# one node, ds 0.14.0
[default0]:  0%|          | 11/3086 [01:43<6:47:31,  7.95s/it]
[default0]:  0%|          | 12/3086 [01:51<6:58:28,  8.17s/it]
[default0]:  0%|          | 13/3086 [02:00<7:02:19,  8.25s/it]
[default0]:  0%|          | 14/3086 [02:06<6:33:13,  7.68s/it]
[default0]:  0%|          | 15/3086 [02:14<6:38:44,  7.79s/it]
# four nodes, ds 0.14.0
[default0]:  0%|          | 11/3086 [00:36<2:21:11,  2.76s/it]
[default0]:  0%|          | 12/3086 [00:39<2:27:00,  2.87s/it]
[default0]:  0%|          | 13/3086 [00:41<2:24:27,  2.82s/it]
[default0]:  0%|          | 14/3086 [00:43<2:12:09,  2.58s/it]
[default0]:  0%|          | 15/3086 [00:46<2:11:55,  2.58s/it]

I'm not sure whether the training speed is related to issue #5242, but I think it's abnormal, because with A100 GPUs I do get a significant speedup from multiple nodes.

**To Reproduce**
Steps to reproduce the behavior:

  1. My run script, launched with slurm: srun bash scripts/video/train/stage_2_full_v7b_224_fps_1_torchrun.sh:
    
    #!/bin/bash -e
    export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
    export MASTER_PORT=12345

    export GPUS_PER_NODE=$SLURM_GPUS_ON_NODE
    export MIOPEN_DEBUG_DISABLE_SQL_WAL=1
    export MIOPEN_USER_DB_PATH="~/.cache/$(whoami)-miopen-cache-$SLURM_NODEID"
    export MIOPEN_CUSTOM_CACHE_DIR=$MIOPEN_USER_DB_PATH

    # Set MIOpen cache to a temporary folder.
    if [ $SLURM_LOCALID -eq 0 ] ; then
        rm -rf $MIOPEN_USER_DB_PATH
        mkdir -p $MIOPEN_USER_DB_PATH
    fi
    sleep 2

    export MPICH_GPU_SUPPORT_ENABLED=1

    # Set interfaces to be used by RCCL.
    export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3
    export NCCL_NET_GDR_LEVEL=3

    export LAUNCHER="python -m torch.distributed.run \
        --nproc_per_node $GPUS_PER_NODE \
        --nnodes $SLURM_NNODES \
        --node_rank $SLURM_PROCID \
        --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
        --rdzv_backend c10d \
        --max_restarts 0 \
        --tee 3 \
        "

    export Stage2_CMD=" \
        llamavid/train/train_mem.py \
        --deepspeed ./scripts/zero2.json \
        --model_name_or_path model_zoo/LLM/llama2/Llama-2-7b-chat-hf \
        --version imgsp_llama_2 \
        --data_path ./data/LLaMA-VID-Finetune/llava_v1_5_mix665k_with_video_chatgpt.json \
        --image_folder ./data/LLaMA-VID-Finetune \
        --video_folder ./data/LLaMA-VID-Finetune \
        --vision_tower model_zoo/LAVIS/eva_vit_g.pth \
        --image_processor ./llamavid/processor/clip-patch14-224 \
        --pretrain_mm_mlp_adapter ./work_dirs/llama2-vid-7b-pretrain-224-video-fps-1/mm_projector.bin \
        --mm_projector_type mlp2x_gelu \
        --mm_vision_select_layer -2 \
        --mm_use_im_start_end False \
        --mm_use_im_patch_token False \
        --image_aspect_ratio pad \
        --group_by_modality_length True \
        --video_fps 1 \
        --bert_type "qformer_pretrain" \
        --num_query 32 \
        --compress_type "mean" \
        --bf16 True \
        --output_dir ./work_dirs/llama2-vid-7b-full-224-video-fps-1-torchrun \
        --num_train_epochs 1 \
        --per_device_train_batch_size 2 \
        --per_device_eval_batch_size 2 \
        --gradient_accumulation_steps 8 \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 1000 \
        --save_total_limit 1 \
        --learning_rate 2e-5 \
        --weight_decay 0. \
        --warmup_ratio 0.03 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --model_max_length 2048 \
        --gradient_checkpointing True \
        --dataloader_num_workers 4 \
        --lazy_preprocess True \
        --report_to wandb \
        --run_name LUMI_Stage2_LLaMA "

    # 8 x 4 x 4 = 128
    # 8 x 2 x 8 = 128
    # 32 x 2 x 2 = 128

    bash -c "$LAUNCHER $Stage2_CMD"
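
For reference, the run command points DeepSpeed at `./scripts/zero2.json`, which is not reproduced here; a HuggingFace-Trainer-style ZeRO-2 config of that kind typically looks roughly like the sketch below (values are assumptions, not the repo's actual file):

```json
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```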


**Expected behavior**
Grad_norm != nan and loss != 0

**ds_report output**

ds 0.14.0

[2024-04-02 08:21:58,307] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn is not compatible with ROCM
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/xxx/miniconda3/envs/llamavid_rocm5.6/lib/python3.10/site-packages/torch']
torch version .................... 2.2.2+rocm5.6
deepspeed install path ........... ['/xxx/miniconda3/envs/llamavid_rocm5.6/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.14.0, unknown, unknown
torch cuda version ............... None
torch hip version ................ 5.6.31061-8c743ae5d
nvcc version ..................... None
deepspeed wheel compiled w. ...... torch 2.2, hip 5.6
shared memory (/dev/shm) size .... 427.71 GB



**System info (please complete the following information):**
 - OS: SUSE Linux Enterprise Server 15 SP4
 - GPU count and types: one or four machines with x8 MI250X each
 - Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
 - Python version: 3.10.14
 - transformers==4.39.2

**Launcher context**
slurm -> torch.distributed.run

**Additional context**
**Following #5242 and using ds 0.12.3**, one node:
**grad_norm:**
![W B Chart 2024_4_2 14_16_49](https://github.com/microsoft/DeepSpeed/assets/49916044/127ba495-f3b1-4572-8ebe-a3577e56460d)
**loss:**
![W B Chart 2024_4_2 14_16_57](https://github.com/microsoft/DeepSpeed/assets/49916044/67a29fb0-9e19-4810-bb35-16baa9625ce6)
efsotr commented 3 months ago

Setting overlap_comm to False can avoid this problem.
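
In a ZeRO config like the `zero2.json` referenced above, this is the `overlap_comm` key under `zero_optimization`, roughly as sketched here (other keys unchanged):

```json
{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": false
  }
}
```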

GuanhuaWang commented 3 months ago

Hi @xxtars, we noticed this accuracy issue in 0.14.0 (some of our users also fell back to 0.12.3) and made several accuracy fixes afterwards. Could you try 0.14.2? Thanks.

weimakeit commented 2 months ago

> Setting overlap_comm to False can avoid this problem.

This works in my multi-node training scenario.

jianshuod commented 23 hours ago

I have checked 0.15.0 and the problem still exists there. Another workaround is to increase the bucket size; for example, increasing it from 5e7 to 2e8 can help tackle the problem.
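
In a ZeRO-2 config this corresponds to the bucket-size keys under `zero_optimization`, e.g. as in this sketch (whether `reduce_bucket_size`, `allgather_bucket_size`, or both need raising may depend on your setup):

```json
{
  "zero_optimization": {
    "stage": 2,
    "reduce_bucket_size": 2e8,
    "allgather_bucket_size": 2e8
  }
}
```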