microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Grad_norm is nan and Loss is 0 #5347

Open xxtars opened 5 months ago

xxtars commented 5 months ago

**Describe the bug**
When training LLaMA-VID (stage 2, full fine-tuning of LLaMA) with deepspeed==0.14.0 and the transformers Trainer, grad_norm becomes NaN (or 1.414 with a smaller learning rate, pink line) and the loss drops to 0 after a few steps. This is the same issue as described in #5242, but on AMD GPUs. Following #5242, deepspeed==0.12.3 works normally. However, neither DeepSpeed version gives a significant training speedup when using multiple nodes.

grad_norm (circled points in the chart are NaN): (chart image)

| Step | Stage2_1node_ds0.14.0 |
| --- | --- |
| 37 | 1.6166185140609741 |
| 38 | 1.5178347826004028 |
| 39 | NaN |
| 40 | 1.434411883354187 |
| 41 | NaN |
| 42 | NaN |

| Step | Stage2_4node_ds0.14.0 |
| --- | --- |
| 0 | 23068.85156 |
| 1 | 24.89443588256836 |
| 2 | 22.727699279785156 |
| 3 | 23.45322036743164 |
| 4 | NaN |
| 5 | NaN |

| Step | Stage2_1node_ds0.14.0_smaller_lr |
| --- | --- |
| 113 | 1.500418203016511 |
| 114 | 1.0031307797956182 |
| 115 | 1.4142135623730951 |
| 116 | 1.4142135623730951 |
| 117 | 1.4142135623730951 |
| 118 | 1.4142135623730951 |

loss: (chart image)

training speed:

# one node, ds 0.14.0
[default0]:  0%|          | 11/5964 [05:15<46:24:59, 28.07s/it]
[default0]:  0%|          | 12/5964 [05:40<45:14:48, 27.37s/it]
[default0]:  0%|          | 13/5964 [06:05<43:52:20, 26.54s/it]
[default0]:  0%|          | 14/5964 [06:50<53:08:24, 32.15s/it]
[default0]:  0%|          | 15/5964 [07:16<50:11:33, 30.37s/it]
# four nodes, ds 0.14.0
[default0]:  0%|          | 11/5964 [04:11<36:46:44, 22.24s/it]
[default0]:  0%|          | 12/5964 [04:32<36:07:52, 21.85s/it]
[default0]:  0%|          | 13/5964 [04:49<33:38:57, 20.36s/it]
[default0]:  0%|          | 14/5964 [05:20<38:58:16, 23.58s/it]
[default0]:  0%|          | 15/5964 [05:44<39:29:56, 23.90s/it]

(Stage 1, which trains only the connector, works normally with either DeepSpeed version, including the expected speedup when training on multiple nodes.)

training speed:

# one node, ds 0.14.0
[default0]:  0%|          | 11/3086 [01:43<6:47:31,  7.95s/it]
[default0]:  0%|          | 12/3086 [01:51<6:58:28,  8.17s/it]
[default0]:  0%|          | 13/3086 [02:00<7:02:19,  8.25s/it]
[default0]:  0%|          | 14/3086 [02:06<6:33:13,  7.68s/it]
[default0]:  0%|          | 15/3086 [02:14<6:38:44,  7.79s/it]
# four nodes, ds 0.14.0
[default0]:  0%|          | 11/3086 [00:36<2:21:11,  2.76s/it]
[default0]:  0%|          | 12/3086 [00:39<2:27:00,  2.87s/it]
[default0]:  0%|          | 13/3086 [00:41<2:24:27,  2.82s/it]
[default0]:  0%|          | 14/3086 [00:43<2:12:09,  2.58s/it]
[default0]:  0%|          | 15/3086 [00:46<2:11:55,  2.58s/it]

I'm not sure whether the training speed is related to issue #5242, but I think it's abnormal, because with A100 GPUs I do get a significant speedup from multiple nodes.

**To Reproduce**
Steps to reproduce the behavior:

  1. My run script, launched with slurm: srun bash scripts/video/train/stage_2_full_v7b_224_fps_1_torchrun.sh:
    
    #!/bin/bash -e
    export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
    export MASTER_PORT=12345

    export GPUS_PER_NODE=$SLURM_GPUS_ON_NODE
    export MIOPEN_DEBUG_DISABLE_SQL_WAL=1
    export MIOPEN_USER_DB_PATH="~/.cache/$(whoami)-miopen-cache-$SLURM_NODEID"
    export MIOPEN_CUSTOM_CACHE_DIR=$MIOPEN_USER_DB_PATH

    # Set MIOpen cache to a temporary folder.
    if [ $SLURM_LOCALID -eq 0 ] ; then
        rm -rf $MIOPEN_USER_DB_PATH
        mkdir -p $MIOPEN_USER_DB_PATH
    fi
    sleep 2

    export MPICH_GPU_SUPPORT_ENABLED=1

    # Set interfaces to be used by RCCL.
    export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3
    export NCCL_NET_GDR_LEVEL=3

    export LAUNCHER="python -m torch.distributed.run \
        --nproc_per_node $GPUS_PER_NODE \
        --nnodes $SLURM_NNODES \
        --node_rank $SLURM_PROCID \
        --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
        --rdzv_backend c10d \
        --max_restarts 0 \
        --tee 3 \
        "

    export Stage2_CMD=" \
        llamavid/train/train_mem.py \
        --deepspeed ./scripts/zero2.json \
        --model_name_or_path model_zoo/LLM/llama2/Llama-2-7b-chat-hf \
        --version imgsp_llama_2 \
        --data_path ./data/LLaMA-VID-Finetune/llava_v1_5_mix665k_with_video_chatgpt.json \
        --image_folder ./data/LLaMA-VID-Finetune \
        --video_folder ./data/LLaMA-VID-Finetune \
        --vision_tower model_zoo/LAVIS/eva_vit_g.pth \
        --image_processor ./llamavid/processor/clip-patch14-224 \
        --pretrain_mm_mlp_adapter ./work_dirs/llama2-vid-7b-pretrain-224-video-fps-1/mm_projector.bin \
        --mm_projector_type mlp2x_gelu \
        --mm_vision_select_layer -2 \
        --mm_use_im_start_end False \
        --mm_use_im_patch_token False \
        --image_aspect_ratio pad \
        --group_by_modality_length True \
        --video_fps 1 \
        --bert_type "qformer_pretrain" \
        --num_query 32 \
        --compress_type "mean" \
        --bf16 True \
        --output_dir ./work_dirs/llama2-vid-7b-full-224-video-fps-1-torchrun \
        --num_train_epochs 1 \
        --per_device_train_batch_size 2 \
        --per_device_eval_batch_size 2 \
        --gradient_accumulation_steps 8 \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 1000 \
        --save_total_limit 1 \
        --learning_rate 2e-5 \
        --weight_decay 0. \
        --warmup_ratio 0.03 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --model_max_length 2048 \
        --gradient_checkpointing True \
        --dataloader_num_workers 4 \
        --lazy_preprocess True \
        --report_to wandb \
        --run_name LUMI_Stage2_LLaMA "

    # 8 x 4 x 4 = 128
    # 8 x 2 x 8 = 128
    # 32 x 2 x 2 = 128

    bash -c "$LAUNCHER $Stage2_CMD"
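
For reference, the run command points DeepSpeed at `./scripts/zero2.json`, which is not reproduced here; a HuggingFace-Trainer-style ZeRO-2 config of that kind typically looks roughly like the sketch below (values are assumptions, not the repo's actual file):

```json
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```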


**Expected behavior**
Grad_norm != nan and loss != 0

**ds_report output**

ds 0.14.0

[2024-04-02 08:21:58,307] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn is not compatible with ROCM
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/xxx/miniconda3/envs/llamavid_rocm5.6/lib/python3.10/site-packages/torch']
torch version .................... 2.2.2+rocm5.6
deepspeed install path ........... ['/xxx/miniconda3/envs/llamavid_rocm5.6/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.14.0, unknown, unknown
torch cuda version ............... None
torch hip version ................ 5.6.31061-8c743ae5d
nvcc version ..................... None
deepspeed wheel compiled w. ...... torch 2.2, hip 5.6
shared memory (/dev/shm) size .... 427.71 GB



**System info (please complete the following information):**
 - OS: SUSE Linux Enterprise Server 15 SP4
 - GPU count and types: one or four machines with x8 MI250X each
 - Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
 - Python version: 3.10.14
 - transformers==4.39.2

**Launcher context**
slurm -> torch.distributed.run

**Additional context**
**Following #5242 and using ds 0.12.3**, one node:
**grad_norm:**
![W B Chart 2024_4_2 14_16_49](https://github.com/microsoft/DeepSpeed/assets/49916044/127ba495-f3b1-4572-8ebe-a3577e56460d)
**loss:**
![W B Chart 2024_4_2 14_16_57](https://github.com/microsoft/DeepSpeed/assets/49916044/67a29fb0-9e19-4810-bb35-16baa9625ce6)
efsotr commented 3 months ago

Setting overlap_comm to False can avoid this problem.
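
In a ZeRO config like the `zero2.json` referenced above, this is the `overlap_comm` key under `zero_optimization`, roughly as sketched here (other keys unchanged):

```json
{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": false
  }
}
```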

GuanhuaWang commented 3 months ago

Hi @xxtars, we noticed this accuracy issue in 0.14.0 (some of our users also fell back to 0.12.3) and made several accuracy fixes afterwards. Could you try 0.14.2? Thanks.

weimakeit commented 2 months ago

> Setting overlap_comm to False can avoid this problem.

This works in my multi-node training scenario.

jianshuod commented 23 hours ago

I have checked 0.15.0 and the problem still exists there. Another workaround is to increase the bucket size; for example, increasing it from 5e7 to 2e8 can help tackle the problem.
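
In a ZeRO-2 config this corresponds to the bucket-size keys under `zero_optimization`, e.g. as in this sketch (whether `reduce_bucket_size`, `allgather_bucket_size`, or both need raising may depend on your setup):

```json
{
  "zero_optimization": {
    "stage": 2,
    "reduce_bucket_size": 2e8,
    "allgather_bucket_size": 2e8
  }
}
```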