xxtars opened this issue 5 months ago
Setting overlap_comm to False can avoid this problem.
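For anyone looking for where that switch lives: `overlap_comm` is a ZeRO stage-2 option in the DeepSpeed config JSON (the file passed via `--deepspeed`, e.g. `./scripts/zero2.json` in the repro below). A minimal sketch of the change, with the surrounding values as illustrative placeholders rather than the exact file from the repo:

```json
{
  "bf16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": false,
    "contiguous_gradients": true
  }
}
```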
Hi @xxtars, we noticed this accuracy issue in 0.14.0 (some of our users also fell back to 0.12.3) and made several accuracy fixes later on. Could you try 0.14.2? Thanks.
> Setting overlap_comm to False can avoid this problem.
This works in my multi-node training scenario.
I have checked 0.15.0 and the problem still exists. Another workaround is to increase the bucket size; for example, increasing it from 5e7 to 2e8 can help tackle the problem.
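In config terms, I take "bucket size" here to mean the ZeRO `reduce_bucket_size` / `allgather_bucket_size` values (the comment does not name the exact key, so this is my reading). A sketch of that second workaround, keeping `overlap_comm` enabled and bumping the buckets from 5e7 to 2e8:

```json
{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "reduce_bucket_size": 2e8,
    "allgather_bucket_size": 2e8
  }
}
```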
**Describe the bug**
When training LLaMA-VID (stage 2, full fine-tuning of LLaMA) with deepspeed==0.14.0 and the transformers Trainer, grad_norm becomes NaN (or 1.414 with a smaller lr, pink line) and the loss drops to 0 after a few steps. This is the same issue as described in #5242, but on AMD GPUs. Following #5242, deepspeed==0.12.3 works normally. However, neither version of ds gives a significant training speedup when using multiple nodes.
loss:
training speed:
(Stage 1, training the connector, works normally in all cases, including the multi-node training speedup, with either ds version.)
training speed:
I'm not sure whether the training speed is related to issue #5242, but it seems abnormal, because with A100s and multiple nodes I can achieve a significant speed improvement.
**To Reproduce**
Steps to reproduce the behavior:
```
srun bash scripts/video/train/stage_2_full_v7b_224_fps_1_torchrun.sh
```

where `stage_2_full_v7b_224_fps_1_torchrun.sh` is:

```bash
export GPUS_PER_NODE=$SLURM_GPUS_ON_NODE

export MIOPEN_DEBUG_DISABLE_SQL_WAL=1
export MIOPEN_USER_DB_PATH="~/.cache/$(whoami)-miopen-cache-$SLURM_NODEID"
export MIOPEN_CUSTOM_CACHE_DIR=$MIOPEN_USER_DB_PATH

# Set MIOpen cache to a temporary folder.
if [ $SLURM_LOCALID -eq 0 ] ; then
    rm -rf $MIOPEN_USER_DB_PATH
    mkdir -p $MIOPEN_USER_DB_PATH
fi
sleep 2

export MPICH_GPU_SUPPORT_ENABLED=1

# Set interfaces to be used by RCCL.
export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3
export NCCL_NET_GDR_LEVEL=3

export LAUNCHER="python -m torch.distributed.run \
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $SLURM_NNODES \
    --node_rank $SLURM_PROCID \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    --rdzv_backend c10d \
    --max_restarts 0 \
    --tee 3 \
    "

export Stage2_CMD=" \
    llamavid/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path model_zoo/LLM/llama2/Llama-2-7b-chat-hf \
    --version imgsp_llama_2 \
    --data_path ./data/LLaMA-VID-Finetune/llava_v1_5_mix665k_with_video_chatgpt.json \
    --image_folder ./data/LLaMA-VID-Finetune \
    --video_folder ./data/LLaMA-VID-Finetune \
    --vision_tower model_zoo/LAVIS/eva_vit_g.pth \
    --image_processor ./llamavid/processor/clip-patch14-224 \
    --pretrain_mm_mlp_adapter ./work_dirs/llama2-vid-7b-pretrain-224-video-fps-1/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --video_fps 1 \
    --bert_type "qformer_pretrain" \
    --num_query 32 \
    --compress_type "mean" \
    --bf16 True \
    --output_dir ./work_dirs/llama2-vid-7b-full-224-video-fps-1-torchrun \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb \
    --run_name LUMI_Stage2_LLaMA "

# 8 x 4 x 4 = 128
# 8 x 2 x 8 = 128
# 32 x 2 x 2 = 128

bash -c "$LAUNCHER $Stage2_CMD"
```
Environment report (ds 0.14.0):
```
[2024-04-02 08:21:58,307] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn is not compatible with ROCM
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/xxx/miniconda3/envs/llamavid_rocm5.6/lib/python3.10/site-packages/torch']
torch version .................... 2.2.2+rocm5.6
deepspeed install path ........... ['/xxx/miniconda3/envs/llamavid_rocm5.6/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.14.0, unknown, unknown
torch cuda version ............... None
torch hip version ................ 5.6.31061-8c743ae5d
nvcc version ..................... None
deepspeed wheel compiled w. ...... torch 2.2, hip 5.6
shared memory (/dev/shm) size .... 427.71 GB
```