hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[BUG]: Fine-tune Colossal-LLaMA-2 error #4994

Open xielinzhen opened 8 months ago

xielinzhen commented 8 months ago

🐛 Describe the bug

I run the following command in `ColossalAI/examples/language/llama2`:

```bash
colossalai run --nproc_per_node 8 finetune.py \
    --plugin "gemini_auto" \
    --dataset "/home/pdl/xlz/ColossalAI/data" \
    --model_path "/home/pdl/xlz/pretrain_weights/Colossal-LLaMA-2-7b-base" \
    --task_name "qaAll_final.jsonl" \
    --save_dir "./output" \
    --flash_attention \
    --max_length 2048 \
    --batch_size 1
```

and it fails with the error below:


```
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
/home/pdl/anaconda3/envs/xlz_2/lib/python3.10/site-packages/colossalai/kernel/cuda_native/mha/flash_attn_2.py:28: UserWarning: please install flash_attn from https://github.com/HazyResearch/flash-attention
  warnings.warn("please install flash_attn from https://github.com/HazyResearch/flash-attention")
/home/pdl/anaconda3/envs/xlz_2/lib/python3.10/site-packages/colossalai/kernel/cuda_native/mha/mem_eff_attn.py:15: UserWarning: please install xformers from https://github.com/facebookresearch/xformers
  warnings.warn("please install xformers from https://github.com/facebookresearch/xformers")
[... the two warnings above are repeated once per rank (8 ranks in total) ...]
/home/pdl/anaconda3/envs/xlz_2/lib/python3.10/site-packages/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.
  warnings.warn("`config` is deprecated and will be removed soon.")
[10/31/23 17:01:16] INFO colossalai - colossalai - INFO: /home/pdl/anaconda3/envs/xlz_2/lib/python3.10/site-packages/colossalai/initialize.py:63 launch
                    INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 8
[... the launch message above is repeated once per rank (8 ranks in total) ...]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=True`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
[... the tokenizer warning above is repeated once per rank (8 ranks in total) ...]
Model params: 6.56 B
Booster init max CUDA memory: 2828.09 MB
Booster init max CPU memory: 9382.59 MB
Epoch 0:   0%|          | 0/337 [00:00<?, ?it/s]finish_collection 582
[... "finish_collection 582" is printed once per rank (8 ranks in total) ...]
Epoch 0:   4%|▎         | 12/337 [01:56<48:44,  9.00s/it, loss=105]WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 33042 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 33043 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 33044 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 33046 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 33047 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 33048 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 33049 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 3 (pid: 33045) of binary: /home/pdl/anaconda3/envs/xlz_2/bin/python
Traceback (most recent call last):
  File "/home/pdl/anaconda3/envs/xlz_2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/pdl/anaconda3/envs/xlz_2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/pdl/anaconda3/envs/xlz_2/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/pdl/anaconda3/envs/xlz_2/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/pdl/anaconda3/envs/xlz_2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/pdl/anaconda3/envs/xlz_2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
finetune.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-10-31_17:07:10
  host      : node1280
  rank      : 3 (local_rank: 3)
  exitcode  : -9 (pid: 33045)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 33045
======================================================
Error: failed to run torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 finetune.py --plugin gemini_auto --dataset /home/pdl/xlz/ColossalAI/data --model_path /home/pdl/xlz/pretrain_weights/Colossal-LLaMA-2-7b-base --task_name qaAll_final.jsonl --save_dir ./output --flash_attention --max_length 2048 --batch_size 1 on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!

Command: 'cd /home/pdl/xlz/ColossalAI/examples/language/llama2 && export [... full shell environment elided ...] && torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 finetune.py --plugin gemini_auto --dataset /home/pdl/xlz/ColossalAI/data --model_path /home/pdl/xlz/pretrain_weights/Colossal-LLaMA-2-7b-base --task_name qaAll_final.jsonl --save_dir ./output --flash_attention --max_length 2048 --batch_size 1'

Exit code: 1

Stdout: already printed

Stderr: already printed

====== Training on All Nodes =====
127.0.0.1: failure

====== Stopping All Nodes =====
127.0.0.1: finish
```
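
A note on the failure mode: exit code -9 means rank 3 (pid 33045) did not raise a Python exception but was killed with SIGKILL, and on a Linux host that is most often the kernel OOM killer reclaiming exhausted host RAM. That would be consistent with the `gemini_auto` plugin, whose auto placement policy can offload parameter and optimizer chunks to CPU memory (over 9 GB of max CPU memory is already reported at booster init, before any optimizer step). A quick check, assuming readable kernel logs (the grep pattern is a heuristic, not exhaustive):

```bash
# If the OOM killer ended pid 33045, the kernel log should contain a line
# like "Out of memory: Killed process 33045 (python)".
dmesg -T | grep -iE "out of memory|killed process" | tail -n 20

# Watching host memory during the first training steps shows whether
# CPU offloading is what exhausts RAM.
watch -n 1 free -h
```

If host RAM is indeed the bottleneck, adding swap, running fewer ranks per node, or trying a placement policy that keeps more state on the GPU are the usual mitigations.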

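Separately, although `--flash_attention` is passed, every rank warns that neither `flash_attn` nor `xformers` is installed, so the run falls back to an unfused attention path. That is unrelated to the crash, but installing the two optional packages should silence the warnings; a sketch, with versions unpinned (`flash-attn` builds CUDA extensions at install time, so it needs a toolchain matching the installed torch/CUDA):

```bash
# flash-attn's README recommends installing without build isolation;
# xformers ships prebuilt wheels for common torch/CUDA combinations.
pip install flash-attn --no-build-isolation
pip install xformers
```
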
### Environment

CUDA 11.7
cuDNN 8904
NCCL 2.14.3
Python 3.10
PyTorch 1.13.1

Running `bash gemini_auto.sh` in `ColossalAI/examples/language/llama2/scripts/benchmark_7B` succeeded.
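
For reference, all of the versions above can be collected in one shot from the active conda environment; a minimal sketch (`torch.backends.cudnn.version()` returns an integer such as 8904, and `torch.cuda.nccl.version()` returns a tuple such as (2, 14, 3)):

```bash
# Print the toolkit versions from the active environment.
python -c "import sys, torch; \
print('python', sys.version.split()[0]); \
print('torch ', torch.__version__); \
print('cuda  ', torch.version.cuda); \
print('cudnn ', torch.backends.cudnn.version()); \
print('nccl  ', torch.cuda.nccl.version())"
```
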
JThh commented 8 months ago

The error trace isn't very informative. Where is `finish_collection 582` printed from? Did you add any code that could cause the program to terminate? One way to locate that print is sketched below.
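
Grepping both the installed package and the local checkout should find the source of that line; a sketch using the paths from the log above:

```bash
# Find where "finish_collection" is printed: either the installed
# colossalai package or the example code in the local checkout.
grep -rn "finish_collection" \
  /home/pdl/anaconda3/envs/xlz_2/lib/python3.10/site-packages/colossalai \
  /home/pdl/xlz/ColossalAI
```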