tatsu-lab / stanford_alpaca

Code and documentation to train Stanford's Alpaca models, and generate the data.
https://crfm.stanford.edu/2023/03/13/alpaca.html
Apache License 2.0

error of multi-GPU: torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 #162

Open xiaoweiweixiao opened 1 year ago

xiaoweiweixiao commented 1 year ago

When I use four GPUs to train the model, I get this error. Can anybody help me solve it? Thank you very much.

WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 77807 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 77808 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 77809 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 77806) of binary: /home/la/anaconda3/envs/alpaca_torch/bin/python
Traceback (most recent call last):
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
train.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-30_20:18:47
  host      : guest-server
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 77806)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 77806
======================================================
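Exit code -9 means the worker received SIGKILL, which on Linux is most often the kernel OOM killer reacting to exhausted host RAM rather than a bug in train.py. A quick way to check, assuming a Linux host where you can read the kernel log (commands shown only for illustration):

dmesg -T | grep -iE 'killed process|out of memory'   # look for the OOM killer reaping the rank
free -h                                              # watch available host RAM while training starts

If the OOM killer shows up here, the remedies discussed later in this thread (more RAM, fewer processes, or DeepSpeed's sharded initialization) are the relevant ones.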
xiaoweiweixiao commented 1 year ago

I also get this error when running other code on multiple GPUs. Can anyone help?

kasakh commented 1 year ago

Can you show the command you used to train in the multi-GPU environment?

xiaoweiweixiao commented 1 year ago

python -m torch.distributed.run --nproc_per_node=4 --master_port=11110 train.py \
--model_name_or_path ./output/path \
--data_path ./alpaca_data.json \
--fp16 True \
--output_dir ./pretrained \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 False

ZiweiWangTHU commented 1 year ago

Same problem. Have you found a solution?

FinalFlowers commented 1 year ago

Same problem; hoping for an answer.

optimist-lsc commented 1 year ago

Please attempt to install this specific version of transformers:

pip install git+https://github.com/zphang/transformers.git
cd transformers
git --reset hard 68d640f7c368bcaaaecfc678f11908ebbd3d6176
python setup.py install

xiaoweiweixiao commented 1 year ago

git --reset hard 68d640f7c368bcaaaecfc678f11908ebbd3d6176

Thank you for your advice. There is no setup.py in the transformers repo, only a README.md, so I cannot install transformers.

xv994 commented 1 year ago

pip install git+https://github.com/zphang/transformers.git
cd transformers
git --reset hard 68d640f7c368bcaaaecfc678f11908ebbd3d6176
pip install .

You can try it; it solved the problem for me.

xiaoweiweixiao commented 1 year ago

pip install git+https://github.com/zphang/transformers.git
cd transformers
git --reset hard 68d640f7c368bcaaaecfc678f11908ebbd3d6176
pip install .

You can try it; it solved the problem for me.

Thank you for your advice. I cannot run "pip install git+https://github.com/zphang/transformers.git"; I get this error:

Collecting git+https://github.com/zphang/transformers.git
  Cloning https://github.com/zphang/transformers.git to /tmp/pip-req-build-8bfk9e3m
  Running command git clone --quiet https://github.com/zphang/transformers.git /tmp/pip-req-build-8bfk9e3m
  Resolved https://github.com/zphang/transformers.git to commit 63a9d6745f679b2eb882e0f147828380981111fa
ERROR: git+https://github.com/zphang/transformers.git does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.

I downloaded transformers from https://github.com/zphang/transformers, ran "cd transformers" and then "git --reset hard 68d640f7c368bcaaaecfc678f11908ebbd3d6176", and got this error:

Unknown option: --reset
usage: git [--version] [--help] [-C <path>] [-c name=value]
           [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
           [-p | --paginate | --no-pager] [--no-replace-objects] [--bare]
           [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
           <command> [<args>]

Am I doing something wrong?

xv994 commented 1 year ago

Sorry, my suggestion last time was wrong; you cloned the wrong transformers repository. Please try the following commands, which are exactly what I ran:

git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout 0041be5
pip install .

If you read Chinese, you can also follow this link: https://zhuanlan.zhihu.com/p/618321077 . I followed those steps and succeeded.

xiaoweiweixiao commented 1 year ago

pip install .

Thank you very much. My problem was solved by following your suggestion.

xiaoweiweixiao commented 1 year ago

0041be5

I have another question. I get the same error when running other models, and this method does not fix it there. I guess the commit "0041be5" should change when running other models (such as GLM130B). How do I figure out which commit to check out instead of "0041be5"?

xv994 commented 1 year ago

I think you may need a different Python virtual environment for each model you train. I don't know which transformers version GLM130B needs, so you had better ask its developers or read their guide.
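As a rough illustration of that suggestion (the environment name and requirements file below are hypothetical, not taken from GLM130B's documentation), one isolated environment per model could look like:

conda create -n glm130b python=3.10 -y
conda activate glm130b
pip install -r requirements.txt   # whatever dependency pins that project ships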

zhihui-shao commented 1 year ago

I followed these steps, but it still does not work. Is there any other solution?

codemaster17611 commented 1 year ago

Can you tell me the machine configuration on which you successfully ran train.py? I am hitting the same problem and have no idea why.

xv994 commented 1 year ago

Can you tell me the machine configuration on which you successfully ran train.py? I am hitting the same problem and have no idea why.

Can you show the error you got?

xv994 commented 1 year ago

I followed these steps, but it still does not work. Is there any other solution?

What exactly is the problem?

codemaster17611 commented 1 year ago

[screenshot of the error]

codemaster17611 commented 1 year ago

Can you tell me the machine configuration on which you successfully ran train.py? I am hitting the same problem and have no idea why.

Can you show the error you got?

It always shows exitcode -9. My config: 3x V100 16GB GPUs, 128GB CPU RAM. Is the RAM not enough? Thanks for your reply.

codemaster17611 commented 1 year ago

But I also monitored the RAM usage.

Can you tell me the machine configuration on which you successfully ran train.py? I am hitting the same problem and have no idea why.

Can you show the error you got?

It always shows exitcode -9. My config: 3x V100 16GB GPUs, 128GB CPU RAM. Is the RAM not enough? Thanks for your reply.

But I only see about 70% of the RAM being used on the machine.

xv994 commented 1 year ago

[screenshot of the error]

Oh, my friend, that is not the root cause; you should show me the exception above it. And your RAM is enough; my machine has less than yours.

xv994 commented 1 year ago

Have you ever tried this:

git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout 0041be5
pip install .

Maybe it works.
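After the pip install . step above, it can help to confirm which build is actually active in the environment that launches torchrun (version strings such as 4.28.0.dev0 vs 4.29.0.dev0 come up later in this thread):

python -c "import transformers; print(transformers.__version__)"
pip show transformers   # confirm the installed version and where it was installed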

Difang233 commented 1 year ago

Have you ever tried this:

git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout 0041be5
pip install .

Maybe it works.

Hi, I have tried this method but still get this problem. Do you have any idea about it? The transformers version I used is 4.29.0.dev0. Thanks in advance!

2023-04-26 07:19:29.474990: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[2023-04-26 07:19:35,696] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-04-26 07:19:53,218] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100% 33/33 [01:08<00:00,  2.09s/it]
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
Using /root/.cache/torch_extensions/py39_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.060922384262085 seconds
Using /root/.cache/torch_extensions/py39_cu118 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py39_cu118/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.49018430709838867 seconds
Parameter Offload: Total persistent parameters: 0 in 0 params
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 9574) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=====================================================
/content/drive/MyDrive/codealpaca/train.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-26_07:22:29
  host      : 56de1ccd4f0e
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 9574)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 9574
=====================================================
codemaster17611 commented 1 year ago

Have you ever tried this:

git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout 0041be5
pip install .

Maybe it works.

Thanks for your reply. I have followed your zhihu steps one by one.

My transformers version:

(llmenv3) [xlwu@mochinelearning transformers]$ git checkout 0041be5
HEAD is now at 0041be5b3 LLaMA Implementation (#21955)

Then I ran the training script:

TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO LOGLEVEL=INFO CUDA_VISIBLE_DEVICES=0,1,2 torchrun \
--nproc_per_node=3 \
--master_port=25001 train.py \
--model_name_or_path /DATA/cdisk/xlwu_workspace/pretrain_model/hf-llama-model/llama-7b \
--data_path /DATA/cdisk/xlwu_workspace/data/test.json \
--output_dir /DATA/cdisk/xlwu_workspace/output/alpaca/sft_7b \
--per_device_eval_batch_size 1 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "tensorboard" \
--gradient_checkpointing True \
--fp16 True \
--deepspeed ds_config.json 

The error shows:

    WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : train.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 3
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:25001
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 0
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_6we5ifkl/none_yi1eb8vm
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
[I socket.cpp:522] [c10d] The server socket has started to listen on [::]:25001.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:22302.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:22304.
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=25001
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2]
  role_ranks=[0, 1, 2]
  global_ranks=[0, 1, 2]
  role_world_sizes=[3, 3, 3]
  global_world_sizes=[3, 3, 3]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_6we5ifkl/none_yi1eb8vm/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_6we5ifkl/none_yi1eb8vm/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_6we5ifkl/none_yi1eb8vm/attempt_0/2/error.json
[2023-04-26 16:16:46,603] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29566.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29568.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29570.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29572.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29574.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29576.
[I ProcessGroupNCCL.cpp:751] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:587] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 1] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 2] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 2] NCCL watchdog thread started!
mochinelearning:3293412:3293412 [0] NCCL INFO Bootstrap : Using eth0:10.1.118.59<0>
mochinelearning:3293412:3293412 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mochinelearning:3293412:3293412 [0] NCCL INFO NET/IB : No device found.
mochinelearning:3293412:3293412 [0] NCCL INFO NET/Socket : Using [0]eth0:10.1.118.59<0> [1]br-e7526d7e228d:192.168.16.1<0> [2]br-1e30049ed87b:192.168.32.1<0> [3]br-100d386c1c63:192.168.48.1<0>
mochinelearning:3293412:3293412 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
mochinelearning:3293413:3293413 [1] NCCL INFO Bootstrap : Using eth0:10.1.118.59<0>
mochinelearning:3293414:3293414 [2] NCCL INFO Bootstrap : Using eth0:10.1.118.59<0>
mochinelearning:3293414:3293414 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mochinelearning:3293413:3293413 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mochinelearning:3293414:3293414 [2] NCCL INFO NET/IB : No device found.
mochinelearning:3293413:3293413 [1] NCCL INFO NET/IB : No device found.
mochinelearning:3293414:3293414 [2] NCCL INFO NET/Socket : Using [0]eth0:10.1.118.59<0> [1]br-e7526d7e228d:192.168.16.1<0> [2]br-1e30049ed87b:192.168.32.1<0> [3]br-100d386c1c63:192.168.48.1<0>
mochinelearning:3293413:3293413 [1] NCCL INFO NET/Socket : Using [0]eth0:10.1.118.59<0> [1]br-e7526d7e228d:192.168.16.1<0> [2]br-1e30049ed87b:192.168.32.1<0> [3]br-100d386c1c63:192.168.48.1<0>
mochinelearning:3293414:3293414 [2] NCCL INFO Using network Socket
mochinelearning:3293413:3293413 [1] NCCL INFO Using network Socket
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 00/04 :    0   1   2
mochinelearning:3293414:3293480 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] 1/-1/-1->2->-1 [2] -1/-1/-1->2->1 [3] 1/-1/-1->2->-1
mochinelearning:3293413:3293481 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 0/-1/-1->1->2 [2] 2/-1/-1->1->0 [3] 0/-1/-1->1->2
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 01/04 :    0   2   1
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 02/04 :    0   1   2
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 03/04 :    0   2   1
mochinelearning:3293412:3293479 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 00 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 00 : 2[a0] -> 0[80] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 00 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 02 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 02 : 2[a0] -> 0[80] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 02 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 01 : 0[80] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 01 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 01 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 03 : 0[80] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 03 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 03 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Connected all rings
mochinelearning:3293414:3293480 [2] NCCL INFO Connected all rings
mochinelearning:3293413:3293481 [1] NCCL INFO Connected all rings
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 01 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 03 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 01 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 03 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 00 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 02 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 00 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 02 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Connected all trees
mochinelearning:3293414:3293480 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293414:3293480 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293412:3293479 [0] NCCL INFO Connected all trees
mochinelearning:3293412:3293479 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293412:3293479 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293413:3293481 [1] NCCL INFO Connected all trees
mochinelearning:3293413:3293481 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293413:3293481 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293414:3293480 [2] NCCL INFO comm 0x7fe1ec002fb0 rank 2 nranks 3 cudaDev 2 busId a0 - Init COMPLETE
mochinelearning:3293413:3293481 [1] NCCL INFO comm 0x7fda68002fb0 rank 1 nranks 3 cudaDev 1 busId 90 - Init COMPLETE
mochinelearning:3293412:3293479 [0] NCCL INFO comm 0x7fe718002fb0 rank 0 nranks 3 cudaDev 0 busId 80 - Init COMPLETE
[I ProcessGroupNCCL.cpp:1196] NCCL_DEBUG: INFO
mochinelearning:3293412:3293412 [0] NCCL INFO Launch mode Parallel
[2023-04-26 16:16:56,205] [INFO] [partition_parameters.py:413:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 33/33 [01:38<00:00,  2.98s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 33/33 [01:38<00:00,  2.98s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 33/33 [01:38<00:00,  2.98s/it]
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Loading data...
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
[I ProcessGroupNCCL.cpp:587] [Rank 2] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 2] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 1] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 0] NCCL watchdog thread started!
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 00/04 :    0   1   2
mochinelearning:3293413:3293877 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 0/-1/-1->1->2 [2] 2/-1/-1->1->0 [3] 0/-1/-1->1->2
mochinelearning:3293414:3293876 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] 1/-1/-1->2->-1 [2] -1/-1/-1->2->1 [3] 1/-1/-1->2->-1
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 01/04 :    0   2   1
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 02/04 :    0   1   2
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 03/04 :    0   2   1
mochinelearning:3293412:3293875 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 00 : 2[a0] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 00 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 00 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 02 : 2[a0] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 02 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 02 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 01 : 0[80] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 01 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 01 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 03 : 0[80] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 03 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 03 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Connected all rings
mochinelearning:3293412:3293875 [0] NCCL INFO Connected all rings
mochinelearning:3293413:3293877 [1] NCCL INFO Connected all rings
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 01 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 03 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 01 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 03 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 00 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 02 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 00 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 02 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Connected all trees
mochinelearning:3293414:3293876 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293414:3293876 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293412:3293875 [0] NCCL INFO Connected all trees
mochinelearning:3293412:3293875 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293412:3293875 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293413:3293877 [1] NCCL INFO Connected all trees
mochinelearning:3293413:3293877 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293413:3293877 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293412:3293875 [0] NCCL INFO comm 0x7fe59c002fb0 rank 0 nranks 3 cudaDev 0 busId 80 - Init COMPLETE
mochinelearning:3293414:3293876 [2] NCCL INFO comm 0x7fe064002fb0 rank 2 nranks 3 cudaDev 2 busId a0 - Init COMPLETE
mochinelearning:3293413:3293877 [1] NCCL INFO comm 0x7fd8e0002fb0 rank 1 nranks 3 cudaDev 1 busId 90 - Init COMPLETE
[I ProcessGroupNCCL.cpp:1196] NCCL_DEBUG: INFO
mochinelearning:3293412:3293412 [0] NCCL INFO Launch mode Parallel
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Creating extension directory /home/xlwu/.cache/torch_extensions/py310_cu113/cpu_adam...
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/xlwu/.cache/torch_extensions/py310_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/TH -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/TH -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -c /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 30.544737577438354 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 30.598078966140747 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 30.598520278930664 seconds
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Creating extension directory /home/xlwu/.cache/torch_extensions/py310_cu113/utils...
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Emitting ninja build file /home/xlwu/.cache/torch_extensions/py310_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/TH -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/THC -isystem /DATA/xlwu/anconda3/envs/llmenv3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o 
[2/2] c++ flatten_unflatten.o -shared -L/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 15.1830472946167 seconds
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 15.121537685394287 seconds
Time to load utils op: 15.222201108932495 seconds
Parameter Offload: Total persistent parameters: 0 in 0 params
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3293412 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3293414 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 3293413) of binary: /DATA/xlwu/anconda3/envs/llmenv3/bin/python
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0019960403442382812 seconds
INFO:torch.distributed.elastic.multiprocessing.errors:local_rank 1 FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html

Traceback (most recent call last):
  File "/DATA/xlwu/anconda3/envs/llmenv3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
train.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-26_16:19:53
  host      : mochinelearning
  rank      : 1 (local_rank: 1)
  exitcode  : -9 (pid: 3293413)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 3293413
========================================================
xv994 commented 1 year ago

Your specific error is: "Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination". You should make sure your CUDA and torch versions match.
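One way to compare the two versions in the environment that launches torchrun (that log line is only a warning, so treat this as a sanity check rather than a guaranteed fix):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
nvcc --version    # CUDA toolkit used to compile DeepSpeed's cpu_adam/utils extensions
nvidia-smi        # driver version and the CUDA runtime it supports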

xv994 commented 1 year ago

Have you ever tried this:

git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout 0041be5
pip install .

Maybe it works.

Hi, I have tried this method but still get this problem. Do you have any idea about it? The transformers version I used is 4.29.0.dev0. Thanks in advance!

My transformers version is 4.28.0.dev0. Maybe something went wrong on your side; please double-check your steps.

xv994 commented 1 year ago

If everyone here could communicate in Chinese, it might save some back-and-forth...

Lvzhh commented 1 year ago

This error can occur when there is not enough host RAM available. With FSDP, each process calls transformers' "from_pretrained" and loads the checkpoint itself, so memory usage becomes num_processes * (model_size + size_of_largest_shard), which crashes the processes.

To tackle this, we can use DeepSpeed instead of FSDP. DeepSpeed optimizes CPU memory usage during initialization, so it only needs about num_processes * size_of_largest_shard of RAM.
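To make that arithmetic concrete with rough, assumed numbers: a LLaMA-7B checkpoint in fp16 is about 13 GB and its largest Hugging Face shard is roughly 10 GB, so four FSDP ranks each calling from_pretrained need on the order of 4 x (13 + 10) = ~92 GB of host RAM, while the DeepSpeed path needs closer to 4 x 10 = 40 GB. A minimal ZeRO-3 configuration sketch in the spirit of this suggestion (field values are assumptions, not the ds_config.json used elsewhere in this thread) could be written and passed to train.py like this:

cat > ds_config.json <<'EOF'
{
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" }
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}
EOF
# then drop the --fsdp flags and launch with something like:
# torchrun --nproc_per_node=4 train.py --deepspeed ds_config.json <other training arguments>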

newstronger commented 1 year ago

The error shows:

Traceback (most recent call last):
  File "tools/train.py", line 194, in <module>
    main()
  File "tools/train.py", line 183, in main
    train_detector(
  File "/home/wangzhang/mmrotate/mmrotate/apis/train.py", line 144, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 53, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 31, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 63, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
    losses = self(**data)
  File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 119, in new_func
    return old_func(*args, **kwargs)
  File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 172, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/home/wangzhang/mmrotate/mmrotate/models/detectors/single_stage.py", line 81, in forward_train
    losses = self.bbox_head.forward_train(x, img_metas, gt_bboxes,
  File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmdet/models/dense_heads/base_dense_head.py", line 335, in forward_train
    losses = self.loss(*loss_inputs, gt_bboxes_ignore=gt_bboxes_ignore)
  File "/home/wangzhang/mmrotate/mmrotate/models/dense_heads/oriented_reppoints_head.py", line 952, in loss
    quality_assess_list, = multi_apply(
  File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmdet/core/utils/misc.py", line 30, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/home/wangzhang/mmrotate/mmrotate/models/dense_heads/oriented_reppoints_head.py", line 480, in pointsets_quality_assessment
    sampling_pts_pred_init = self.sampling_points(
  File "/home/wangzhang/mmrotate/mmrotate/models/dense_heads/oriented_reppoints_head.py", line 342, in sampling_points
    ratio = torch.linspace(0, 1, points_num).to(device).repeat(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
  what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7fdfc57631ee in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x23a21 (0x7fdfede06a21 in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x257 (0x7fdfede0b977 in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x463418 (0x7fe017356418 in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7fdfc574a7a5 in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: + 0x35f2f5 (0x7fe0172522f5 in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x679288 (0x7fe01756c288 in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7fe01756c655 in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4d38df]
frame #9: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4e55ab]
frame #10: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4e55ab]
frame #11: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4e07e0]
frame #12: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4f1908]
frame #13: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4f18f1]
frame #14: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4f18f1]
frame #15: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4f18f1]
frame #16: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4f18f1]
frame #17: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4f18f1]
frame #18: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4f18f1]
frame #19: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4c91b0]
frame #20: PyDict_SetItemString + 0x52 (0x5819d2 in /home/wangzhang/anaconda3/envs/open-mmlab/bin/python)
frame #21: PyImport_Cleanup + 0x93 (0x5a6b73 in /home/wangzhang/anaconda3/envs/open-mmlab/bin/python)
frame #22: Py_FinalizeEx + 0x71 (0x5a5ca1 in /home/wangzhang/anaconda3/envs/open-mmlab/bin/python)
frame #23: Py_RunMain + 0x112 (0x5a1972 in /home/wangzhang/anaconda3/envs/open-mmlab/bin/python)
frame #24: Py_BytesMain + 0x39 (0x579dd9 in /home/wangzhang/anaconda3/envs/open-mmlab/bin/python)
frame #25: __libc_start_main + 0xe7 (0x7fe02f882c87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #26: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x579c8d]

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17013 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17014 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17015 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 3 (pid: 17016) of binary: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python
Traceback (most recent call last):
  File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
tools/train.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-14_16:36:57
  host      : user-SYS-7049GP-TRT
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 17016)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 17016
======================================================
Kangzf1996 commented 1 year ago

If everyone here could communicate in Chinese, it might save some back-and-forth...

Hello, I am hitting this error now. Do you know what causes it? My transformers version is 4.28.0.dev0.

[2023-06-01 09:44:26,442] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-06-01 09:44:38,504] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:50<00:00,  1.53s/it]
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.2360477447509766 seconds
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.3748812675476074 seconds
Parameter Offload: Total persistent parameters: 0 in 0 params
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 308887) of binary: /opt/conda/bin/python3.8
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
train.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-01_09:47:02
  host      : alpaca-6655dbbbc6-btc9j
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 308887)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 308887
=======================================================
rbareja25 commented 1 year ago

Has anyone been able to fix this? I have increased the RAM, reduced the batch size, and downgraded the torchvision version, but nothing works.

TX-Yeager commented 7 months ago

Training the model takes a lot of time, and the process might be killed by SIGHUP when your terminal session ends, so don't just run python or torchrun directly. Try this: nohup python xxx.py &
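A sketch of that suggestion, reusing arguments from the command earlier in this thread (adjust paths and flags to your own run):

nohup torchrun --nproc_per_node=4 --master_port=11110 train.py \
    --model_name_or_path ./output/path \
    --data_path ./alpaca_data.json \
    --output_dir ./pretrained > train.log 2>&1 &
tail -f train.log   # the job keeps running after the terminal session closes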