/home/yons/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
[2024-02-06 16:11:58,452] torch.distributed.run: [WARNING]
[2024-02-06 16:11:58,452] torch.distributed.run: [WARNING]
[2024-02-06 16:11:58,452] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-02-06 16:11:58,452] torch.distributed.run: [WARNING]
02/06 16:12:00 - mmengine - INFO -
System environment:
sys.platform: linux
Python: 3.8.18 | packaged by conda-forge | (default, Dec 23 2023, 17:21:28) [GCC 12.3.0]
CUDA available: True
numpy_random_seed: 252581723
GPU 0,1: NVIDIA GeForce RTX 4090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.2, V12.2.91
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.1.0+cu121
PyTorch compiling details: PyTorch built with:
GCC 9.3
C++ Version: 201703
Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
.....
.....
.....
(omitted)
.....
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 82.00 MiB. GPU 0 has a total capacty of 23.64 GiB of which 59.25 MiB is free. Process 727402 has 1.89 GiB memory in use. Including non-PyTorch memory, this process has 21.32 GiB memory in use. Of the allocated memory 20.56 GiB is allocated by PyTorch, and 312.04 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-02-06 16:12:08,473] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 727401) of binary: /home/yons/miniconda3/envs/openmmlab/bin/python
Traceback (most recent call last):
File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Prerequisite
Environment
Package Version Editable project location
addict 2.4.0
aenum 3.1.15
attrs 23.2.0
certifi 2022.12.7
charset-normalizer 2.1.1
chumpy 0.70
click 8.1.7
colorama 0.4.6
contourpy 1.1.1
coverage 7.4.1
cycler 0.12.1
Cython 3.0.8
dill 0.3.8
exceptiongroup 1.2.0
filelock 3.9.0
flake8 7.0.0
fonttools 4.47.2
fsspec 2023.4.0
grpcio 1.59.2
idna 3.4
importlib-metadata 7.0.1
importlib-resources 6.1.1
iniconfig 2.0.0
interrogate 1.5.0
isort 4.3.21
Jinja2 3.1.2
json-tricks 3.17.3
kiwisolver 1.4.5
markdown-it-py 3.0.0
MarkupSafe 2.1.3
matplotlib 3.7.3
mccabe 0.7.0
mdurl 0.1.1
mmcv 2.1.0
mmdeploy 1.3.0 /home/yons/train/code/mmdeploy
mmdet 3.2.0
mmengine 0.9.0
mmpose 1.2.0 /home/yons/train/code/mmpose
mpmath 1.3.0
multiprocess 0.70.16
munkres 1.1.4
ncnn 1.0.20240130 /home/yons/train/code/mmdeploy-dep/ncnn/python
networkx 3.0
numpy 1.24.1
onnx 1.15.0
opencv-python 4.8.1.78
packaging 23.2
parameterized 0.9.0
pillow 10.2.0
pip 24.0
platformdirs 4.2.0
pluggy 1.4.0
portalocker 2.8.2
prettytable 3.9.0
protobuf 3.20.2
py 1.11.0
pycocotools 2.0.7
pycodestyle 2.11.1
pyflakes 3.2.0
Pygments 2.17.2
pyparsing 3.1.1
pytest 8.0.0
pytest-runner 6.0.1
python-dateutil 2.8.2
PyYAML 6.0.1
requests 2.28.1
rich 13.7.0
scipy 1.10.1
setuptools 69.0.3
shapely 2.0.2
six 1.16.0
sympy 1.12
tabulate 0.9.0
termcolor 2.4.0
terminaltables 3.1.10
toml 0.10.2
tomli 2.0.1
torch 2.1.0+cu121
torchaudio 2.1.0+cu121
torchvision 0.16.0+cu121
tqdm 4.65.2
triton 2.1.0
typing_extensions 4.8.0
urllib3 1.26.13
wcwidth 0.2.13
wheel 0.42.0
xdoctest 1.1.3
xtcocotools 1.14.3
yapf 0.40.2
zipp 3.17.0
Reproduces the problem - code sample
dist_train.sh:
#!/usr/bin/env bash
# Copyright (c) OpenMMLab. All rights reserved.

CONFIG=$1
GPUS=$2
NNODES=${NNODES:-1}
NODE_RANK=${NODE_RANK:-0}
PORT=${PORT:-29500}
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}

PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
python -m torch.distributed.launch \
    --nnodes=$NNODES \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --nproc_per_node=$GPUS \
    --master_port=$PORT \
    $(dirname "$0")/train.py \
    $CONFIG \
    --launcher pytorch ${@:3}
Reproduces the problem - command or script
bash ./tools/dist_train.sh /home/yons/train/code/mmpose/configs/body_2d_keypoint/topdown_heatmap/my/pose_td-hm_hrnet-w48_8xb32-210e_plank-384x288.py 2
or
CUDA_VISIBLE_DEVICES=0,1 bash ./tools/dist_train.sh /home/yons/train/code/mmpose/configs/body_2d_keypoint/topdown_heatmap/my/pose_td-hm_hrnet-w48_8xb32-210e_plank-384x288.py 2
or
python -m torch.distributed.launch --nproc_per_node=2 ./tools/train.py /home/yons/train/code/mmpose/configs/body_2d_keypoint/topdown_heatmap/my/pose_td-hm_hrnet-w48_8xb32-210e_plank-384x288.py
Reproduces the problem - error message
/home/yons/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
[2024-02-06 16:11:58,452] torch.distributed.run: [WARNING]
[2024-02-06 16:11:58,452] torch.distributed.run: [WARNING]
[2024-02-06 16:11:58,452] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-02-06 16:11:58,452] torch.distributed.run: [WARNING]
02/06 16:12:00 - mmengine - INFO -
System environment:
sys.platform: linux
Python: 3.8.18 | packaged by conda-forge | (default, Dec 23 2023, 17:21:28) [GCC 12.3.0]
CUDA available: True
numpy_random_seed: 252581723
GPU 0,1: NVIDIA GeForce RTX 4090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.2, V12.2.91
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.1.0+cu121
PyTorch compiling details: PyTorch built with:
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.16.0+cu121
OpenCV: 4.8.1
MMEngine: 0.9.0
Runtime environment:
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: 252581723
Distributed launcher: none
Distributed training: False
GPU number: 1
02/06 16:12:00 - mmengine - INFO -
System environment:
sys.platform: linux
Python: 3.8.18 | packaged by conda-forge | (default, Dec 23 2023, 17:21:28) [GCC 12.3.0]
CUDA available: True
numpy_random_seed: 277659974
GPU 0,1: NVIDIA GeForce RTX 4090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.2, V12.2.91
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.1.0+cu121
PyTorch compiling details: PyTorch built with:
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.16.0+cu121
OpenCV: 4.8.1
MMEngine: 0.9.0
Runtime environment:
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: 277659974
Distributed launcher: none
Distributed training: False
GPU number: 1
.....
.....
.....
(omitted)
.....
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 82.00 MiB. GPU 0 has a total capacty of 23.64 GiB of which 59.25 MiB is free. Process 727402 has 1.89 GiB memory in use. Including non-PyTorch memory, this process has 21.32 GiB memory in use. Of the allocated memory 20.56 GiB is allocated by PyTorch, and 312.04 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-02-06 16:12:08,473] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 727401) of binary: /home/yons/miniconda3/envs/openmmlab/bin/python
Traceback (most recent call last):
File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./tools/train.py FAILED
Failures:
[1]:
  time : 2024-02-06_16:12:08
  host : yons-MS-7E06
  rank : 1 (local_rank: 1)
  exitcode : 1 (pid: 727402)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time : 2024-02-06_16:12:08
  host : yons-MS-7E06
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 727401)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
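As a reading aid for the out-of-memory message above, here is a minimal sketch that pulls the memory figures out of the error text. The function name `summarize_oom` and the regular expressions are assumptions made for illustration and are not part of PyTorch; the message string is copied from the log, including its original spelling `capacty`.

```python
import re

# Excerpt of the OOM message, copied verbatim from the log above.
OOM_MESSAGE = (
    "GPU 0 has a total capacty of 23.64 GiB of which 59.25 MiB is free. "
    "Process 727402 has 1.89 GiB memory in use. Including non-PyTorch memory, "
    "this process has 21.32 GiB memory in use."
)


def _to_gib(value, unit):
    """Convert a (number, unit) pair from the message to GiB."""
    return float(value) / 1024 if unit == "MiB" else float(value)


def summarize_oom(message):
    """Hypothetical helper: extract per-process memory use from an OOM message."""
    total = re.search(r"total capac\w* of ([\d.]+) (GiB|MiB)", message)
    free = re.search(r"of which ([\d.]+) (GiB|MiB) is free", message)
    this = re.search(r"this process has ([\d.]+) (GiB|MiB) memory in use", message)
    # Other processes are reported as "Process <pid> has <n> <unit> memory in use".
    others = re.findall(r"Process (\d+) has ([\d.]+) (GiB|MiB) memory in use", message)
    return {
        "total_gib": _to_gib(*total.groups()),
        "free_gib": _to_gib(*free.groups()),
        "this_process_gib": _to_gib(*this.groups()),
        "other_processes": {int(pid): _to_gib(v, u) for pid, v, u in others},
    }


summary = summarize_oom(OOM_MESSAGE)
for pid, gib in summary["other_processes"].items():
    print(f"pid {pid} also holds {gib:.2f} GiB on the same GPU")
```

In this log, PID 727402 is the process that the failure summary lists as local_rank 1, yet the message reports it holding memory on GPU 0. This may indicate that both workers were placed on the same device, although the log alone does not prove it.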
Additional information
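Running train.py directly for single-GPU training works fine, but multi-GPU training with any of the three commands above fails with the error shown above. The error reports a CUDA out-of-memory condition, yet the same batch size does not run out of memory on a single GPU; it only does so when two GPUs are used. There is also a torch.distributed.elastic.multiprocessing.errors.ChildFailedError. What causes this, and how can I fix it? Dual-GPU training works without problems in my other projects. I also requested to join the QQ and WeChat groups to discuss and ask questions, but none of them has accepted my request, which I find puzzling.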
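The FutureWarning in the log recommends reading the local rank from os.environ['LOCAL_RANK'] rather than from a --local-rank argument. Below is a minimal sketch of that lookup, assuming the launcher exports LOCAL_RANK and WORLD_SIZE (torchrun does). The helper name `read_dist_env` is hypothetical and is not part of MMPose, MMEngine, or PyTorch.

```python
import os


def read_dist_env(environ=None):
    """Hypothetical helper: report what the launcher exported to this worker."""
    env = os.environ if environ is None else environ
    # Fall back to single-process defaults when the variables are absent.
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return {
        "local_rank": local_rank,
        "world_size": world_size,
        "distributed": world_size > 1,
    }


# Example with an explicit mapping standing in for a two-process launch.
print(read_dist_env({"LOCAL_RANK": "1", "WORLD_SIZE": "2"}))
```

Printing this from each worker at the top of a training script would show whether the two processes see different local ranks. In the log above, both processes report "Distributed launcher: none" and "GPU number: 1".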