open-mmlab / mmpose

OpenMMLab Pose Estimation Toolbox and Benchmark.
https://mmpose.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Problems running the multi-GPU training command provided by mmpose #2949

Open Alphacch opened 8 months ago

Alphacch commented 8 months ago

Prerequisite

Environment

Package Version Editable project location


addict 2.4.0
aenum 3.1.15
attrs 23.2.0
certifi 2022.12.7
charset-normalizer 2.1.1
chumpy 0.70
click 8.1.7
colorama 0.4.6
contourpy 1.1.1
coverage 7.4.1
cycler 0.12.1
Cython 3.0.8
dill 0.3.8
exceptiongroup 1.2.0
filelock 3.9.0
flake8 7.0.0
fonttools 4.47.2
fsspec 2023.4.0
grpcio 1.59.2
idna 3.4
importlib-metadata 7.0.1
importlib-resources 6.1.1
iniconfig 2.0.0
interrogate 1.5.0
isort 4.3.21
Jinja2 3.1.2
json-tricks 3.17.3
kiwisolver 1.4.5
markdown-it-py 3.0.0
MarkupSafe 2.1.3
matplotlib 3.7.3
mccabe 0.7.0
mdurl 0.1.1
mmcv 2.1.0
mmdeploy 1.3.0 /home/yons/train/code/mmdeploy
mmdet 3.2.0
mmengine 0.9.0
mmpose 1.2.0 /home/yons/train/code/mmpose
mpmath 1.3.0
multiprocess 0.70.16
munkres 1.1.4
ncnn 1.0.20240130 /home/yons/train/code/mmdeploy-dep/ncnn/python
networkx 3.0
numpy 1.24.1
onnx 1.15.0
opencv-python 4.8.1.78
packaging 23.2
parameterized 0.9.0
pillow 10.2.0
pip 24.0
platformdirs 4.2.0
pluggy 1.4.0
portalocker 2.8.2
prettytable 3.9.0
protobuf 3.20.2
py 1.11.0
pycocotools 2.0.7
pycodestyle 2.11.1
pyflakes 3.2.0
Pygments 2.17.2
pyparsing 3.1.1
pytest 8.0.0
pytest-runner 6.0.1
python-dateutil 2.8.2
PyYAML 6.0.1
requests 2.28.1
rich 13.7.0
scipy 1.10.1
setuptools 69.0.3
shapely 2.0.2
six 1.16.0
sympy 1.12
tabulate 0.9.0
termcolor 2.4.0
terminaltables 3.1.10
toml 0.10.2
tomli 2.0.1
torch 2.1.0+cu121
torchaudio 2.1.0+cu121
torchvision 0.16.0+cu121
tqdm 4.65.2
triton 2.1.0
typing_extensions 4.8.0
urllib3 1.26.13
wcwidth 0.2.13
wheel 0.42.0
xdoctest 1.1.3
xtcocotools 1.14.3
yapf 0.40.2
zipp 3.17.0

Reproduces the problem - code sample

dist_train.sh:

#!/usr/bin/env bash
# Copyright (c) OpenMMLab. All rights reserved.

CONFIG=$1
GPUS=$2
NNODES=${NNODES:-1}
NODE_RANK=${NODE_RANK:-0}
PORT=${PORT:-29500}
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}

PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
python -m torch.distributed.launch \
    --nnodes=$NNODES \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --nproc_per_node=$GPUS \
    --master_port=$PORT \
    $(dirname "$0")/train.py \
    $CONFIG \
    --launcher pytorch ${@:3}
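For reference, the deprecation warning in the log below recommends torchrun over torch.distributed.launch. A minimal sketch of the same launch via torchrun (not from the original script and untested here; it assumes train.py picks up LOCAL_RANK from the environment, which MMEngine's pytorch launcher is expected to do):

# hypothetical torchrun equivalent of the launch above
PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
torchrun \
    --nnodes=$NNODES \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --nproc_per_node=$GPUS \
    --master_port=$PORT \
    $(dirname "$0")/train.py \
    $CONFIG \
    --launcher pytorch ${@:3}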

Reproduces the problem - command or script

bash ./tools/dist_train.sh /home/yons/train/code/mmpose/configs/body_2d_keypoint/topdown_heatmap/my/pose_td-hm_hrnet-w48_8xb32-210e_plank-384x288.py 2

or

CUDA_VISIBLE_DEVICES=0,1 bash ./tools/dist_train.sh /home/yons/train/code/mmpose/configs/body_2d_keypoint/topdown_heatmap/my/pose_td-hm_hrnet-w48_8xb32-210e_plank-384x288.py 2

or

python -m torch.distributed.launch --nproc_per_node=2 ./tools/train.py /home/yons/train/code/mmpose/configs/body_2d_keypoint/topdown_heatmap/my/pose_td-hm_hrnet-w48_8xb32-210e_plank-384x288.py
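Note that dist_train.sh appends --launcher pytorch when it calls train.py; a hand-written torch.distributed.launch invocation presumably needs the same flag, otherwise MMEngine treats each process as a plain single-GPU run. A hedged variant of the third command with that flag added:

# same command as above, with the launcher flag dist_train.sh normally passes
python -m torch.distributed.launch --nproc_per_node=2 ./tools/train.py \
    /home/yons/train/code/mmpose/configs/body_2d_keypoint/topdown_heatmap/my/pose_td-hm_hrnet-w48_8xb32-210e_plank-384x288.py \
    --launcher pytorch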

Reproduces the problem - error message

/home/yons/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use-env is set by default in torchrun. If your script expects --local-rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions

warnings.warn(
[2024-02-06 16:11:58,452] torch.distributed.run: [WARNING]
[2024-02-06 16:11:58,452] torch.distributed.run: [WARNING]
[2024-02-06 16:11:58,452] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-02-06 16:11:58,452] torch.distributed.run: [WARNING]
02/06 16:12:00 - mmengine - INFO -

System environment:
    sys.platform: linux
    Python: 3.8.18 | packaged by conda-forge | (default, Dec 23 2023, 17:21:28) [GCC 12.3.0]
    CUDA available: True
    numpy_random_seed: 252581723
    GPU 0,1: NVIDIA GeForce RTX 4090
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 12.2, V12.2.91
    GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
    PyTorch: 2.1.0+cu121
    PyTorch compiling details: PyTorch built with:

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: 252581723
    Distributed launcher: none
    Distributed training: False
    GPU number: 1

02/06 16:12:00 - mmengine - INFO -

System environment:
    sys.platform: linux
    Python: 3.8.18 | packaged by conda-forge | (default, Dec 23 2023, 17:21:28) [GCC 12.3.0]
    CUDA available: True
    numpy_random_seed: 277659974
    GPU 0,1: NVIDIA GeForce RTX 4090
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 12.2, V12.2.91
    GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
    PyTorch: 2.1.0+cu121
    PyTorch compiling details: PyTorch built with:

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: 277659974
    Distributed launcher: none
    Distributed training: False
    GPU number: 1

..... ..... ..... (omitted) .....
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 82.00 MiB. GPU 0 has a total capacty of 23.64 GiB of which 59.25 MiB is free. Process 727402 has 1.89 GiB memory in use. Including non-PyTorch memory, this process has 21.32 GiB memory in use. Of the allocated memory 20.56 GiB is allocated by PyTorch, and 312.04 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-02-06 16:12:08,473] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 727401) of binary: /home/yons/miniconda3/envs/openmmlab/bin/python
Traceback (most recent call last):
  File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/yons/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./tools/train.py FAILED

Failures:
[1]:
  time      : 2024-02-06_16:12:08
  host      : yons-MS-7E06
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 727402)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time      : 2024-02-06_16:12:08
  host      : yons-MS-7E06
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 727401)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
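The OOM message above also points at PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of that suggestion applied to the same launch (the max_split_size_mb value is an illustrative guess, and allocator fragmentation may not be the root cause here):

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 \
bash ./tools/dist_train.sh /home/yons/train/code/mmpose/configs/body_2d_keypoint/topdown_heatmap/my/pose_td-hm_hrnet-w48_8xb32-210e_plank-384x288.py 2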

Additional information

Running train.py directly for single-GPU training works fine, but all three multi-GPU commands above fail with the errors shown. The errors mention running out of GPU memory, yet with the same batch size single-GPU training never runs out of memory, while two-GPU training reports insufficient memory; there is also a torch.distributed.elastic.multiprocessing.errors.ChildFailedError. What is the cause, and how can I fix it? Multi-GPU training in my other projects works without problems. Also, none of the QQ or WeChat groups I have tried to join have accepted my request, so I have had nowhere to ask and discuss this, which I find puzzling.

Ben-Louis commented 8 months ago

It looks like another process is occupying memory on one of the GPUs, so there is not enough free memory left? The community group below should still be joinable by scanning the QR code. [QR code image]
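A quick way to check whether another process is holding memory on one of the cards before launching (standard nvidia-smi queries, not specific to mmpose):

# per-GPU memory usage
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
# processes currently holding GPU memory
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv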

Alphacch commented 8 months ago

> It looks like another process is occupying memory on one of the GPUs, so there is not enough free memory left? The community group below should still be joinable by scanning the QR code. [image]

But why are there other errors besides the out-of-memory one? And with multiple GPUs, shouldn't the total memory across all the cards be available?