Multi GPU training error

Search before asking

[X] I have searched the YOLOv8 issues and found no similar bug report.

YOLOv8 Component

No response

Bug

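Training on six GPUs (device=6,5,4,3,2,1) crashes during DDP startup, right after the optimizer is created: rank 0 is killed by SIGSEGV (exit code -11) and the remaining workers receive SIGTERM. Full log:

Transferred 355/355 items from pretrained weights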
Ultralytics YOLOv8.2.46 🚀 Python-3.10.14 torch-2.3.1+cu118 CUDA:6 (Tesla V100-SXM2-16GB, 16161MiB)
                                                            CUDA:5 (Tesla V100-SXM2-16GB, 16161MiB)
                                                            CUDA:4 (Tesla V100-SXM2-16GB, 16161MiB)
                                                            CUDA:3 (Tesla V100-SXM2-16GB, 16161MiB)
                                                            CUDA:2 (Tesla V100-SXM2-16GB, 16161MiB)
                                                            CUDA:1 (Tesla V100-SXM2-16GB, 16161MiB)
engine/trainer: task=detect, mode=train, model=yolov8n.yaml, data=./ultralytics/cfg/datasets/layout.yaml, epochs=300, time=None, patience=100, batch=192, imgsz=672, save=True, save_period=-1, cache=False, device=6,5,4,3,2,1, workers=0, project=None, name=train8, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, mean=[0.92367495, 0.92194206, 0.91870715], std=[0.15719514, 0.15910767, 0.15992016], save_dir=runs/detect/train8
Overriding model.yaml nc=80 with nc=8

                   from  n    params  module                                       arguments                     
  0                  -1  1       464  ultralytics.nn.modules.conv.Conv             [3, 16, 3, 2]                 
  1                  -1  1      4672  ultralytics.nn.modules.conv.Conv             [16, 32, 3, 2]                
  2                  -1  1      7360  ultralytics.nn.modules.block.C2f             [32, 32, 1, True]             
  3                  -1  1     18560  ultralytics.nn.modules.conv.Conv             [32, 64, 3, 2]                
  4                  -1  2     49664  ultralytics.nn.modules.block.C2f             [64, 64, 2, True]             
  5                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 2]               
  6                  -1  2    197632  ultralytics.nn.modules.block.C2f             [128, 128, 2, True]           
  7                  -1  1    295424  ultralytics.nn.modules.conv.Conv             [128, 256, 3, 2]              
  8                  -1  1    460288  ultralytics.nn.modules.block.C2f             [256, 256, 1, True]           
  9                  -1  1    164608  ultralytics.nn.modules.block.SPPF            [256, 256, 5]                 
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 11             [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 12                  -1  1    148224  ultralytics.nn.modules.block.C2f             [384, 128, 1]                 
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 14             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 15                  -1  1     37248  ultralytics.nn.modules.block.C2f             [192, 64, 1]                  
 16                  -1  1     36992  ultralytics.nn.modules.conv.Conv             [64, 64, 3, 2]                
 17            [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 18                  -1  1    123648  ultralytics.nn.modules.block.C2f             [192, 128, 1]                 
 19                  -1  1    147712  ultralytics.nn.modules.conv.Conv             [128, 128, 3, 2]              
 20             [-1, 9]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 21                  -1  1    493056  ultralytics.nn.modules.block.C2f             [384, 256, 1]                 
 22        [15, 18, 21]  1    752872  ultralytics.nn.modules.head.Detect           [8, [64, 128, 256]]           
YOLOv8n summary: 225 layers, 3012408 parameters, 3012392 gradients, 8.2 GFLOPs

Transferred 319/355 items from pretrained weights
DDP: debug command /home/opt/anaconda3/envs/torch_gqh/bin/python -m torch.distributed.run --nproc_per_node 6 --master_port 24515 /home/work/.config/Ultralytics/DDP/_temp_jp0x4t9_140566440279200.py

Ultralytics YOLOv8.2.46 🚀 Python-3.10.14 torch-2.3.1+cu118 CUDA:6 (Tesla V100-SXM2-16GB, 16161MiB)
                                                            CUDA:5 (Tesla V100-SXM2-16GB, 16161MiB)
                                                            CUDA:4 (Tesla V100-SXM2-16GB, 16161MiB)
                                                            CUDA:3 (Tesla V100-SXM2-16GB, 16161MiB)
                                                            CUDA:2 (Tesla V100-SXM2-16GB, 16161MiB)
                                                            CUDA:1 (Tesla V100-SXM2-16GB, 16161MiB)
TensorBoard: Start with 'tensorboard --logdir runs/detect/train8', view at http://localhost:6006/
Overriding model.yaml nc=80 with nc=8
Freezing layer 'model.22.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
Downloading https://github.com/ultralytics/assets/releases/download/v8.2.0/yolov8n.pt to 'yolov8n.pt'...
AMP: checks skipped ⚠️, offline and unable to download YOLOv8n. Setting 'amp=True'. If you experience zero-mAP or NaN losses you can disable AMP with amp=False.
train: Scanning /home/work/guoquanhao/ultralytics/layout/labels/train-nopicfoot.cache... 4025 images, 0 backgrounds, 0 corrupt: 100%|█
train: WARNING ⚠️ /home/work/guoquanhao/ultralytics/layout/images/train-nopicfoot/doc-jwd5zy2i6qf6jmi2.jpg: 1 duplicate labels removed
train: WARNING ⚠️ /home/work/guoquanhao/ultralytics/layout/images/train-nopicfoot/doc-xebshfsts2jqpd5g.jpg: 1 duplicate labels removed
val: Scanning /home/work/guoquanhao/ultralytics/layout/labels/valid-nopicfoot.cache... 1000 images, 0 backgrounds, 0 corrupt: 100%|███
Plotting labels to runs/detect/train8/labels.jpg... 
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically... 
optimizer: AdamW(lr=0.000714, momentum=0.9) with parameter groups 57 weight(decay=0.0), 64 weight(decay=0.0015), 63 bias(decay=0.0)
W0701 15:44:03.619000 140038850516736 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1805550 closing signal SIGTERM
W0701 15:44:03.619000 140038850516736 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1805551 closing signal SIGTERM
W0701 15:44:03.621000 140038850516736 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1805552 closing signal SIGTERM
W0701 15:44:03.622000 140038850516736 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1805553 closing signal SIGTERM
W0701 15:44:03.626000 140038850516736 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1805554 closing signal SIGTERM
E0701 15:44:04.066000 140038850516736 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -11) local_rank: 0 (pid: 1805549) of binary: /home/opt/anaconda3/envs/torch_gqh/bin/python
Traceback (most recent call last):
  File "/home/opt/anaconda3/envs/torch_gqh/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/opt/anaconda3/envs/torch_gqh/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/opt/anaconda3/envs/torch_gqh/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in <module>
    main()
  File "/home/opt/anaconda3/envs/torch_gqh/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/opt/anaconda3/envs/torch_gqh/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/opt/anaconda3/envs/torch_gqh/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/opt/anaconda3/envs/torch_gqh/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/opt/anaconda3/envs/torch_gqh/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/work/.config/Ultralytics/DDP/_temp_jp0x4t9_140566440279200.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-01_15:44:03
  host      : yq01-sys-hic-k8s-v100-box-a223-0033.yq01.baidu.com
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 1805549)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 1805549
============================================================
Traceback (most recent call last):
  File "/home/work/guoquanhao/ultralytics/yolo_train.py", line 11, in <module>
    results = model.train(data="./ultralytics/cfg/datasets/layout.yaml", epochs=300, imgsz=672, device=os.getenv('CUDA_VISIBLE_DEVICES'), workers=0, batch=192)
  File "/home/work/guoquanhao/ultralytics/ultralytics/engine/model.py", line 650, in train
    self.trainer.train()
  File "/home/work/guoquanhao/ultralytics/ultralytics/engine/trainer.py", line 205, in train
    raise e
  File "/home/work/guoquanhao/ultralytics/ultralytics/engine/trainer.py", line 203, in train
    subprocess.run(cmd, check=True)
  File "/home/opt/anaconda3/envs/torch_gqh/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/home/opt/anaconda3/envs/torch_gqh/bin/python', '-m', 'torch.distributed.run', '--nproc_per_node', '6', '--master_port', '24515', '/home/work/.config/Ultralytics/DDP/_temp_jp0x4t9_140566440279200.py']' returned non-zero exit status 1.
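
For reading the two exit codes in the log above: the worker reports exitcode -11, while the outer `subprocess.run` call reports exit status 1. A negative code in this convention means the process was terminated by that signal number, and the status 1 comes from the launcher itself exiting after raising ChildFailedError. The sketch below decodes such codes with the standard library only; `describe_exit` is a hypothetical helper name, not part of torch or Ultralytics.

```python
import signal


def describe_exit(returncode: int) -> str:
    """Turn a subprocess-style return code into a readable label.

    Hypothetical helper for illustration; not a torch or Ultralytics API.
    """
    if returncode < 0:
        # A negative code means "terminated by signal number -returncode".
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return f"unknown signal {-returncode}"
    return f"exit status {returncode}"


print(describe_exit(-11))  # the worker (local_rank 0) in the log
print(describe_exit(1))    # the torch.distributed.run launcher
```

A SIGSEGV originates in native code (a C/C++ extension or driver library), which is why no Python traceback for the worker appears in the log.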

Environment

Package                   Version              Editable project location
------------------------- -------------------- ---------------------------------
absl-py                   2.1.0
aiofiles                  23.2.1
albucore                  0.0.12
albumentations            1.4.10
altair                    5.3.0
annotated-types           0.7.0
anyio                     4.4.0
asttokens                 2.4.1
attrs                     23.2.0
certifi                   2024.6.2
charset-normalizer        3.3.2
click                     8.1.7
cmake                     3.29.6
coloredlogs               15.0.1
contourpy                 1.2.1
cycler                    0.12.1
decorator                 5.1.1
dnspython                 2.6.1
email_validator           2.2.0
exceptiongroup            1.2.1
executing                 2.0.1
fastapi                   0.111.0
fastapi-cli               0.0.4
ffmpy                     0.3.2
filelock                  3.15.4
flatbuffers               24.3.25
fonttools                 4.53.0
fsspec                    2024.6.0
gitdb                     4.0.11
GitPython                 3.1.43
gradio                    4.31.5
gradio_client             0.16.4
grpcio                    1.64.1
h11                       0.14.0
httpcore                  1.0.5
httptools                 0.6.1
httpx                     0.27.0
huggingface-hub           0.23.2
humanfriendly             10.0
idna                      3.7
imageio                   2.34.2
importlib_resources       6.4.0
ipython                   8.26.0
jedi                      0.19.1
Jinja2                    3.1.4
joblib                    1.4.2
jsonschema                4.22.0
jsonschema-specifications 2023.12.1
kiwisolver                1.4.5
lazy_loader               0.4
lit                       18.1.8
Markdown                  3.6
markdown-it-py            3.0.0
MarkupSafe                2.1.5
matplotlib                3.9.0
matplotlib-inline         0.1.7
mdurl                     0.1.2
mpmath                    1.3.0
networkx                  3.3
numpy                     1.26.4
nvidia-cublas-cu11        11.11.3.6
nvidia-cuda-cupti-cu11    11.8.87
nvidia-cuda-nvrtc-cu11    11.8.89
nvidia-cuda-runtime-cu11  11.8.89
nvidia-cudnn-cu11         8.7.0.84
nvidia-cufft-cu11         10.9.0.58
nvidia-curand-cu11        10.3.0.86
nvidia-cusolver-cu11      11.4.1.48
nvidia-cusparse-cu11      11.7.5.86
nvidia-nccl-cu11          2.20.5
nvidia-nvtx-cu11          11.8.86
onnx                      1.14.0
onnxruntime               1.15.1
onnxruntime-gpu           1.16.3
onnxsim                   0.4.36
opencv-python             4.10.0.84
opencv-python-headless    4.10.0.84
orjson                    3.10.5
packaging                 24.1
pandas                    2.2.2
parso                     0.8.4
pexpect                   4.9.0
pillow                    10.3.0
pip                       24.0
prompt_toolkit            3.0.47
protobuf                  4.25.3
psutil                    5.9.8
ptyprocess                0.7.0
pure-eval                 0.2.2
py-cpuinfo                9.0.0
pycocotools               2.0.7
pydantic                  2.7.4
pydantic_core             2.18.4
pydub                     0.25.1
Pygments                  2.18.0
pyparsing                 3.1.2
python-dateutil           2.9.0.post0
python-dotenv             1.0.1
python-multipart          0.0.9
pytz                      2024.1
PyYAML                    6.0.1
referencing               0.35.1
requests                  2.32.3
rich                      13.7.1
rpds-py                   0.18.1
ruff                      0.5.0
safetensors               0.4.3
scikit-image              0.24.0
scikit-learn              1.5.0
scipy                     1.13.0
seaborn                   0.13.2
semantic-version          2.10.0
setuptools                69.5.1
shellingham               1.5.4
six                       1.16.0
smmap                     5.0.1
sniffio                   1.3.1
stack-data                0.6.3
starlette                 0.37.2
sympy                     1.12.1
tensorboard               2.17.0
tensorboard-data-server   0.7.2
thop                      0.1.1.post2209072238
threadpoolctl             3.5.0
tifffile                  2024.6.18
tomli                     2.0.1
tomlkit                   0.12.0
toolz                     0.12.1
torch                     2.3.1+cu118
torchaudio                2.3.1+cu118
torchvision               0.18.1+cu118
tqdm                      4.66.4
traitlets                 5.14.3
triton                    2.3.1
typer                     0.12.3
typing_extensions         4.12.2
tzdata                    2024.1
ujson                     5.10.0
ultralytics               8.2.46               /home/work/guoquanhao/ultralytics
ultralytics-thop          2.0.0
urllib3                   2.2.2
uvicorn                   0.30.1
uvloop                    0.19.0
watchfiles                0.22.0
wcwidth                   0.2.13
websockets                11.0.3
Werkzeug                  3.0.3
wheel                     0.43.0
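
One detail in the list above that may be worth ruling out: `opencv-python` and `opencv-python-headless` are both installed, as are `onnxruntime` and `onnxruntime-gpu`. Each pair installs the same importable module (`cv2` and `onnxruntime` respectively), so the files of one can overwrite the other's. Whether this has anything to do with the segfault is not established here. The sketch below only flags such pairs in `pip list` output; the pair list and function names are assumptions made for illustration.

```python
# Pairs that install the same importable module (cv2, onnxruntime).
# This list is an assumption for illustration, not an exhaustive registry.
OVERLAPPING = [
    ("opencv-python", "opencv-python-headless"),
    ("onnxruntime", "onnxruntime-gpu"),
]


def parse_pip_list(text):
    """Map package name -> version from `pip list` column output."""
    pkgs = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the header row, and the dashed separator row.
        if len(parts) < 2 or parts[0] == "Package" or set(parts[0]) == {"-"}:
            continue
        pkgs[parts[0].lower()] = parts[1]
    return pkgs


def find_overlaps(pkgs):
    """Return the overlapping pairs where both packages are installed."""
    return [(a, b) for a, b in OVERLAPPING if a in pkgs and b in pkgs]


# A few rows copied from the environment listing in this report.
sample = """\
Package                   Version
------------------------- --------------------
onnxruntime               1.15.1
onnxruntime-gpu           1.16.3
opencv-python             4.10.0.84
opencv-python-headless    4.10.0.84
torch                     2.3.1+cu118
"""

print(find_overlaps(parse_pip_list(sample)))
```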

Minimal Reproducible Example

yolo_train.py

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '6,5,4,3,2,1'

from ultralytics import YOLO

# Load a model
model = YOLO("yolov8n.yaml").load("./pretrained_model/yolov8n.pt")  # build from YAML and transfer weights

# Train the model
results = model.train(data="./ultralytics/cfg/datasets/layout.yaml", epochs=300, imgsz=672, device=os.getenv('CUDA_VISIBLE_DEVICES'), workers=0, batch=192)

Run with: python yolo_train.py
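
To get more detail out of a rerun, the script can be launched with a few diagnostic environment variables set. This is a sketch of a debugging aid, not a confirmed fix: `PYTHONFAULTHANDLER`, `NCCL_DEBUG`, and `TORCH_DISTRIBUTED_DEBUG` are existing Python, NCCL, and PyTorch settings, while `DEBUG_VARS` and `debug_env` are hypothetical names introduced here.

```python
import os

# Diagnostic settings only; they add logging and do not change training.
DEBUG_VARS = {
    "PYTHONFAULTHANDLER": "1",            # dump the Python stack on SIGSEGV
    "NCCL_DEBUG": "INFO",                 # verbose NCCL initialization logs
    "TORCH_DISTRIBUTED_DEBUG": "DETAIL",  # extra torch.distributed logging
}


def debug_env(base=None):
    """Return a copy of the environment with the diagnostic variables added."""
    env = dict(os.environ if base is None else base)
    env.update(DEBUG_VARS)
    return env


# Print the equivalent shell exports for a manual rerun.
for name, value in DEBUG_VARS.items():
    print(f"export {name}={value}")
```

The returned mapping can be passed as `env=debug_env()` to `subprocess.run(["python", "yolo_train.py"], ...)`. Child processes inherit the environment by default, so the DDP workers started by the training script would see the same variables.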

Additional

No response

Are you willing to submit a PR?

[ ] Yes I'd like to help by submitting a PR!

ultralytics / ultralytics

Multi GPU training error #14124
