ultralytics / ultralytics

NEW - YOLOv8 🚀 in PyTorch > ONNX > OpenVINO > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Multi GPU training error #14124

Open GuoQuanhao opened 1 week ago

GuoQuanhao commented 1 week ago

Bug

Transferred 355/355 items from pretrained weights
Ultralytics YOLOv8.2.46 🚀 Python-3.10.14 torch-2.3.1+cu118 CUDA:6 (Tesla V100-SXM2-16GB, 16161MiB)
                                                            CUDA:5 (Tesla V100-SXM2-16GB, 16161MiB)
                                                            CUDA:4 (Tesla V100-SXM2-16GB, 16161MiB)
                                                            CUDA:3 (Tesla V100-SXM2-16GB, 16161MiB)
                                                            CUDA:2 (Tesla V100-SXM2-16GB, 16161MiB)
                                                            CUDA:1 (Tesla V100-SXM2-16GB, 16161MiB)
engine/trainer: task=detect, mode=train, model=yolov8n.yaml, data=./ultralytics/cfg/datasets/layout.yaml, epochs=300, time=None, patience=100, batch=192, imgsz=672, save=True, save_period=-1, cache=False, device=6,5,4,3,2,1, workers=0, project=None, name=train8, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, mean=[0.92367495, 0.92194206, 0.91870715], std=[0.15719514, 0.15910767, 0.15992016], save_dir=runs/detect/train8
Overriding model.yaml nc=80 with nc=8

                   from  n    params  module                                       arguments                     
  0                  -1  1       464  ultralytics.nn.modules.conv.Conv             [3, 16, 3, 2]                 
  1                  -1  1      4672  ultralytics.nn.modules.conv.Conv             [16, 32, 3, 2]                
  2                  -1  1      7360  ultralytics.nn.modules.block.C2f             [32, 32, 1, True]             
  3                  -1  1     18560  ultralytics.nn.modules.conv.Conv             [32, 64, 3, 2]                
  4                  -1  2     49664  ultralytics.nn.modules.block.C2f             [64, 64, 2, True]             
  5                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 2]               
  6                  -1  2    197632  ultralytics.nn.modules.block.C2f             [128, 128, 2, True]           
  7                  -1  1    295424  ultralytics.nn.modules.conv.Conv             [128, 256, 3, 2]              
  8                  -1  1    460288  ultralytics.nn.modules.block.C2f             [256, 256, 1, True]           
  9                  -1  1    164608  ultralytics.nn.modules.block.SPPF            [256, 256, 5]                 
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 11             [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 12                  -1  1    148224  ultralytics.nn.modules.block.C2f             [384, 128, 1]                 
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 14             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 15                  -1  1     37248  ultralytics.nn.modules.block.C2f             [192, 64, 1]                  
 16                  -1  1     36992  ultralytics.nn.modules.conv.Conv             [64, 64, 3, 2]                
 17            [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 18                  -1  1    123648  ultralytics.nn.modules.block.C2f             [192, 128, 1]                 
 19                  -1  1    147712  ultralytics.nn.modules.conv.Conv             [128, 128, 3, 2]              
 20             [-1, 9]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 21                  -1  1    493056  ultralytics.nn.modules.block.C2f             [384, 256, 1]                 
 22        [15, 18, 21]  1    752872  ultralytics.nn.modules.head.Detect           [8, [64, 128, 256]]           
YOLOv8n summary: 225 layers, 3012408 parameters, 3012392 gradients, 8.2 GFLOPs

Transferred 319/355 items from pretrained weights
DDP: debug command /home/opt/anaconda3/envs/torch_gqh/bin/python -m torch.distributed.run --nproc_per_node 6 --master_port 24515 /home/work/.config/Ultralytics/DDP/_temp_jp0x4t9_140566440279200.py

Ultralytics YOLOv8.2.46 🚀 Python-3.10.14 torch-2.3.1+cu118 CUDA:6 (Tesla V100-SXM2-16GB, 16161MiB)
                                                            CUDA:5 (Tesla V100-SXM2-16GB, 16161MiB)
                                                            CUDA:4 (Tesla V100-SXM2-16GB, 16161MiB)
                                                            CUDA:3 (Tesla V100-SXM2-16GB, 16161MiB)
                                                            CUDA:2 (Tesla V100-SXM2-16GB, 16161MiB)
                                                            CUDA:1 (Tesla V100-SXM2-16GB, 16161MiB)
TensorBoard: Start with 'tensorboard --logdir runs/detect/train8', view at http://localhost:6006/
Overriding model.yaml nc=80 with nc=8
Freezing layer 'model.22.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
Downloading https://github.com/ultralytics/assets/releases/download/v8.2.0/yolov8n.pt to 'yolov8n.pt'...
AMP: checks skipped ⚠️, offline and unable to download YOLOv8n. Setting 'amp=True'. If you experience zero-mAP or NaN losses you can disable AMP with amp=False.
train: Scanning /home/work/guoquanhao/ultralytics/layout/labels/train-nopicfoot.cache... 4025 images, 0 backgrounds, 0 corrupt: 100%|█
train: WARNING ⚠️ /home/work/guoquanhao/ultralytics/layout/images/train-nopicfoot/doc-jwd5zy2i6qf6jmi2.jpg: 1 duplicate labels removed
train: WARNING ⚠️ /home/work/guoquanhao/ultralytics/layout/images/train-nopicfoot/doc-xebshfsts2jqpd5g.jpg: 1 duplicate labels removed
val: Scanning /home/work/guoquanhao/ultralytics/layout/labels/valid-nopicfoot.cache... 1000 images, 0 backgrounds, 0 corrupt: 100%|███
Plotting labels to runs/detect/train8/labels.jpg... 
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically... 
optimizer: AdamW(lr=0.000714, momentum=0.9) with parameter groups 57 weight(decay=0.0), 64 weight(decay=0.0015), 63 bias(decay=0.0)
W0701 15:44:03.619000 140038850516736 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1805550 closing signal SIGTERM
W0701 15:44:03.619000 140038850516736 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1805551 closing signal SIGTERM
W0701 15:44:03.621000 140038850516736 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1805552 closing signal SIGTERM
W0701 15:44:03.622000 140038850516736 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1805553 closing signal SIGTERM
W0701 15:44:03.626000 140038850516736 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1805554 closing signal SIGTERM
E0701 15:44:04.066000 140038850516736 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -11) local_rank: 0 (pid: 1805549) of binary: /home/opt/anaconda3/envs/torch_gqh/bin/python
Traceback (most recent call last):
  File "/home/opt/anaconda3/envs/torch_gqh/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/opt/anaconda3/envs/torch_gqh/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/opt/anaconda3/envs/torch_gqh/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in <module>
    main()
  File "/home/opt/anaconda3/envs/torch_gqh/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/opt/anaconda3/envs/torch_gqh/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/opt/anaconda3/envs/torch_gqh/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/opt/anaconda3/envs/torch_gqh/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/opt/anaconda3/envs/torch_gqh/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/work/.config/Ultralytics/DDP/_temp_jp0x4t9_140566440279200.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-01_15:44:03
  host      : yq01-sys-hic-k8s-v100-box-a223-0033.yq01.baidu.com
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 1805549)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 1805549
============================================================
Traceback (most recent call last):
  File "/home/work/guoquanhao/ultralytics/yolo_train.py", line 11, in <module>
    results = model.train(data="./ultralytics/cfg/datasets/layout.yaml", epochs=300, imgsz=672, device=os.getenv('CUDA_VISIBLE_DEVICES'), workers=0, batch=192)
  File "/home/work/guoquanhao/ultralytics/ultralytics/engine/model.py", line 650, in train
    self.trainer.train()
  File "/home/work/guoquanhao/ultralytics/ultralytics/engine/trainer.py", line 205, in train
    raise e
  File "/home/work/guoquanhao/ultralytics/ultralytics/engine/trainer.py", line 203, in train
    subprocess.run(cmd, check=True)
  File "/home/opt/anaconda3/envs/torch_gqh/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/home/opt/anaconda3/envs/torch_gqh/bin/python', '-m', 'torch.distributed.run', '--nproc_per_node', '6', '--master_port', '24515', '/home/work/.config/Ultralytics/DDP/_temp_jp0x4t9_140566440279200.py']' returned non-zero exit status 1.

Environment

Package                   Version              Editable project location
------------------------- -------------------- ---------------------------------
absl-py                   2.1.0
aiofiles                  23.2.1
albucore                  0.0.12
albumentations            1.4.10
altair                    5.3.0
annotated-types           0.7.0
anyio                     4.4.0
asttokens                 2.4.1
attrs                     23.2.0
certifi                   2024.6.2
charset-normalizer        3.3.2
click                     8.1.7
cmake                     3.29.6
coloredlogs               15.0.1
contourpy                 1.2.1
cycler                    0.12.1
decorator                 5.1.1
dnspython                 2.6.1
email_validator           2.2.0
exceptiongroup            1.2.1
executing                 2.0.1
fastapi                   0.111.0
fastapi-cli               0.0.4
ffmpy                     0.3.2
filelock                  3.15.4
flatbuffers               24.3.25
fonttools                 4.53.0
fsspec                    2024.6.0
gitdb                     4.0.11
GitPython                 3.1.43
gradio                    4.31.5
gradio_client             0.16.4
grpcio                    1.64.1
h11                       0.14.0
httpcore                  1.0.5
httptools                 0.6.1
httpx                     0.27.0
huggingface-hub           0.23.2
humanfriendly             10.0
idna                      3.7
imageio                   2.34.2
importlib_resources       6.4.0
ipython                   8.26.0
jedi                      0.19.1
Jinja2                    3.1.4
joblib                    1.4.2
jsonschema                4.22.0
jsonschema-specifications 2023.12.1
kiwisolver                1.4.5
lazy_loader               0.4
lit                       18.1.8
Markdown                  3.6
markdown-it-py            3.0.0
MarkupSafe                2.1.5
matplotlib                3.9.0
matplotlib-inline         0.1.7
mdurl                     0.1.2
mpmath                    1.3.0
networkx                  3.3
numpy                     1.26.4
nvidia-cublas-cu11        11.11.3.6
nvidia-cuda-cupti-cu11    11.8.87
nvidia-cuda-nvrtc-cu11    11.8.89
nvidia-cuda-runtime-cu11  11.8.89
nvidia-cudnn-cu11         8.7.0.84
nvidia-cufft-cu11         10.9.0.58
nvidia-curand-cu11        10.3.0.86
nvidia-cusolver-cu11      11.4.1.48
nvidia-cusparse-cu11      11.7.5.86
nvidia-nccl-cu11          2.20.5
nvidia-nvtx-cu11          11.8.86
onnx                      1.14.0
onnxruntime               1.15.1
onnxruntime-gpu           1.16.3
onnxsim                   0.4.36
opencv-python             4.10.0.84
opencv-python-headless    4.10.0.84
orjson                    3.10.5
packaging                 24.1
pandas                    2.2.2
parso                     0.8.4
pexpect                   4.9.0
pillow                    10.3.0
pip                       24.0
prompt_toolkit            3.0.47
protobuf                  4.25.3
psutil                    5.9.8
ptyprocess                0.7.0
pure-eval                 0.2.2
py-cpuinfo                9.0.0
pycocotools               2.0.7
pydantic                  2.7.4
pydantic_core             2.18.4
pydub                     0.25.1
Pygments                  2.18.0
pyparsing                 3.1.2
python-dateutil           2.9.0.post0
python-dotenv             1.0.1
python-multipart          0.0.9
pytz                      2024.1
PyYAML                    6.0.1
referencing               0.35.1
requests                  2.32.3
rich                      13.7.1
rpds-py                   0.18.1
ruff                      0.5.0
safetensors               0.4.3
scikit-image              0.24.0
scikit-learn              1.5.0
scipy                     1.13.0
seaborn                   0.13.2
semantic-version          2.10.0
setuptools                69.5.1
shellingham               1.5.4
six                       1.16.0
smmap                     5.0.1
sniffio                   1.3.1
stack-data                0.6.3
starlette                 0.37.2
sympy                     1.12.1
tensorboard               2.17.0
tensorboard-data-server   0.7.2
thop                      0.1.1.post2209072238
threadpoolctl             3.5.0
tifffile                  2024.6.18
tomli                     2.0.1
tomlkit                   0.12.0
toolz                     0.12.1
torch                     2.3.1+cu118
torchaudio                2.3.1+cu118
torchvision               0.18.1+cu118
tqdm                      4.66.4
traitlets                 5.14.3
triton                    2.3.1
typer                     0.12.3
typing_extensions         4.12.2
tzdata                    2024.1
ujson                     5.10.0
ultralytics               8.2.46               /home/work/guoquanhao/ultralytics
ultralytics-thop          2.0.0
urllib3                   2.2.2
uvicorn                   0.30.1
uvloop                    0.19.0
watchfiles                0.22.0
wcwidth                   0.2.13
websockets                11.0.3
Werkzeug                  3.0.3
wheel                     0.43.0

Minimal Reproducible Example

yolo_train.py

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '6,5,4,3,2,1'

from ultralytics import YOLO

# Load a model
model = YOLO("yolov8n.yaml").load("./pretrained_model/yolov8n.pt")  # build from YAML and transfer weights

# Train the model
results = model.train(data="./ultralytics/cfg/datasets/layout.yaml", epochs=300, imgsz=672, device=os.getenv('CUDA_VISIBLE_DEVICES'), workers=0, batch=192)

python yolo_train.py

glenn-jocher commented 3 days ago

@GuoQuanhao hello,

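Thank you for providing detailed information about the issue you're encountering with multi-GPU training. The log shows that the rank 0 worker (pid: 1805549) ended with exitcode: -11, meaning it received signal 11 (SIGSEGV, a segmentation fault). torch.distributed then sent SIGTERM to the remaining workers and reported the failure as torch.distributed.elastic.multiprocessing.errors.ChildFailedError, so the crash happens inside a worker process rather than in the launcher itself.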
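As a small illustration of how to read that exit code, here is a minimal sketch using only the Python standard library. The helper name describe_exit_code is hypothetical and not part of PyTorch or Ultralytics; it relies on the usual convention that a negative exit code -N means the process was killed by signal N.

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Return a readable description of a child process exit code."""
    if exitcode < 0:
        # Negative codes follow the subprocess convention: -N means "killed by signal N".
        try:
            return f"terminated by {signal.Signals(-exitcode).name}"
        except ValueError:
            return f"terminated by unknown signal {-exitcode}"
    return f"exited with status {exitcode}"


# The value reported for rank 0 in the log above.
print(describe_exit_code(-11))
```

The same helper applied to the outer CalledProcessError status (1) reports an ordinary non-zero exit, which is why the inner "Root Cause" block is the more informative part of the log.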

To help us diagnose and resolve this issue more effectively, could you please provide a minimal reproducible example? This will allow us to replicate the problem on our end. You can find guidelines on how to create a minimal reproducible example here: Minimum Reproducible Example.

Additionally, please ensure that you are using the latest versions of all relevant packages, including PyTorch and Ultralytics YOLO. Sometimes, issues are resolved in newer releases, and updating might fix the problem.

Here are a few steps you can try to troubleshoot the issue:

  1. Verify CUDA and NCCL Installation: Ensure that your CUDA and NCCL installations are correctly set up and compatible with your PyTorch version.

  2. Reduce Batch Size: Sometimes, reducing the batch size can help if the issue is related to memory constraints.

  3. Check Environment Variables: Ensure that CUDA_VISIBLE_DEVICES is correctly set and that all GPUs are accessible.

  4. Simplify the Setup: Try running the training with fewer GPUs to see if the issue persists. For example, start with 2 GPUs and then gradually increase the number.

  5. Use Different Distributed Backend: You can try using a different distributed backend like gloo instead of nccl to see if it resolves the issue.
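For step 3, it may help to keep in mind that CUDA renumbers the devices listed in CUDA_VISIBLE_DEVICES in the order given, so logical device 0 inside the process is the first entry of the list. The sketch below is illustrative only: logical_to_physical is a hypothetical helper, and it assumes the variable contains plain integer indices rather than GPU UUIDs.

```python
def logical_to_physical(visible_devices: str) -> dict[int, int]:
    """Map the logical CUDA index seen by the process to the physical GPU index.

    Assumes a comma-separated list of integer indices (no UUIDs).
    """
    physical = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    return dict(enumerate(physical))


# The value used in the original yolo_train.py.
mapping = logical_to_physical("6,5,4,3,2,1")
for logical, phys in mapping.items():
    print(f"cuda:{logical} -> physical GPU {phys}")
```

If each worker uses the logical device matching its local rank (an assumption about the launcher, not something the log states), the failing local_rank 0 would correspond to physical GPU 6, so testing that card on its own could be a useful first check.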

Here is an example of how you might modify your script to use fewer GPUs and a smaller total batch size:

import os

# Expose only two GPUs; set this before importing ultralytics (and therefore torch)
os.environ['CUDA_VISIBLE_DEVICES'] = '6,5'

from ultralytics import YOLO

# Load a model
model = YOLO("yolov8n.yaml").load("./pretrained_model/yolov8n.pt")  # build from YAML and transfer weights

# Train the model on 2 GPUs with a smaller total batch
results = model.train(data="./ultralytics/cfg/datasets/layout.yaml", epochs=300, imgsz=672, device=os.getenv('CUDA_VISIBLE_DEVICES'), workers=0, batch=96)
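When changing the GPU count and batch size together, it is worth checking the per-GPU load. The arithmetic below is a sketch under the assumption that the batch argument is the total batch and is split evenly across the selected GPUs; that splitting behavior is not stated in this thread, and per_gpu_batch is a hypothetical helper.

```python
def per_gpu_batch(total_batch: int, num_gpus: int) -> int:
    """Images per GPU per step, assuming the total batch is split evenly."""
    return total_batch // max(num_gpus, 1)


original = per_gpu_batch(192, 6)   # settings from the original report
suggested = per_gpu_batch(96, 2)   # settings from the 2-GPU example above

print(f"original:  {original} images per GPU")
print(f"suggested: {suggested} images per GPU")
```

Under that assumption, batch=96 on 2 GPUs places more images on each card than batch=192 on 6 GPUs did, so a total batch of 64 or lower would be needed on 2 GPUs if the goal is to rule out per-GPU memory pressure.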

If the issue persists, please provide any additional logs or error messages that might help us diagnose the problem further.
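One possible way to collect more detailed logs is to set a few diagnostic environment variables at the top of yolo_train.py, before training starts. This is a sketch, not a confirmed fix: PYTHONFAULTHANDLER, NCCL_DEBUG and TORCH_DISTRIBUTED_DEBUG are existing Python, NCCL and PyTorch settings, while the helper name enable_crash_diagnostics is hypothetical. It assumes the DDP workers inherit the parent environment, which is the default for subprocess.run when no env argument is passed.

```python
import os

# Diagnostic settings to be inherited by the DDP worker processes.
DIAGNOSTIC_ENV = {
    "PYTHONFAULTHANDLER": "1",          # dump a Python traceback on SIGSEGV
    "NCCL_DEBUG": "INFO",               # verbose NCCL initialization logs
    "TORCH_DISTRIBUTED_DEBUG": "DETAIL",  # extra torch.distributed checks and logs
}


def enable_crash_diagnostics() -> None:
    """Export the diagnostic variables into this process's environment."""
    os.environ.update(DIAGNOSTIC_ENV)


# Call this before model.train(...) so the launched workers see the variables.
enable_crash_diagnostics()
```

With PYTHONFAULTHANDLER set, a worker that segfaults should print the Python stack of each thread to stderr before exiting, which usually narrows down which library call triggered the crash.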

Thank you for your patience and cooperation. We look forward to resolving this issue together.