Multi-GPU not working in python mode (not CLI)

bzisl commented 1 year ago

Search before asking

[X] I have searched the YOLOv8 issues and found no similar bug report.

YOLOv8 Component

No response

Bug

When I try to launch multi-GPU training (which worked in some earlier versions), torch DDP launches my own script multiple times, as if it were the YOLO training entry point:

Ultralytics YOLOv8.0.54 🚀 Python-3.10.6 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce RTX 3090, 24238MiB)
                                                            CUDA:1 (NVIDIA GeForce RTX 3090, 24260MiB)
yolo/engine/trainer: task=detect, mode=train, model=yolov8n.pt, data=/home/user/AI/tests/data/myproj.yaml, epochs=200, patience=50, batch=148, imgsz=1024, save=True, save_period=-1, cache=True, device=[0, 1], workers=8, project=myproj, name=brain04_01, exist_ok=True, pretrained=False, optimizer=SGD, verbose=True, seed=19750129, deterministic=True, single_cls=False, image_weights=False, rect=True, cos_lr=False, close_mosaic=10, resume=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, show=False, save_txt=False, save_conf=False, save_crop=False, hide_labels=False, hide_conf=False, vid_stride=1, line_thickness=3, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, boxes=True, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, fl_gamma=0.0, label_smoothing=0.0, nbs=64, hsv_h=0, hsv_s=0, hsv_v=0.4, degrees=0.0, translate=0, scale=0.3, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.0, mosaic=0.0, mixup=0.0, copy_paste=0.0, cfg=None, v5loader=False, tracker=botsort.yaml, save_dir=myproj/brain04_01
Overriding model.yaml nc=80 with nc=2

                   from  n    params  module                                       arguments
  0                  -1  1       464  ultralytics.nn.modules.Conv                  [3, 16, 3, 2]
  1                  -1  1      4672  ultralytics.nn.modules.Conv                  [16, 32, 3, 2]
  2                  -1  1      7360  ultralytics.nn.modules.C2f                   [32, 32, 1, True]
  3                  -1  1     18560  ultralytics.nn.modules.Conv                  [32, 64, 3, 2]
  4                  -1  2     49664  ultralytics.nn.modules.C2f                   [64, 64, 2, True]
  5                  -1  1     73984  ultralytics.nn.modules.Conv                  [64, 128, 3, 2]
  6                  -1  2    197632  ultralytics.nn.modules.C2f                   [128, 128, 2, True]
  7                  -1  1    295424  ultralytics.nn.modules.Conv                  [128, 256, 3, 2]
  8                  -1  1    460288  ultralytics.nn.modules.C2f                   [256, 256, 1, True]
  9                  -1  1    164608  ultralytics.nn.modules.SPPF                  [256, 256, 5]
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']
 11             [-1, 6]  1         0  ultralytics.nn.modules.Concat                [1]
 12                  -1  1    148224  ultralytics.nn.modules.C2f                   [384, 128, 1]
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']
 14             [-1, 4]  1         0  ultralytics.nn.modules.Concat                [1]
 15                  -1  1     37248  ultralytics.nn.modules.C2f                   [192, 64, 1]
 16                  -1  1     36992  ultralytics.nn.modules.Conv                  [64, 64, 3, 2]
 17            [-1, 12]  1         0  ultralytics.nn.modules.Concat                [1]
 18                  -1  1    123648  ultralytics.nn.modules.C2f                   [192, 128, 1]
 19                  -1  1    147712  ultralytics.nn.modules.Conv                  [128, 128, 3, 2]
 20             [-1, 9]  1         0  ultralytics.nn.modules.Concat                [1]
 21                  -1  1    493056  ultralytics.nn.modules.C2f                   [384, 256, 1]
 22        [15, 18, 21]  1    751702  ultralytics.nn.modules.Detect                [2, [64, 128, 256]]
Model summary: 225 layers, 3011238 parameters, 3011222 gradients, 8.2 GFLOPs

Transferred 319/355 items from pretrained weights
Running DDP command ['/home/user/AI/tests/yolov8_venv/bin/python3', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '58683', '/home/user/AI/tests/stage4.py', 'task=detect', 'mode=train', 'model=yolov8n.pt', 'data=/home/user/AI/tests/data/myproj.yaml', 'epochs=200', 'patience=50', 'batch=148', 'imgsz=1024', 'save=True', 'save_period=-1', 'cache=True', 'device=[0, 1]', 'workers=8', 'project=myproj', 'name=brain04_01', 'exist_ok=True', 'pretrained=False', 'optimizer=SGD', 'verbose=True', 'seed=19750129', 'deterministic=True', 'single_cls=False', 'image_weights=False', 'rect=True', 'cos_lr=False', 'close_mosaic=10', 'resume=False', 'overlap_mask=True', 'mask_ratio=4', 'dropout=0.0', 'val=True', 'split=val', 'save_json=False', 'save_hybrid=False', 'conf=None', 'iou=0.7', 'max_det=300', 'half=False', 'dnn=False', 'plots=True', 'source=None', 'show=False', 'save_txt=False', 'save_conf=False', 'save_crop=False', 'hide_labels=False', 'hide_conf=False', 'vid_stride=1', 'line_thickness=3', 'visualize=False', 'augment=False', 'agnostic_nms=False', 'classes=None', 'retina_masks=False', 'boxes=True', 'format=torchscript', 'keras=False', 'optimize=False', 'int8=False', 'dynamic=False', 'simplify=False', 'opset=None', 'workspace=4', 'nms=False', 'lr0=0.01', 'lrf=0.01', 'momentum=0.937', 'weight_decay=0.0005', 'warmup_epochs=3.0', 'warmup_momentum=0.8', 'warmup_bias_lr=0.1', 'box=7.5', 'cls=0.5', 'dfl=1.5', 'fl_gamma=0.0', 'label_smoothing=0.0', 'nbs=64', 'hsv_h=0', 'hsv_s=0', 'hsv_v=0.4', 'degrees=0.0', 'translate=0', 'scale=0.3', 'shear=0.0', 'perspective=0.0', 'flipud=0.0', 'fliplr=0.0', 'mosaic=0.0', 'mixup=0.0', 'copy_paste=0.0', 'cfg=None', 'v5loader=False', 'tracker=botsort.yaml']
Using torch 1.13.1+cu117.
GPU count: 2
0 NVIDIA GeForce RTX 3090 25414860800
1 NVIDIA GeForce RTX 3090 25438126080
Setup complete ✅ (48 CPUs, 125.6 GB RAM, 494.1/1862.0 GB disk)
usage: stage4.py [-h] (-t | -e | -o) -i INPUT [-g GPU] [-r RUN] [-s SELECT]
stage4.py: error: the following arguments are required: -i/--input
Using torch 1.13.1+cu117.
GPU count: 2
0 NVIDIA GeForce RTX 3090 25414860800
1 NVIDIA GeForce RTX 3090 25438126080
usage: stage4.py [-h] (-t | -e | -o) -i INPUT [-g GPU] [-r RUN] [-s SELECT]
stage4.py: error: the following arguments are required: -i/--input

Environment

Ultralytics YOLOv8.0.54 🚀 Python-3.10.6 torch-1.13.1+cu117 Ubuntu 22.04

Minimal Reproducible Example

No response

Additional

No response

Are you willing to submit a PR?

[X] Yes I'd like to help by submitting a PR!
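
The log in this report shows stage4.py being started again by torch.distributed.run and then failing in its own argument parser. As a minimal sketch for telling the two launches apart, a script can check the environment variables that torch.distributed.run exports to each worker (LOCAL_RANK, RANK, WORLD_SIZE). The helper names `is_torchrun_worker` and `describe_launch` are hypothetical and are not part of Ultralytics or of stage4.py; this only helps diagnose which launch is running, it is not a fix.

```python
import os
import sys


def is_torchrun_worker(env=None):
    """Return True when the process looks like a torch.distributed.run worker."""
    env = os.environ if env is None else env
    # torch.distributed.run sets LOCAL_RANK (along with RANK and WORLD_SIZE)
    # in the environment of every worker process it starts.
    return "LOCAL_RANK" in env


def describe_launch(argv, env=None):
    """Summarise how the script was started (hypothetical diagnostic helper)."""
    role = "DDP worker" if is_torchrun_worker(env) else "direct launch"
    return f"{role}: received {len(argv)} argument(s)"


if __name__ == "__main__":
    # Printing this before any argparse call shows which launch is failing.
    print(describe_launch(sys.argv[1:]))
```

Printing this line at the very top of the script, before any argument parsing, would show whether the usage error comes from the original invocation or from the relaunched workers.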

glenn-jocher commented 1 year ago

@bzisl DDP Python works correctly for me in our Docker image, so I'm unable to reproduce any problem like you describe:

from ultralytics import YOLO

model = YOLO()

model.train(data='coco128.yaml', device=[0,1])

Ultralytics YOLOv8.0.54 🚀 Python-3.10.8 torch-1.13.1 CUDA:0 (A100-SXM-80GB, 81251MiB)
                                                      CUDA:1 (A100-SXM-80GB, 81251MiB)
yolo/engine/trainer: task=detect, mode=train, model=yolov8n.pt, data=coco128.yaml, epochs=100, patience=50, batch=16, imgsz=640, save=True, save_period=-1, cache=False, device=[0, 1], workers=8, project=None, name=None, exist_ok=False, pretrained=False, optimizer=SGD, verbose=True, seed=0, deterministic=True, single_cls=False, image_weights=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, show=False, save_txt=False, save_conf=False, save_crop=False, hide_labels=False, hide_conf=False, vid_stride=1, line_thickness=3, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, boxes=True, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, fl_gamma=0.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0, cfg=None, v5loader=False, tracker=botsort.yaml, save_dir=/usr/src/app/runs/detect/train

                   from  n    params  module                                       arguments                     
  0                  -1  1       464  ultralytics.nn.modules.Conv                  [3, 16, 3, 2]                 
  1                  -1  1      4672  ultralytics.nn.modules.Conv                  [16, 32, 3, 2]                
  2                  -1  1      7360  ultralytics.nn.modules.C2f                   [32, 32, 1, True]             
  3                  -1  1     18560  ultralytics.nn.modules.Conv                  [32, 64, 3, 2]                
  4                  -1  2     49664  ultralytics.nn.modules.C2f                   [64, 64, 2, True]             
  5                  -1  1     73984  ultralytics.nn.modules.Conv                  [64, 128, 3, 2]               
  6                  -1  2    197632  ultralytics.nn.modules.C2f                   [128, 128, 2, True]           
  7                  -1  1    295424  ultralytics.nn.modules.Conv                  [128, 256, 3, 2]              
  8                  -1  1    460288  ultralytics.nn.modules.C2f                   [256, 256, 1, True]           
  9                  -1  1    164608  ultralytics.nn.modules.SPPF                  [256, 256, 5]                 
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 11             [-1, 6]  1         0  ultralytics.nn.modules.Concat                [1]                           
 12                  -1  1    148224  ultralytics.nn.modules.C2f                   [384, 128, 1]                 
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 14             [-1, 4]  1         0  ultralytics.nn.modules.Concat                [1]                           
 15                  -1  1     37248  ultralytics.nn.modules.C2f                   [192, 64, 1]                  
 16                  -1  1     36992  ultralytics.nn.modules.Conv                  [64, 64, 3, 2]                
 17            [-1, 12]  1         0  ultralytics.nn.modules.Concat                [1]                           
 18                  -1  1    123648  ultralytics.nn.modules.C2f                   [192, 128, 1]                 
 19                  -1  1    147712  ultralytics.nn.modules.Conv                  [128, 128, 3, 2]              
 20             [-1, 9]  1         0  ultralytics.nn.modules.Concat                [1]                           
 21                  -1  1    493056  ultralytics.nn.modules.C2f                   [384, 256, 1]                 
 22        [15, 18, 21]  1    897664  ultralytics.nn.modules.Detect                [80, [64, 128, 256]]          
Model summary: 225 layers, 3157200 parameters, 3157184 gradients, 8.9 GFLOPs

Transferred 355/355 items from pretrained weights
Running DDP command ['/opt/conda/bin/python', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '42909', '/root/.config/Ultralytics/DDP/_temp_jdxnjxrs140516923834624.py', 'task=detect', 'mode=train', 'model=yolov8n.pt', 'data=/usr/src/app/ultralytics/datasets/coco128.yaml', 'epochs=100', 'patience=50', 'batch=16', 'imgsz=640', 'save=True', 'save_period=-1', 'cache=False', 'device=[0, 1]', 'workers=8', 'project=None', 'name=None', 'exist_ok=False', 'pretrained=False', 'optimizer=SGD', 'verbose=True', 'seed=0', 'deterministic=True', 'single_cls=False', 'image_weights=False', 'rect=False', 'cos_lr=False', 'close_mosaic=10', 'resume=False', 'overlap_mask=True', 'mask_ratio=4', 'dropout=0.0', 'val=True', 'split=val', 'save_json=False', 'save_hybrid=False', 'conf=None', 'iou=0.7', 'max_det=300', 'half=False', 'dnn=False', 'plots=True', 'source=None', 'show=False', 'save_txt=False', 'save_conf=False', 'save_crop=False', 'hide_labels=False', 'hide_conf=False', 'vid_stride=1', 'line_thickness=3', 'visualize=False', 'augment=False', 'agnostic_nms=False', 'classes=None', 'retina_masks=False', 'boxes=True', 'format=torchscript', 'keras=False', 'optimize=False', 'int8=False', 'dynamic=False', 'simplify=False', 'opset=None', 'workspace=4', 'nms=False', 'lr0=0.01', 'lrf=0.01', 'momentum=0.937', 'weight_decay=0.0005', 'warmup_epochs=3.0', 'warmup_momentum=0.8', 'warmup_bias_lr=0.1', 'box=7.5', 'cls=0.5', 'dfl=1.5', 'fl_gamma=0.0', 'label_smoothing=0.0', 'nbs=64', 'hsv_h=0.015', 'hsv_s=0.7', 'hsv_v=0.4', 'degrees=0.0', 'translate=0.1', 'scale=0.5', 'shear=0.0', 'perspective=0.0', 'flipud=0.0', 'fliplr=0.5', 'mosaic=1.0', 'mixup=0.0', 'copy_paste=0.0', 'cfg=None', 'v5loader=False', 'tracker=botsort.yaml']
DDP settings: RANK 0, WORLD_SIZE 2, DEVICE cuda:0

Transferred 355/355 items from pretrained weights
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
AMP: checks passed ✅
optimizer: SGD(lr=0.01) with parameter groups 57 weight(decay=0.0), 64 weight(decay=0.0005), 63 bias
train: Scanning /usr/src/datasets/coco128/labels/train2017.cache... 126 images, 2 backgrounds, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
val: Scanning /usr/src/datasets/coco128/labels/train2017.cache... 126 images, 2 backgrounds, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
Plotting labels to /usr/src/app/runs/detect/train/labels.jpg... 
Image sizes 640 train, 640 val
Using 16 dataloader workers
Logging results to /usr/src/app/runs/detect/train
Starting training for 100 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      1/100      1.37G      1.208      1.523      1.237         77        640: 100%|██████████| 8/8 [00:05<00:00,  1.50it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:04<00:00,  1.67it/s]
                   all        128        929      0.641      0.539      0.615      0.456
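
One visible difference between this working log and the failing report above is the script path inside the "Running DDP command" line: here it is a generated file under /root/.config/Ultralytics/DDP/, while in the failing report it is the user's own stage4.py. The sketch below pulls that path out of such a log line so the two cases can be compared. The function names are hypothetical, the sample commands are shortened copies of the ones in the logs, and the temp-file check is a heuristic based only on the path shown here.

```python
import ast


def entry_script(log_line):
    """Return the script path from a 'Running DDP command [...]' log line."""
    start = log_line.index("[")
    end = log_line.rindex("]") + 1
    cmd = ast.literal_eval(log_line[start:end])  # the command is a Python list literal
    # The script is the first item ending in '.py'; key=value overrides follow it.
    return next(arg for arg in cmd if arg.endswith(".py"))


def is_generated_temp_file(path):
    """Heuristic: matches the temp-file location seen in the working log."""
    return "/Ultralytics/DDP/_temp_" in path


# Shortened copies of the two commands from the logs in this thread.
working = ("Running DDP command ['/opt/conda/bin/python', '-m', 'torch.distributed.run', "
           "'--nproc_per_node', '2', '--master_port', '42909', "
           "'/root/.config/Ultralytics/DDP/_temp_jdxnjxrs140516923834624.py', 'task=detect']")
failing = ("Running DDP command ['/home/user/AI/tests/yolov8_venv/bin/python3', '-m', "
           "'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '58683', "
           "'/home/user/AI/tests/stage4.py', 'task=detect']")

if __name__ == "__main__":
    for line in (working, failing):
        path = entry_script(line)
        print(path, "generated" if is_generated_temp_file(path) else "user script")
```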

glenn-jocher commented 1 year ago

Environments

YOLOv8 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Notebooks with free GPU:
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

bzisl commented 1 year ago

I installed the CUDA Docker image and the Ultralytics Docker image and got exactly the same result. I will keep checking to see if I can find the problem.

Harsh188 commented 1 year ago

Hi I'm facing the same issue. The model works perfectly fine if I don't specify the device parameter within model.

Once I specify device=0 or device=[0,1] I run into the following error:

Transferred 469/475 items from pretrained weights
Running DDP command ['/usr/bin/python3', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '45303', '/PCWSeverityScoring-poc/severityModel/sceneBased/objectDetection.py', 'task=detect', 'mode=train', 'model=yolov8m.pt', 'data=/PCWSeverityScoring-poc/data/Yolo/TotalDataset.yaml', 'epochs=10', 'patience=50', 'batch=16', 'imgsz=640', 'save=True', 'save_period=-1', 'cache=False', 'device=0', 'workers=8', 'project=None', 'name=yolov8m_custom', 'exist_ok=False', 'pretrained=False', 'optimizer=SGD', 'verbose=True', 'seed=0', 'deterministic=True', 'single_cls=False', 'image_weights=False', 'rect=False', 'cos_lr=False', 'close_mosaic=10', 'resume=False', 'overlap_mask=True', 'mask_ratio=4', 'dropout=0.0', 'val=True', 'split=val', 'save_json=False', 'save_hybrid=False', 'conf=None', 'iou=0.7', 'max_det=300', 'half=False', 'dnn=False', 'plots=True', 'source=None', 'show=False', 'save_txt=False', 'save_conf=False', 'save_crop=False', 'hide_labels=False', 'hide_conf=False', 'vid_stride=1', 'line_thickness=3', 'visualize=False', 'augment=False', 'agnostic_nms=False', 'classes=None', 'retina_masks=False', 'boxes=True', 'format=torchscript', 'keras=False', 'optimize=False', 'int8=False', 'dynamic=False', 'simplify=False', 'opset=None', 'workspace=4', 'nms=False', 'lr0=0.01', 'lrf=0.01', 'momentum=0.937', 'weight_decay=0.0005', 'warmup_epochs=3.0', 'warmup_momentum=0.8', 'warmup_bias_lr=0.1', 'box=7.5', 'cls=0.5', 'dfl=1.5', 'fl_gamma=0.0', 'label_smoothing=0.0', 'nbs=64', 'hsv_h=0.015', 'hsv_s=0.7', 'hsv_v=0.4', 'degrees=0.0', 'translate=0.1', 'scale=0.5', 'shear=0.0', 'perspective=0.0', 'flipud=0.0', 'fliplr=0.5', 'mosaic=1.0', 'mixup=0.0', 'copy_paste=0.0', 'cfg=None', 'v5loader=False', 'tracker=botsort.yaml']
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, 
please further tune the variable for optimal performance in your application as needed. 
*****************************************
usage: objectDetection.py [-h] [--epochs -E] [--batch_size -B] [--verbose] [--yaml_path YAML_PATH]
objectDetection.py: error: unrecognized arguments: task=detect mode=train model=yolov8m.pt data=/PCWSeverityScoring-poc/data/Yolo/TotalDataset.yaml epochs=10 patience=50 batch=16 imgsz=640 save=True save_period=-1 cache=False device=0 workers=8 project=None name=yolov8m_custom exist_ok=False pretrained=False optimizer=SGD verbose=True seed=0 deterministic=True single_cls=False image_weights=False rect=False cos_lr=False close_mosaic=10 resume=False overlap_mask=True mask_ratio=4 dropout=0.0 val=True split=val save_json=False save_hybrid=False conf=None iou=0.7 max_det=300 half=False dnn=False plots=True source=None show=False save_txt=False save_conf=False save_crop=False hide_labels=False hide_conf=False vid_stride=1 line_thickness=3 visualize=False augment=False agnostic_nms=False classes=None retina_masks=False boxes=True format=torchscript keras=False optimize=False int8=False dynamic=False simplify=False opset=None workspace=4 nms=False lr0=0.01 lrf=0.01 momentum=0.937 weight_decay=0.0005 warmup_epochs=3.0 warmup_momentum=0.8 warmup_bias_lr=0.1 box=7.5 cls=0.5 dfl=1.5 fl_gamma=0.0 label_smoothing=0.0 nbs=64 hsv_h=0.015 hsv_s=0.7 hsv_v=0.4 degrees=0.0 translate=0.1 scale=0.5 shear=0.0 perspective=0.0 flipud=0.0 fliplr=0.5 mosaic=1.0 mixup=0.0 copy_paste=0.0 cfg=None v5loader=False tracker=botsort.yaml
usage: objectDetection.py [-h] [--epochs -E] [--batch_size -B] [--verbose] [--yaml_path YAML_PATH]
objectDetection.py: error: unrecognized arguments: task=detect mode=train model=yolov8m.pt data=/PCWSeverityScoring-poc/data/Yolo/TotalDataset.yaml epochs=10 patience=50 batch=16 imgsz=640 save=True save_period=-1 cache=False device=0 workers=8 project=None name=yolov8m_custom exist_ok=False pretrained=False optimizer=SGD verbose=True seed=0 deterministic=True single_cls=False image_weights=False rect=False cos_lr=False close_mosaic=10 resume=False overlap_mask=True mask_ratio=4 dropout=0.0 val=True split=val save_json=False save_hybrid=False conf=None iou=0.7 max_det=300 half=False dnn=False plots=True source=None show=False save_txt=False save_conf=False save_crop=False hide_labels=False hide_conf=False vid_stride=1 line_thickness=3 visualize=False augment=False agnostic_nms=False classes=None retina_masks=False boxes=True format=torchscript keras=False optimize=False int8=False dynamic=False simplify=False opset=None workspace=4 nms=False lr0=0.01 lrf=0.01 momentum=0.937 weight_decay=0.0005 warmup_epochs=3.0 warmup_momentum=0.8 warmup_bias_lr=0.1 box=7.5 cls=0.5 dfl=1.5 fl_gamma=0.0 label_smoothing=0.0 nbs=64 hsv_h=0.015 hsv_s=0.7 hsv_v=0.4 degrees=0.0 translate=0.1 scale=0.5 shear=0.0 perspective=0.0 flipud=0.0 fliplr=0.5 mosaic=1.0 mixup=0.0 copy_paste=0.0 cfg=None v5loader=False tracker=botsort.yaml
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 208) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================ 
/PCWSeverityScoring-poc/severityModel/sceneBased/objectDetection.py FAILED
------------------------------------------------------------ 
Failures:
[1]:
  time      : 2023-03-24_04:32:53
  host      : 20e4dd801c07
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 209)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------ 
Root Cause (first observed failure):
[0]:
  time      : 2023-03-24_04:32:53
  host      : 20e4dd801c07
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 208)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================ 
Traceback (most recent call last):
  File "objectDetection.py", line 105, in <module>
    cobj.train(args.yaml_path,640)
  File "objectDetection.py", line 84, in train
    results = self.model.train(
  File "/usr/local/lib/python3.8/dist-packages/ultralytics/yolo/engine/model.py", line 326, in train
    self.trainer.train()
  File "/usr/local/lib/python3.8/dist-packages/ultralytics/yolo/engine/trainer.py", line 182, in train
    raise e
  File "/usr/local/lib/python3.8/dist-packages/ultralytics/yolo/engine/trainer.py", line 180, in train
    subprocess.run(cmd, check=True)
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '45303', '/PCWSeverityScoring-poc/severityModel/sceneBased/objectDetection.py', 'task=detect', 'mode=train', 'model=yolov8m.pt', 'data=/PCWSeverityScoring-poc/data/Yolo/TotalDataset.yaml', 'epochs=10', 'patience=50', 'batch=16', 'imgsz=640', 'save=True', 'save_period=-1', 'cache=False', 'device=0', 'workers=8', 'project=None', 'name=yolov8m_custom', 'exist_ok=False', 'pretrained=False', 'optimizer=SGD', 'verbose=True', 'seed=0', 'deterministic=True', 'single_cls=False', 'image_weights=False', 'rect=False', 'cos_lr=False', 'close_mosaic=10', 'resume=False', 'overlap_mask=True', 'mask_ratio=4', 'dropout=0.0', 'val=True', 'split=val', 'save_json=False', 'save_hybrid=False', 'conf=None', 'iou=0.7', 'max_det=300', 'half=False', 'dnn=False', 'plots=True', 'source=None', 'show=False', 'save_txt=False', 'save_conf=False', 'save_crop=False', 'hide_labels=False', 'hide_conf=False', 'vid_stride=1', 'line_thickness=3', 'visualize=False', 'augment=False', 'agnostic_nms=False', 'classes=None', 'retina_masks=False', 'boxes=True', 'format=torchscript', 'keras=False', 'optimize=False', 'int8=False', 'dynamic=False', 'simplify=False', 'opset=None', 'workspace=4', 'nms=False', 'lr0=0.01', 'lrf=0.01', 'momentum=0.937', 'weight_decay=0.0005', 'warmup_epochs=3.0', 'warmup_momentum=0.8', 'warmup_bias_lr=0.1', 'box=7.5', 'cls=0.5', 'dfl=1.5', 'fl_gamma=0.0', 'label_smoothing=0.0', 'nbs=64', 'hsv_h=0.015', 'hsv_s=0.7', 'hsv_v=0.4', 'degrees=0.0', 'translate=0.1', 'scale=0.5', 'shear=0.0', 'perspective=0.0', 'flipud=0.0', 'fliplr=0.5', 'mosaic=1.0', 'mixup=0.0', 'copy_paste=0.0', 'cfg=None', 'v5loader=False', 'tracker=botsort.yaml']' returned non-zero exit status 1.

It seems the issue arises when my Python file objectDetection.py is invoked again by the DDP command and passed the model training arguments, which leads to objectDetection.py: error: unrecognized arguments:.

Any inputs on this @glenn-jocher? I'm assuming that it's an issue with how the subprocesses are being called.

For reference here is my objectDetection.py:

import sys

from ultralytics import YOLO


class CustomObjectDetection:
    '''Object Detection with YOLOv8'''

    def __init__(self,args,logging):
        '''Constructor'''

        # Model parameters
        self.epochs = args.epochs
        self.batch = args.batch_size

        # Set logging
        self.logging = logging

        # Load the model
        self.model = YOLO('yolov8m.pt')
        self.model.to('cuda')

    def train(self, data, image_size):
        '''
        Uses the Ultralytics Python API to train the model.
        '''

        results = self.model.train(
            data=data,
            imgsz=image_size,
            epochs=self.epochs,
            batch=self.batch,
            name='yolov8m_custom'
        )

if __name__ == "__main__":
    import logging
    # Parse arguments (parseArgs() is not shown in this post)
    args = parseArgs()

    if not args.verbose:
        logging.basicConfig(stream=sys.stdout, level=logging.ERROR)
    else:
        logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

    # Initialize object
    cobj = CustomObjectDetection(args,logging)
    cobj.train(args.yaml_path,640)

I'm using torch-2.0.0+cu117 and Python-3.8.10
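
The "unrecognized arguments" error and the exit code 2 in the log above are standard argparse behaviour when `parse_args` meets arguments it does not know. The sketch below reproduces that without any GPU. The parser is a hypothetical reconstruction from the usage line in the log (the real parseArgs() is not shown, and the types and defaults here are placeholders). It also shows that `parse_known_args` returns the unknown key=value items instead of exiting. Whether tolerating them is enough for the relaunched script to train correctly is not confirmed anywhere in this thread.

```python
import argparse


def build_parser():
    # Hypothetical reconstruction based on the usage line printed in the log;
    # option types and defaults are placeholders, not the poster's real values.
    parser = argparse.ArgumentParser(prog="objectDetection.py")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("--yaml_path", default=None)
    return parser


# A few of the key=value items that the DDP command appends after the script path.
ddp_args = ["task=detect", "mode=train", "model=yolov8m.pt"]


def strict_exit_code(argv):
    """Return the exit code parse_args() would terminate the process with."""
    try:
        build_parser().parse_args(argv)
    except SystemExit as exc:  # argparse prints usage to stderr, then exits
        return exc.code
    return 0


def tolerant_parse(argv):
    """Parse known options and hand back everything else untouched."""
    return build_parser().parse_known_args(argv)


if __name__ == "__main__":
    print("parse_args exit code:", strict_exit_code(ddp_args))
    args, extra = tolerant_parse(ddp_args)
    print("known:", args)
    print("left over:", extra)
```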

bzisl commented 1 year ago

I've tried both with a virtualenv and with the official Docker container, and I got the same results.

Harsh188 commented 1 year ago

I've tried both with a virtualenv and with the official Docker container, and I got the same results.

Yep, I was able to reproduce the error using the Ultralytics docker container as well:

Ultralytics YOLOv8.0.56 🚀 Python-3.10.8 torch-1.13.1 CUDA:0 (NVIDIA GeForce RTX 3060, 12042MiB)
                                                      CUDA:1 (NVIDIA GeForce RTX 3060, 12044MiB)
yolo/engine/trainer: task=detect, mode=train, model=yolov8m.pt, data=/PCWSeverityScoring-poc/data/Yolo/TotalDataset.yaml, epochs=10, patience=50, batch=16, imgsz=640, save=True, save_period=-1, cache=False, device=[0, 1], workers=8, project=None, name=yolov8m_custom, exist_ok=False, pretrained=False, optimizer=SGD, verbose=True, seed=0, deterministic=True, single_cls=False, image_weights=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, show=False, save_txt=False, save_conf=False, save_crop=False, hide_labels=False, hide_conf=False, vid_stride=1, line_thickness=3, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, boxes=True, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, fl_gamma=0.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0, cfg=None, v5loader=False, tracker=botsort.yaml, save_dir=runs/detect/yolov8m_custom21
Overriding model.yaml nc=80 with nc=8

                   from  n    params  module                                       arguments                     
  0                  -1  1      1392  ultralytics.nn.modules.Conv                  [3, 48, 3, 2]                 
  1                  -1  1     41664  ultralytics.nn.modules.Conv                  [48, 96, 3, 2]                
  2                  -1  2    111360  ultralytics.nn.modules.C2f                   [96, 96, 2, True]             
  3                  -1  1    166272  ultralytics.nn.modules.Conv                  [96, 192, 3, 2]               
  4                  -1  4    813312  ultralytics.nn.modules.C2f                   [192, 192, 4, True]           
  5                  -1  1    664320  ultralytics.nn.modules.Conv                  [192, 384, 3, 2]              
  6                  -1  4   3248640  ultralytics.nn.modules.C2f                   [384, 384, 4, True]           
  7                  -1  1   1991808  ultralytics.nn.modules.Conv                  [384, 576, 3, 2]              
  8                  -1  2   3985920  ultralytics.nn.modules.C2f                   [576, 576, 2, True]           
  9                  -1  1    831168  ultralytics.nn.modules.SPPF                  [576, 576, 5]                 
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 11             [-1, 6]  1         0  ultralytics.nn.modules.Concat                [1]                           
 12                  -1  2   1993728  ultralytics.nn.modules.C2f                   [960, 384, 2]                 
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 14             [-1, 4]  1         0  ultralytics.nn.modules.Concat                [1]                           
 15                  -1  2    517632  ultralytics.nn.modules.C2f                   [576, 192, 2]                 
 16                  -1  1    332160  ultralytics.nn.modules.Conv                  [192, 192, 3, 2]              
 17            [-1, 12]  1         0  ultralytics.nn.modules.Concat                [1]                           
 18                  -1  2   1846272  ultralytics.nn.modules.C2f                   [576, 384, 2]                 
 19                  -1  1   1327872  ultralytics.nn.modules.Conv                  [384, 384, 3, 2]              
 20             [-1, 9]  1         0  ultralytics.nn.modules.Concat                [1]                           
 21                  -1  2   4207104  ultralytics.nn.modules.C2f                   [960, 576, 2]                 
 22        [15, 18, 21]  1   3780328  ultralytics.nn.modules.Detect                [8, [192, 384, 576]]          
Model summary: 295 layers, 25860952 parameters, 25860936 gradients, 79.1 GFLOPs

Transferred 469/475 items from pretrained weights
Running DDP command ['/opt/conda/bin/python3', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '35675', '/PCWSeverityScoring-poc/severityModel/sceneBased/objectDetection.py', 'task=detect', 'mode=train', 'model=yolov8m.pt', 'data=/PCWSeverityScoring-poc/data/Yolo/TotalDataset.yaml', 'epochs=10', 'patience=50', 'batch=16', 'imgsz=640', 'save=True', 'save_period=-1', 'cache=False', 'device=[0, 1]', 'workers=8', 'project=None', 'name=yolov8m_custom', 'exist_ok=False', 'pretrained=False', 'optimizer=SGD', 'verbose=True', 'seed=0', 'deterministic=True', 'single_cls=False', 'image_weights=False', 'rect=False', 'cos_lr=False', 'close_mosaic=10', 'resume=False', 'overlap_mask=True', 'mask_ratio=4', 'dropout=0.0', 'val=True', 'split=val', 'save_json=False', 'save_hybrid=False', 'conf=None', 'iou=0.7', 'max_det=300', 'half=False', 'dnn=False', 'plots=True', 'source=None', 'show=False', 'save_txt=False', 'save_conf=False', 'save_crop=False', 'hide_labels=False', 'hide_conf=False', 'vid_stride=1', 'line_thickness=3', 'visualize=False', 'augment=False', 'agnostic_nms=False', 'classes=None', 'retina_masks=False', 'boxes=True', 'format=torchscript', 'keras=False', 'optimize=False', 'int8=False', 'dynamic=False', 'simplify=False', 'opset=None', 'workspace=4', 'nms=False', 'lr0=0.01', 'lrf=0.01', 'momentum=0.937', 'weight_decay=0.0005', 'warmup_epochs=3.0', 'warmup_momentum=0.8', 'warmup_bias_lr=0.1', 'box=7.5', 'cls=0.5', 'dfl=1.5', 'fl_gamma=0.0', 'label_smoothing=0.0', 'nbs=64', 'hsv_h=0.015', 'hsv_s=0.7', 'hsv_v=0.4', 'degrees=0.0', 'translate=0.1', 'scale=0.5', 'shear=0.0', 'perspective=0.0', 'flipud=0.0', 'fliplr=0.5', 'mosaic=1.0', 'mixup=0.0', 'copy_paste=0.0', 'cfg=None', 'v5loader=False', 'tracker=botsort.yaml']
usage: objectDetection.py [-h] [--epochs -E] [--batch_size -B] [--verbose] [--yaml_path YAML_PATH]
usage: objectDetection.py [-h] [--epochs -E] [--batch_size -B] [--verbose] [--yaml_path YAML_PATH]
objectDetection.py: error: unrecognized arguments: task=detect mode=train model=yolov8m.pt data=/PCWSeverityScoring-poc/data/Yolo/TotalDataset.yaml epochs=10 patience=50 batch=16 imgsz=640 save=True save_period=-1 cache=False device=[0, 1] workers=8 project=None name=yolov8m_custom exist_ok=False pretrained=False optimizer=SGD verbose=True seed=0 deterministic=True single_cls=False image_weights=False rect=False cos_lr=False close_mosaic=10 resume=False overlap_mask=True mask_ratio=4 dropout=0.0 val=True split=val save_json=False save_hybrid=False conf=None iou=0.7 max_det=300 half=False dnn=False plots=True source=None show=False save_txt=False save_conf=False save_crop=False hide_labels=False hide_conf=False vid_stride=1 line_thickness=3 visualize=False augment=False agnostic_nms=False classes=None retina_masks=False boxes=True format=torchscript keras=False optimize=False int8=False dynamic=False simplify=False opset=None workspace=4 nms=False lr0=0.01 lrf=0.01 momentum=0.937 weight_decay=0.0005 warmup_epochs=3.0 warmup_momentum=0.8 warmup_bias_lr=0.1 box=7.5 cls=0.5 dfl=1.5 fl_gamma=0.0 label_smoothing=0.0 nbs=64 hsv_h=0.015 hsv_s=0.7 hsv_v=0.4 degrees=0.0 translate=0.1 scale=0.5 shear=0.0 perspective=0.0 flipud=0.0 fliplr=0.5 mosaic=1.0 mixup=0.0 copy_paste=0.0 cfg=None v5loader=False tracker=botsort.yaml
objectDetection.py: error: unrecognized arguments: task=detect mode=train model=yolov8m.pt data=/PCWSeverityScoring-poc/data/Yolo/TotalDataset.yaml epochs=10 patience=50 batch=16 imgsz=640 save=True save_period=-1 cache=False device=[0, 1] workers=8 project=None name=yolov8m_custom exist_ok=False pretrained=False optimizer=SGD verbose=True seed=0 deterministic=True single_cls=False image_weights=False rect=False cos_lr=False close_mosaic=10 resume=False overlap_mask=True mask_ratio=4 dropout=0.0 val=True split=val save_json=False save_hybrid=False conf=None iou=0.7 max_det=300 half=False dnn=False plots=True source=None show=False save_txt=False save_conf=False save_crop=False hide_labels=False hide_conf=False vid_stride=1 line_thickness=3 visualize=False augment=False agnostic_nms=False classes=None retina_masks=False boxes=True format=torchscript keras=False optimize=False int8=False dynamic=False simplify=False opset=None workspace=4 nms=False lr0=0.01 lrf=0.01 momentum=0.937 weight_decay=0.0005 warmup_epochs=3.0 warmup_momentum=0.8 warmup_bias_lr=0.1 box=7.5 cls=0.5 dfl=1.5 fl_gamma=0.0 label_smoothing=0.0 nbs=64 hsv_h=0.015 hsv_s=0.7 hsv_v=0.4 degrees=0.0 translate=0.1 scale=0.5 shear=0.0 perspective=0.0 flipud=0.0 fliplr=0.5 mosaic=1.0 mixup=0.0 copy_paste=0.0 cfg=None v5loader=False tracker=botsort.yaml
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 21) of binary: /opt/conda/bin/python3
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/PCWSeverityScoring-poc/severityModel/sceneBased/objectDetection.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-03-24_09:22:30
  host      : d229d18e66ed
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 22)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-24_09:22:30
  host      : d229d18e66ed
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 21)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Traceback (most recent call last):
  File "/PCWSeverityScoring-poc/severityModel/sceneBased/objectDetection.py", line 105, in <module>
    cobj.train(args.yaml_path,640)
  File "/PCWSeverityScoring-poc/severityModel/sceneBased/objectDetection.py", line 84, in train
    results = self.model.train(
  File "/opt/conda/lib/python3.10/site-packages/ultralytics/yolo/engine/model.py", line 326, in train
    self.trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/ultralytics/yolo/engine/trainer.py", line 182, in train
    raise e
  File "/opt/conda/lib/python3.10/site-packages/ultralytics/yolo/engine/trainer.py", line 180, in train
    subprocess.run(cmd, check=True)
  File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/opt/conda/bin/python3', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '35675', '/PCWSeverityScoring-poc/severityModel/sceneBased/objectDetection.py', 'task=detect', 'mode=train', 'model=yolov8m.pt', 'data=/PCWSeverityScoring-poc/data/Yolo/TotalDataset.yaml', 'epochs=10', 'patience=50', 'batch=16', 'imgsz=640', 'save=True', 'save_period=-1', 'cache=False', 'device=[0, 1]', 'workers=8', 'project=None', 'name=yolov8m_custom', 'exist_ok=False', 'pretrained=False', 'optimizer=SGD', 'verbose=True', 'seed=0', 'deterministic=True', 'single_cls=False', 'image_weights=False', 'rect=False', 'cos_lr=False', 'close_mosaic=10', 'resume=False', 'overlap_mask=True', 'mask_ratio=4', 'dropout=0.0', 'val=True', 'split=val', 'save_json=False', 'save_hybrid=False', 'conf=None', 'iou=0.7', 'max_det=300', 'half=False', 'dnn=False', 'plots=True', 'source=None', 'show=False', 'save_txt=False', 'save_conf=False', 'save_crop=False', 'hide_labels=False', 'hide_conf=False', 'vid_stride=1', 'line_thickness=3', 'visualize=False', 'augment=False', 'agnostic_nms=False', 'classes=None', 'retina_masks=False', 'boxes=True', 'format=torchscript', 'keras=False', 'optimize=False', 'int8=False', 'dynamic=False', 'simplify=False', 'opset=None', 'workspace=4', 'nms=False', 'lr0=0.01', 'lrf=0.01', 'momentum=0.937', 'weight_decay=0.0005', 'warmup_epochs=3.0', 'warmup_momentum=0.8', 'warmup_bias_lr=0.1', 'box=7.5', 'cls=0.5', 'dfl=1.5', 'fl_gamma=0.0', 'label_smoothing=0.0', 'nbs=64', 'hsv_h=0.015', 'hsv_s=0.7', 'hsv_v=0.4', 'degrees=0.0', 'translate=0.1', 'scale=0.5', 'shear=0.0', 'perspective=0.0', 'flipud=0.0', 'fliplr=0.5', 'mosaic=1.0', 'mixup=0.0', 'copy_paste=0.0', 'cfg=None', 'v5loader=False', 'tracker=botsort.yaml']' returned non-zero exit status 1.

glenn-jocher commented 1 year ago

@Harsh188 DDP works correctly. We are constantly training DDP models in our Docker image on A100 servers, so there may be a problem with your custom Python file.

Harsh188 commented 1 year ago

@glenn-jocher

Yep, it looks like the code fails because model.train() was called from within a Python class.

I rewrote my python script without any classes or methods and it seems to work fine. I tried to look into the source code to figure out how DDP is being called but couldn't really pinpoint the error. I'll try to give it another look tomorrow :)

glenn-jocher commented 1 year ago

@Harsh188 we verify DDP in the Python console itself (and in the CLI), but we don't have an example Python script. You may be running into problems due to the structure of your file, but I can't really say.

kbratsy commented 1 year ago

I used Ultralytics YOLOv8.0.58 🚀 Python-3.8.10 torch-2.0.0 versions. CUDA:0 (Tesla P100-16GB, 16281MiB) CUDA:1 (Tesla P100-16GB, 16281MiB)

from ultralytics import YOLO
import cv2
from IPython.display import display, Image
import torch

model = YOLO("yolov8x.pt") 

results = model.train(data="../data.yaml", epochs=25, imgsz=640,device='0,1', batch=32, pretrained=True, scale=0.2, translate=0.1) 
results = model.val() 
model.export()

When I ran the script above, I got errors similar to the ones reported earlier in this thread. How can I solve this?

Running DDP command ['.../envs/yolov8/bin/python', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '36503', '.../demo_debug.py', 'task=detect', 'mode=train', 'model=yolov8x.pt', 'data=.../data.yaml', 'epochs=2', 'patience=50', 'batch=32', 'imgsz=640', 'save=True', 'save_period=-1', 'cache=False', 'device=0,1', 'workers=8', 'project=None', 'name=None', 'exist_ok=False', 'pretrained=True', 'optimizer=SGD', 'verbose=True', 'seed=0', 'deterministic=True', 'single_cls=False', 'image_weights=False', 'rect=False', 'cos_lr=False', 'close_mosaic=10', 'resume=False', 'amp=True', 'overlap_mask=True', 'mask_ratio=4', 'dropout=0.0', 'val=True', 'split=val', 'save_json=False', 'save_hybrid=False', 'conf=None', 'iou=0.7', 'max_det=300', 'half=False', 'dnn=False', 'plots=True', 'source=None', 'show=False', 'save_txt=False', 'save_conf=False', 'save_crop=False', 'hide_labels=False', 'hide_conf=False', 'vid_stride=1', 'line_thickness=3', 'visualize=False', 'augment=False', 'agnostic_nms=False', 'classes=None', 'retina_masks=False', 'boxes=True', 'format=torchscript', 'keras=False', 'optimize=False', 'int8=False', 'dynamic=False', 'simplify=False', 'opset=None', 'workspace=4', 'nms=False', 'lr0=0.01', 'lrf=0.01', 'momentum=0.937', 'weight_decay=0.0005', 'warmup_epochs=3.0', 'warmup_momentum=0.8', 'warmup_bias_lr=0.1', 'box=7.5', 'cls=0.5', 'dfl=1.5', 'fl_gamma=0.0', 'label_smoothing=0.0', 'nbs=64', 'hsv_h=0.015', 'hsv_s=0.7', 'hsv_v=0.4', 'degrees=0.0', 'translate=0.1', 'scale=0.2', 'shear=0.0', 'perspective=0.0', 'flipud=0.0', 'fliplr=0.5', 'mosaic=1.0', 'mixup=0.0', 'copy_paste=0.0', 'cfg=None', 'v5loader=False', 'tracker=botsort.yaml'] WARNING:main:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

TypeError: 'IterableSimpleNamespace' object is not subscriptable

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 95015) of binary: .../envs/yolov8/bin/python

ZZHOO1 commented 1 year ago

Hello, has this problem been solved? I have the same problem as you, @Harsh188.

191086 commented 1 year ago

You should run the container with "--ipc=host".

github-actions[bot] commented 1 year ago

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Docs: https://docs.ultralytics.com
HUB: https://hub.ultralytics.com
Community: https://community.ultralytics.com

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

jokober commented 11 months ago

I had a similar problem. I solved it by removing all ArgumentParser arguments and hardcoding the values into the Python script I launch. It worked.

In your case it looks like the python script stage4.py requires the argument -i/--input. Try to remove that argument.

I have to say that I have no clue about why this solves the problem...
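
A possible explanation, judging only from the logs earlier in this thread: the DDP launcher re-runs the user's own script and appends YOLO-style key=value arguments, and the script's argparse parser then exits before any training starts. The sketch below reproduces that exit in isolation. The parser is a hypothetical stand-in for a script like stage4.py, not its real code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the user's script: one required option.
    parser = argparse.ArgumentParser(prog="stage4.py")
    parser.add_argument("-i", "--input", required=True)
    return parser


def exit_code_for(argv):
    """Return the exit status argparse would end the process with, or 0."""
    try:
        build_parser().parse_args(argv)
    except SystemExit as exc:
        # argparse reports usage errors by raising SystemExit.
        return exc.code
    return 0


# Arguments in the style the launcher appended in the logs above.
ddp_style_argv = ["task=detect", "mode=train", "model=yolov8n.pt"]

print(exit_code_for(ddp_style_argv))     # relaunch without -i: usage error
print(exit_code_for(["-i", "images/"]))  # normal invocation parses cleanly
```

The usage error status is the same value reported as "exitcode: 2" in the torch.distributed.elastic output above, which is consistent with the child processes dying inside argparse rather than inside YOLO.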

glenn-jocher commented 11 months ago

@jokober hello! Thank you for your contribution and for sharing your solution.

The error you're seeing seems to be due to the -i/--input argument that is required by the script. Upon launch, the script 'stage4.py' is expecting an input value which it's not able to find.

You're right in stating that you've managed to fix this by hardcoding the values directly into your Python script, thereby eliminating the need for argparse. This is one way to resolve the issue, but it might not be the most flexible one, especially if you are working with different inputs.

In a more dynamic scenario where you'd need to pass different arguments, one workaround could be to ensure all necessary arguments are supplied when running the script, or to give arguments default values within the script, so it doesn't fail when an argument isn't provided.
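
As a minimal sketch of the "default values" idea, assuming the script uses argparse: give every option a default and parse with parse_known_args(), which returns unrecognized tokens instead of exiting. The option names and the "data.yaml" default are hypothetical. Whether tolerating the extra arguments is enough for DDP training to succeed on a given Ultralytics version is not confirmed in this thread.

```python
import argparse


def parse_cli(argv=None):
    parser = argparse.ArgumentParser()
    # Defaults instead of required=True, so a bare relaunch still parses.
    parser.add_argument("-i", "--input", default="data.yaml")  # hypothetical default
    parser.add_argument("--epochs", type=int, default=10)
    # parse_known_args() returns (namespace, leftovers) rather than
    # exiting when it meets tokens it does not recognize.
    args, extra = parser.parse_known_args(argv)
    return args, extra


args, extra = parse_cli(["--epochs", "5", "task=detect", "mode=train"])
print(args.input, args.epochs)
print(extra)
```

With this pattern the script's own options keep working from the command line, while tokens such as task=detect end up in the leftover list where the script can ignore them.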

While the exact root of the problem isn't clear without more context, it most likely has to do with how the script is called and how it's handling (or not handling) the arguments.

Again, thank you for your input. Hopefully this information can help to improve the experience for other users in future!

pramanik2289 commented 1 month ago

[ERROR] [launch.py:325:sigkill_handler] ['/home/ubuntu/gen_ai_models/gpt_plus/bin/python', '-u', '/home/ubuntu/gen_ai_models/VideoGPT-plus/videogpt_plus/train.py', '--local_rank=3', '--lora_enable', 'True', '--lora_r', '128', '--lora_alpha', '256', '--mm_projector_lr', '2e-5', '--deepspeed', '/home/ubuntu/gen_ai_models/VideoGPT-plus/scripts/zero3.json', '--model_name_or_path', '/home/ubuntu/gen_ai_models/Phi-3-mini-4k-instruct', '--version', 'phi3_instruct', '--dataset_use', 'FINETUNING', '--vision_tower', '/home/ubuntu/gen_ai_models/InternVideo2-Stage2_1B-224p-f4', '--image_vision_tower', '/home/ubuntu/gen_ai_models/clip-vit-large-patch14-336', '--mm_projector_type', 'mlp2x_gelu', '--image_mm_projector_type', 'mlp2x_gelu', '--pretrain_mm_mlp_adapter', '/home/ubuntu/Downloads/VideoGPT-plus_Phi3-mini-4k_Pretrain/mlp2x_gelu_internvideo2/mm_projector.bin', '--pretrain_image_mm_mlp_adapter', '/home/ubuntu/Downloads/VideoGPT-plus_Phi3-mini-4k_Pretrain/mlp2x_gelu_clip_l14_336px/mm_projector.bin', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--image_aspect_ratio', 'pad', '--group_by_modality_length', 'True', '--bf16', 'True', '--output_dir', '/home/ubuntu/Downloads/results/videogpt_plus_finetune', '--num_train_epochs', '1', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '2', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '50000', '--save_total_limit', '1', '--learning_rate', '2e-4', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '4096', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '4', '--lazy_preprocess', 'True', '--report_to', 'none'] exits with return code = -11

Why does this error occur?

glenn-jocher commented 1 month ago

Hello,

Thank you for reaching out and providing detailed information about the error you're encountering. The exit code -11 typically indicates a segmentation fault, which can be caused by various issues such as memory access violations, hardware limitations, or software bugs.
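
To illustrate how a code like -11 can be read, here is a small sketch. It assumes the launcher reports return codes the way Python's subprocess module does on POSIX, where a negative value -N means the child was terminated by signal N. The function name is hypothetical.

```python
import signal


def describe_return_code(code: int) -> str:
    """Turn a subprocess-style return code into a readable description."""
    if code < 0:
        # Negative codes encode the number of the terminating signal.
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = f"signal {-code}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(describe_return_code(-11))  # signal 11 is SIGSEGV on Linux
print(describe_return_code(2))
```

A SIGSEGV originates in native code (for example a compiled extension or driver library), so the Python traceback alone usually does not show the faulting call.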

To better assist you, could you please provide a minimal reproducible example of your code? This will help us diagnose the issue more effectively. You can find guidelines for creating a minimal reproducible example here: Minimum Reproducible Example.

Additionally, please ensure that you are using the latest versions of all relevant packages, including Ultralytics YOLO, PyTorch, and any other dependencies. Sometimes, updating to the latest versions can resolve unexpected issues.

If the problem persists after updating and providing a reproducible example, we can delve deeper into potential causes and solutions.

Looking forward to your response!

Eslam21 commented 1 week ago

I am training on Kaggle using T4 x2 GPUs. I get the same error when I pass the device parameter as a list, like this: results = model.train(data='path_of_data/data.yaml', epochs=500, device=[0,1])

Instead, I passed it as a quoted string rather than a list, and it worked: results = model.train(data='path_of_data/data.yaml', epochs=500, device="0,1")
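
For scripts that build the device setting programmatically, a small helper can convert either form to the comma-separated string that worked here. This is a hypothetical convenience function, not part of the Ultralytics API, and whether the list form fails at all may depend on the installed version.

```python
def device_to_str(device) -> str:
    """Normalize a device spec such as [0, 1], (0, 1), 0 or "0,1" to a string."""
    if isinstance(device, (list, tuple)):
        # Join GPU indices with commas, e.g. [0, 1] becomes "0,1".
        return ",".join(str(d) for d in device)
    return str(device)


print(device_to_str([0, 1]))
print(device_to_str("0,1"))
```

The result could then be passed as device=device_to_str([0, 1]) in the model.train() call shown above.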

glenn-jocher commented 1 week ago

Thank you for sharing your experience. Using quotation marks for the device parameter is indeed the correct approach. If you encounter any further issues, please ensure you're using the latest version of the package.
