ultralytics / yolov5

YOLOv5 πŸš€ in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:30:00) #4574

Closed Darshcg closed 3 years ago

Darshcg commented 3 years ago

Hi @glenn-jocher ,

I am trying to train YOLOv5m on a custom dataset using Multi-GPU training. But when I run the command: sudo python3 -m torch.distributed.run --nproc_per_node 4 train.py --batch 32 --data coco128.yaml --weights yolov5m.pt --device 0,1,2,3 it continuously prints: Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:30:00) and training never starts.

Here is the detailed log from running sudo python3 -m torch.distributed.run --nproc_per_node 4 train.py --batch 32 --data coco128.yaml --weights yolov5m.pt --device 0,1,2,3:

[INFO] 2021-08-28 11:28:56,912 run: Running torch.distributed.run with args: ['/home/darshan/.local/lib/python3.6/site-packages/torch/distributed/run.py', '--nproc_per_node', '4', 'train.py', '--batch', '32', '--data', 'coco128.yaml', '--weights', 'yolov5m.pt', '--device', '0,1,2,3']
[INFO] 2021-08-28 11:28:56,914 run: Using nproc_per_node=4.
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[INFO] 2021-08-28 11:28:56,914 api: Starting elastic_operator with launch configs:
  entrypoint       : train.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 4
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:29500
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

[INFO] 2021-08-28 11:28:56,918 local_elastic_agent: log directory set to: /tmp/torchelastic__j4nx57v/none_enejjrp0
[INFO] 2021-08-28 11:28:56,918 api: [default] starting workers for entrypoint: python3
[INFO] 2021-08-28 11:28:56,918 api: [default] Rendezvous'ing worker group
[INFO] 2021-08-28 11:28:56,918 static_tcp_rendezvous: Creating TCPStore as the c10d::Store implementation
/home/darshan/.local/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:53: FutureWarning: This is an experimental API and will be changed in future.
  "This is an experimental API and will be changed in future.", FutureWarning
[INFO] 2021-08-28 11:28:56,921 api: [default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3]
  role_ranks=[0, 1, 2, 3]
  global_ranks=[0, 1, 2, 3]
  role_world_sizes=[4, 4, 4, 4]
  global_world_sizes=[4, 4, 4, 4]

[INFO] 2021-08-28 11:28:56,922 api: [default] Starting worker group
[INFO] 2021-08-28 11:28:56,922 __init__: Setting worker0 reply file to: /tmp/torchelastic__j4nx57v/none_enejjrp0/attempt_0/0/error.json
[INFO] 2021-08-28 11:28:56,922 __init__: Setting worker1 reply file to: /tmp/torchelastic__j4nx57v/none_enejjrp0/attempt_0/1/error.json
[INFO] 2021-08-28 11:28:56,922 __init__: Setting worker2 reply file to: /tmp/torchelastic__j4nx57v/none_enejjrp0/attempt_0/2/error.json
[INFO] 2021-08-28 11:28:56,922 __init__: Setting worker3 reply file to: /tmp/torchelastic__j4nx57v/none_enejjrp0/attempt_0/3/error.json
train: weights=yolov5m.pt, cfg=, data=coco128.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=300, batch_size=32, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1,2,3, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=0, project=runs/train, entity=None, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias=latest, local_rank=-1, freeze=0
github: up to date with https://github.com/ultralytics/yolov5 βœ…
YOLOv5 πŸš€ v5.0-383-g8b18b66 torch 1.9.0+cu102 CUDA:0 (Tesla V100-SXM2-16GB, 16160.5MB)
                                              CUDA:1 (Tesla V100-SXM2-16GB, 16160.5MB)
                                              CUDA:2 (Tesla V100-SXM2-16GB, 16160.5MB)
                                              CUDA:3 (Tesla V100-SXM2-16GB, 16160.5MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for 4 nodes.
hyperparameters: lr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
wandb: Currently logged in as: darshancg (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.12.1
wandb: Syncing run vocal-snowball-78
wandb: ⭐ View project at https://wandb.ai/darshancg/YOLOv5
wandb: πŸš€ View run at https://wandb.ai/darshancg/YOLOv5/runs/wz5ihs2g
wandb: Run data is saved locally in /home/darshan/yolov5/wandb/run-20210828_112901-wz5ihs2g
wandb: Run `wandb offline` to turn off syncing.

Overriding model.yaml nc=80 with nc=55

                 from  n    params  module                                  arguments                     
  0                -1  1      5280  models.common.Focus                     [3, 48, 3]                    
  1                -1  1     41664  models.common.Conv                      [48, 96, 3, 2]                
  2                -1  2     65280  models.common.C3                        [96, 96, 2]                   
  3                -1  1    166272  models.common.Conv                      [96, 192, 3, 2]               
  4                -1  6    629760  models.common.C3                        [192, 192, 6]                 
  5                -1  1    664320  models.common.Conv                      [192, 384, 3, 2]              
  6                -1  6   2512896  models.common.C3                        [384, 384, 6]                 
  7                -1  1   2655744  models.common.Conv                      [384, 768, 3, 2]              
  8                -1  1   1476864  models.common.SPP                       [768, 768, [5, 9, 13]]        
  9                -1  2   4134912  models.common.C3                        [768, 768, 2, False]          
 10                -1  1    295680  models.common.Conv                      [768, 384, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  2   1182720  models.common.C3                        [768, 384, 2, False]          
 14                -1  1     74112  models.common.Conv                      [384, 192, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  2    296448  models.common.C3                        [384, 192, 2, False]          
 18                -1  1    332160  models.common.Conv                      [192, 192, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  2   1035264  models.common.C3                        [384, 384, 2, False]          
 21                -1  1   1327872  models.common.Conv                      [384, 384, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  2   4134912  models.common.C3                        [768, 768, 2, False]          
 24      [17, 20, 23]  1    242460  models.yolo.Detect                      [55, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [192, 384, 768]]
Model Summary: 391 layers, 21274620 parameters, 21274620 gradients, 51.1 GFLOPs

Transferred 500/506 items from yolov5m.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 83 weight, 86 weight (no decay), 86 bias
train: Scanning '/mnt/disks/dataset/data/train/labels/01e7532678224dd7a9203bb34e24f38b.cache' images and labels... 7050000 found, 20000 missing, 143 empty, 0 corrupted: 100%|β–ˆ| 7070000/7070000 [00:00<?, ?
val: Scanning '/mnt/disks/dataset/data/valid/labels/03f6b3c419854a2cadcabf838762e6fe.cache' images and labels... 99000 found, 0 missing, 1 empty, 0 corrupted: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 99000/99000 [00:00<?, ?it/s]
Plotting labels... 
[ERROR] 2021-08-28 11:39:34,946 api: failed (exitcode: -9) local_rank: 0 (pid: 28175) of binary: /usr/bin/python3
[ERROR] 2021-08-28 11:39:34,947 local_elastic_agent: [default] Worker group failed
[INFO] 2021-08-28 11:39:34,947 api: [default] Worker group FAILED. 3/3 attempts left; will restart worker group
[INFO] 2021-08-28 11:39:34,947 api: [default] Stopping worker group
[INFO] 2021-08-28 11:39:34,947 api: [default] Rendezvous'ing worker group
[INFO] 2021-08-28 11:39:34,947 static_tcp_rendezvous: Creating TCPStore as the c10d::Store implementation
[INFO] 2021-08-28 11:39:34,959 api: [default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3]
  role_ranks=[0, 1, 2, 3]
  global_ranks=[0, 1, 2, 3]
  role_world_sizes=[4, 4, 4, 4]
  global_world_sizes=[4, 4, 4, 4]

[INFO] 2021-08-28 11:39:34,960 api: [default] Starting worker group
[INFO] 2021-08-28 11:39:34,963 __init__: Setting worker0 reply file to: /tmp/torchelastic__j4nx57v/none_enejjrp0/attempt_1/0/error.json
[INFO] 2021-08-28 11:39:34,963 __init__: Setting worker1 reply file to: /tmp/torchelastic__j4nx57v/none_enejjrp0/attempt_1/1/error.json
[INFO] 2021-08-28 11:39:34,963 __init__: Setting worker2 reply file to: /tmp/torchelastic__j4nx57v/none_enejjrp0/attempt_1/2/error.json
[INFO] 2021-08-28 11:39:34,963 __init__: Setting worker3 reply file to: /tmp/torchelastic__j4nx57v/none_enejjrp0/attempt_1/3/error.json
/usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 6 leaked semaphores to clean up at shutdown
  len(cache))
train: weights=yolov5m.pt, cfg=, data=coco128.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=300, batch_size=32, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1,2,3, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=0, project=runs/train, entity=None, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias=latest, local_rank=-1, freeze=0
github: up to date with https://github.com/ultralytics/yolov5 βœ…
YOLOv5 πŸš€ v5.0-383-g8b18b66 torch 1.9.0+cu102 CUDA:0 (Tesla V100-SXM2-16GB, 16160.5MB)
                                              CUDA:1 (Tesla V100-SXM2-16GB, 16160.5MB)
                                              CUDA:2 (Tesla V100-SXM2-16GB, 16160.5MB)
                                              CUDA:3 (Tesla V100-SXM2-16GB, 16160.5MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:30:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:30:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:30:00)
[the same message repeats continuously]

Can anyone help me resolve this issue? I am training on 4 V100 GPUs with 16 GB of memory each, using PyTorch 1.9 and CUDA 10.2.

Thanks, Darshan C G

glenn-jocher commented 3 years ago

@Darshcg I recommend running all DDP trainings in our Docker image, with a custom PyTorch install in the image if necessary from https://pytorch.org/get-started/locally/.
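A minimal sketch of that Docker workflow, assuming the ultralytics/yolov5:latest image tag and the flags from the YOLOv5 Docker quickstart (adapt the PyTorch wheel to your CUDA version if you install a custom build):

```bash
# pull the YOLOv5 image and start a container with all GPUs and host IPC
sudo docker pull ultralytics/yolov5:latest
sudo docker run --ipc=host -it --gpus all ultralytics/yolov5:latest

# inside the container, install a specific PyTorch build if needed, then
# launch the same 4-GPU DDP training command
python -m torch.distributed.run --nproc_per_node 4 train.py \
    --batch 32 --data coco128.yaml --weights yolov5m.pt --device 0,1,2,3
```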

github-actions[bot] commented 3 years ago

πŸ‘‹ Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 πŸš€ and Vision AI ⭐!

HeegerGao commented 2 years ago

Hi @Darshcg, I have exactly the same issue. Are there any updates? Thank you!

CanyonWind commented 1 year ago

Try terminating the stale worker processes on all the nodes, then launch the training again.
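A rough sketch of that cleanup on each node; the process pattern and the port are assumptions based on the launch command and the default rendezvous endpoint in the log above, so adapt them to your setup:

```bash
# list leftover training / elastic-agent processes from a previous launch
ps aux | grep -E 'train.py|torch.distributed' | grep -v grep

# terminate them (match on your own training script name)
sudo pkill -9 -f train.py

# confirm nothing is still listening on the default rendezvous port 29500
ss -ltnp | grep 29500 || echo "port 29500 is free"
```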

zhangze1 commented 1 year ago

I also had the same problem. Did you solve it?

glenn-jocher commented 1 year ago

@zhangze1 it seems there may be an issue with the DDP (Distributed Data Parallel) setup in your current configuration. One solution is to terminate the stale workers on all the nodes and launch the training again, as suggested above. Another potential solution is to run training in our Docker image, with a custom PyTorch install in the image if necessary. You can find more information on this here: https://pytorch.org/get-started/locally/.

If these solutions do not work or if you need more help, please provide more details about your setup and the specific error messages you are encountering, and we can try to help you solve the problem.
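One further workaround, consistent with the worker_count=8 vs world_size=4 mismatch in the log (more workers registered on the store than the expected world size), is to relaunch on a fresh master port so the new run cannot attach to a leftover store on 29500. torch.distributed.run exposes a --master_port option for the static rendezvous backend; the port value below is only an example:

```bash
sudo python3 -m torch.distributed.run --nproc_per_node 4 --master_port 29501 \
    train.py --batch 32 --data coco128.yaml --weights yolov5m.pt --device 0,1,2,3
```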