ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Docker Multi-GPU DDP training hang on `destroy_process_group()` with `wandb` option 3 #5160

Closed Yoon5 closed 2 years ago

Yoon5 commented 3 years ago

Hello, when I try to train using multi-GPU based on the Docker image, I get the error below. I am using Ubuntu 18.04 and Python 3.8.

root@5a70a5f2d489:/usr/src/app# python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data data.yaml --weights yolov5s.pt --device 0,1
WARNING:__main__:*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Traceback (most recent call last):
  File "train.py", line 620, in <module>
    main(opt)
  File "train.py", line 497, in main
    check_file(opt.data), check_yaml(opt.cfg), check_yaml(opt.hyp), str(opt.weights), str(opt.project)  # checks
  File "/usr/src/app/utils/general.py", line 326, in check_file
    assert len(files), f'File not found: {file}'  # assert file was found
AssertionError: File not found: data.yaml
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 405) of binary: /opt/conda/bin/python
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning: 

**********************************************************************
               CHILD PROCESS FAILED WITH NO ERROR_FILE                
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 405 (local_rank 1) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

  from torch.distributed.elastic.multiprocessing.errors import record

  @record
  def trainer_main(args):
     # do train
**********************************************************************
  warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 702, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 698, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
***************************************
            train.py FAILED            
=======================================
Root Cause:
[0]:
  time: 2021-10-13_04:30:25
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 405)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
  <NO_OTHER_FAILURES>
***************************************

root@5a70a5f2d489:/usr/src/app#
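
Aside: the launcher's suggestion above, decorating the top-level entrypoint with @record, would look roughly like this (a minimal, hypothetical sketch, not the actual train.py code):

from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    # parse opts and call train() here; with @record, a failing rank writes a
    # per-rank error file instead of "CHILD PROCESS FAILED WITH NO ERROR_FILE"
    ...

if __name__ == "__main__":
    main()
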
glenn-jocher commented 3 years ago

@Yoon5 thanks for the bug report! If you uninstall wandb before training (pip uninstall wandb), or log in to wandb before training (wandb login API_KEY), does this resolve the issue for you?
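
A non-interactive alternative is to pre-set wandb's standard environment variables before launching, for example from Python (an illustrative sketch, assuming stock wandb behaviour, not a YOLOv5-specific option):

import os

os.environ["WANDB_MODE"] = "disabled"      # equivalent to choosing option 3
# or: os.environ["WANDB_API_KEY"] = "..."  # equivalent to running 'wandb login'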

@AyushExel I did some testing and it seems like wandb may be causing issues with DDP. I train all of my DDP models already logged in, but if not logged in and presented with the 1,2,3 options query, training may crash as above, or if training completes the process group is not destroyed and the system hangs. The steps I used to reproduce on a 2-GPU training are here. Can you try to reproduce on your end?

# Pull image
t=ultralytics/yolov5:latest && sudo docker pull $t && sudo docker run -it --ipc=host --gpus all $t

# Train 3 epochs COCO128 with DDP
python -m torch.distributed.launch --nproc_per_node 1 --master_port 2 train.py --data coco128.yaml --epochs 3

The hang looks like this; it seems to occur with wandb installed and enabled:

[Screenshot: Screen Shot 2021-10-12 at 10 16 03 PM, terminal output showing the hang after training completes]

EDIT1: Summary is here:

Yoon5 commented 3 years ago

Thank you, I will try!

Yoon5 commented 3 years ago

I tried the lines above:

# Pull image
t=ultralytics/yolov5:latest && sudo docker pull $t && sudo docker run -it --ipc=host --gpus all $t

# Train 3 epochs COCO128 with DDP
python -m torch.distributed.launch --nproc_per_node 1 --master_port 2 train.py --data coco128.yaml --epochs 3

I do not have wandb; I did not install it in my environment. This is what I got after:

~/Desktop/yolov5-master$ t=ultralytics/yolov5:latest && sudo docker pull $t && sudo docker run -it --ipc=host --gpus all $t
latest: Pulling from ultralytics/yolov5
Digest: sha256:eee5c66aa087376ab6b70b737b6825dcc59bc1059407a6875c8ef627e2e11f9c
Status: Image is up to date for ultralytics/yolov5:latest
docker.io/ultralytics/yolov5:latest

=============
== PyTorch ==
=============

NVIDIA Release 21.05 (build 22595835)
PyTorch Version 1.9.0a0+2ecb2c7

Container image Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.

Copyright (c) 2014-2021 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

NVIDIA Deep Learning Profiler (dlprof) Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

ERROR: This container was built for NVIDIA Driver Release 465.19 or later, but version 460.91.03 was detected and compatibility mode is UNAVAILABLE.

   [[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]]

NOTE: MOFED driver for multi-node communication was not detected. Multi-node communication performance may be reduced.

root@1709fa266811:/usr/src/app# python -m torch.distributed.launch --nproc_per_node 1 --master_port 2 train.py --data coco128.yaml --epochs 3
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) wandb login API_KEY
wandb: WARNING Invalid choice
wandb: Enter your choice: (30 second timeout) 
wandb: W&B disabled due to login timeout.
train: weights=yolov5s.pt, cfg=, data=coco128.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=3, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, patience=100, freeze=0, save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:106: UserWarning: 
GeForce RTX 3080 Ti with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the GeForce RTX 3080 Ti GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
YOLOv5 🚀 v6.0-3-g20a809d torch 1.9.1+cu102 CUDA:0 (GeForce RTX 3080 Ti, 12053.8125MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for 1 nodes.
hyperparameters: lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

WARNING: Dataset not found, nonexistent paths: ['/usr/src/datasets/coco128/images/train2017']
Downloading https://github.com/ultralytics/yolov5/releases/download/v1.0/coco128.zip to coco128.zip...
100%|██████████████████████████████████████| 6.66M/6.66M [00:00<00:00, 8.76MB/s]
Dataset autodownload success, saved to ../datasets

Downloading https://github.com/ultralytics/yolov5/releases/download/v6.0/yolov5s.pt to yolov5s.pt...
100%|██████████████████████████████████████| 14.0M/14.0M [00:04<00:00, 3.64MB/s]

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]              
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  2    115712  models.common.C3                        [128, 128, 2]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  3    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]                 
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]                 
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 270 layers, 7235389 parameters, 7235389 gradients, 16.5 GFLOPs

Traceback (most recent call last):
  File "train.py", line 620, in <module>
    main(opt)
  File "train.py", line 517, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 119, in train
    csd = ckpt['model'].float().state_dict()  # checkpoint state_dict as FP32
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 692, in float
    return self._apply(lambda t: t.float() if t.is_floating_point() else t)
  File "/usr/src/app/models/yolo.py", line 239, in _apply
    self = super()._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 530, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 530, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 530, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 552, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 692, in <lambda>
    return self._apply(lambda t: t.float() if t.is_floating_point() else t)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 362) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 


***************************************
            train.py FAILED            
=======================================
Root Cause:
[0]:
  time: 2021-10-13_05:37:52
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 362)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
  <NO_OTHER_FAILURES>
***************************************
glenn-jocher commented 3 years ago

@Yoon5 before you do anything you need to update your nvidia drivers as your error message states:

ERROR: This container was built for NVIDIA Driver Release 465.19 or later, but
version 460.91.03 was detected and compatibility mode is UNAVAILABLE.

@AyushExel I'm manually pushing a new ultralytics/yolov5:latest image without wandb for now until we can sort this out, so if you pull the image after seeing this message you'll have to pip install wandb in the image to get started testing.

Yoon5 commented 3 years ago

Thank you

AyushExel commented 3 years ago

@glenn-jocher I'm testing this now

AyushExel commented 3 years ago

@glenn-jocher The problem doesn't occur for me. I'm running on 2 T4 GPUs and the program exited fine. I've tried this twice. [Screenshot (182): terminal output of the completed run]

Full trace:

/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) 
wandb: W&B disabled due to login timeout.
train: weights=yolov5s.pt, cfg=, data=coco128.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=3, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, patience=100, freeze=0, save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v6.0-4-gb754525 torch 1.9.1+cu102 CUDA:0 (Tesla T4, 15109.75MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for 1 nodes.
hyperparameters: lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

WARNING: Dataset not found, nonexistent paths: ['/usr/src/datasets/coco128/images/train2017']
Downloading https://github.com/ultralytics/yolov5/releases/download/v1.0/coco128.zip to coco128.zip...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 6.66M/6.66M [00:00<00:00, 123MB/s]
Dataset autodownload success, saved to ../datasets

Downloading https://github.com/ultralytics/yolov5/releases/download/v6.0/yolov5s.pt to yolov5s.pt...
100%|████████████████████████████████████████████████████████████████████████████████████████████████| 14.0M/14.0M [00:00<00:00, 31.5MB/s]

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]              
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  2    115712  models.common.C3                        [128, 128, 2]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  3    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]                 
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]                 
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 270 layers, 7235389 parameters, 7235389 gradients, 16.5 GFLOPs

Transferred 349/349 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight, 60 weight (no decay), 60 bias
train: Scanning '../datasets/coco128/labels/train2017' images and labels...128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00
train: New cache created: ../datasets/coco128/labels/train2017.cache
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/12
Plotting labels... 
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/12
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/12
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/12
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/12
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/12
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/12
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/12
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/12

autoanchor: Analyzing anchors... anchors/target = 4.26, Best Possible Recall (BPR) = 0.9946
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp
Starting training for 3 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       0/2     2.53G   0.04048   0.07431   0.02063       210       640:  12%|███▉                           | 1/8 [00:06<00:44,  6.40s/it]Reducer buckets have been rebuilt in this iteration.
       0/2     6.65G   0.04252   0.06144   0.02085       232       640: 100%|███████████████████████████████| 8/8 [00:08<00:00,  1.02s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|███████████████| 4/4 [00:09<00:00,  2.44s/it]
                 all        128        929      0.671      0.533      0.621      0.407

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       1/2     7.05G   0.04501   0.06472    0.0191       180       640: 100%|███████████████████████████████| 8/8 [00:02<00:00,  4.00it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|███████████████| 4/4 [00:03<00:00,  1.22it/s]
                 all        128        929      0.697      0.536      0.631      0.416

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       2/2     7.05G   0.04566   0.06474   0.02024       305       640: 100%|███████████████████████████████| 8/8 [00:01<00:00,  4.31it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|███████████████| 4/4 [00:03<00:00,  1.20it/s]
                 all        128        929      0.704      0.547      0.633      0.418

3 epochs completed in 0.009 hours.
Optimizer stripped from runs/train/exp/weights/last.pt, 14.9MB
Optimizer stripped from runs/train/exp/weights/best.pt, 14.9MB

Validating runs/train/exp/weights/best.pt...
Fusing layers... 
Model Summary: 213 layers, 7225885 parameters, 0 gradients, 16.5 GFLOPs
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|███████████████| 4/4 [00:05<00:00,  1.28s/it]
                 all        128        929      0.704      0.547      0.634      0.419
              person        128        254      0.807      0.681      0.774      0.508
             bicycle        128          6      0.748      0.496      0.545      0.318
                 car        128         46      0.801      0.351      0.484      0.212
          motorcycle        128          5      0.672        0.6        0.8      0.633
            airplane        128          6          1      0.831      0.995      0.721
                 bus        128          7      0.648      0.714      0.694       0.59
               train        128          3      0.809          1      0.995      0.632
               truck        128         12      0.605       0.25      0.446      0.251
                boat        128          6      0.735      0.333      0.489      0.141
       traffic light        128         14      0.587      0.143      0.238      0.149
           stop sign        128          2      0.627        0.5      0.828      0.663
               bench        128          9      0.908      0.444       0.57      0.243
                bird        128         16      0.888      0.991      0.988      0.652
                 cat        128          4          1      0.729      0.836      0.691
                 dog        128          9      0.916      0.667      0.887      0.547
               horse        128          2      0.705          1      0.995      0.697
            elephant        128         17          1      0.927      0.946      0.696
                bear        128          1       0.52          1      0.995      0.995
               zebra        128          4      0.849          1      0.995      0.952
             giraffe        128          9      0.813      0.778      0.869      0.576
            backpack        128          6          1      0.307      0.452      0.201
            umbrella        128         18      0.724      0.556      0.722      0.394
             handbag        128         19      0.662      0.104      0.167       0.11
                 tie        128          7      0.895      0.571      0.693      0.466
            suitcase        128          4          1      0.992      0.995      0.621
             frisbee        128          5      0.652        0.8      0.798      0.694
                skis        128          1      0.616          1      0.995      0.497
           snowboard        128          7          1      0.705      0.766      0.558
         sports ball        128          6      0.659        0.5      0.622      0.341
                kite        128         10      0.557        0.5      0.557      0.204
        baseball bat        128          4      0.391        0.5      0.275      0.136
      baseball glove        128          7      0.474      0.391      0.327      0.197
          skateboard        128          5      0.754      0.614      0.792      0.557
       tennis racket        128          7      0.536      0.571      0.538      0.299
              bottle        128         18      0.649      0.389      0.484      0.286
          wine glass        128         16      0.771      0.875      0.853      0.397
                 cup        128         36      0.852      0.361      0.493      0.294
                fork        128          6      0.378      0.167      0.252      0.194
               knife        128         16      0.888      0.625      0.667      0.449
               spoon        128         22      0.811      0.391      0.531      0.257
                bowl        128         28       0.75      0.571      0.617      0.461
              banana        128          1          0          0      0.142     0.0142
            sandwich        128          2          0          0     0.0957     0.0743
              orange        128          4          1          0      0.578      0.199
            broccoli        128         11      0.418      0.182      0.314      0.273
              carrot        128         24      0.716      0.542      0.636      0.383
             hot dog        128          2        0.4      0.699      0.497      0.465
               pizza        128          5      0.629          1      0.831      0.603
               donut        128         14      0.692          1      0.952      0.823
                cake        128          4       0.73          1      0.895      0.713
               chair        128         35      0.458      0.486      0.476      0.232
               couch        128          6      0.723      0.333      0.801      0.453
        potted plant        128         14      0.791      0.714      0.806      0.447
                 bed        128          3          1          0      0.746      0.275
        dining table        128         13       0.83      0.462      0.476      0.299
              toilet        128          2      0.456        0.5      0.566      0.496
                  tv        128          2      0.752          1      0.995      0.846
              laptop        128          3          1          0      0.426      0.185
               mouse        128          2          1          0     0.0268     0.0215
              remote        128          8       0.71      0.625      0.635      0.506
          cell phone        128          8      0.599      0.199      0.429      0.202
           microwave        128          3      0.402          1      0.995      0.743
                oven        128          5      0.364        0.4      0.427      0.248
                sink        128          6      0.344      0.167      0.265      0.161
        refrigerator        128          5      0.704        0.8      0.814      0.456
                book        128         29      0.601      0.138      0.294      0.133
               clock        128          9       0.87      0.778       0.91      0.599
                vase        128          2      0.286          1      0.663      0.597
            scissors        128          1          1          0     0.0302    0.00603
          teddy bear        128         21       0.85      0.381      0.613      0.344
          toothbrush        128          5          1      0.477      0.708      0.438
Results saved to runs/train/exp
root@903423287a25:/usr/src/app# 
glenn-jocher commented 3 years ago

@AyushExel oh interesting. Can you try again and manually enter option 3 (Don't visualize my results)?

glenn-jocher commented 3 years ago

@AyushExel also I just noticed in your output that your training is only using 1 GPU. When you use multiple devices they will be listed together. Ah sorry, I see my command to reproduce above was incorrect. This is the correct 2-GPU training command:

python -m torch.distributed.launch --nproc_per_node 2 --master_port 1 train.py --data coco128.yaml --epochs 3 --device 0,1
AyushExel commented 3 years ago

@glenn-jocher thanks. I tried it again. It's not getting stuck. Here's the traceback:

root@6f70428df512:/usr/src/app# python -m torch.distributed.launch --nproc_per_node 2 --master_port 1 train.py --data coco128.yaml --epochs 3 --device 0,1

/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
WARNING:torch.distributed.run:*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) wandb: Enter your choice: (30 second timeout) 
wandb: W&B disabled due to login timeout.
train: weights=yolov5s.pt, cfg=, data=coco128.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=3, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, patience=100, freeze=0, save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v6.0-4-gb754525 torch 1.9.1+cu102 CUDA:0 (Tesla T4, 15109.75MB)
                                            CUDA:1 (Tesla T4, 15109.75MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for 2 nodes.
hyperparameters: lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

WARNING: Dataset not found, nonexistent paths: ['/usr/src/datasets/coco128/images/train2017']
Downloading https://github.com/ultralytics/yolov5/releases/download/v1.0/coco128.zip to coco128.zip...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6.66M/6.66M [00:00<00:00, 92.7MB/s]
Dataset autodownload success, saved to ../datasets

Downloading https://github.com/ultralytics/yolov5/releases/download/v6.0/yolov5s.pt to yolov5s.pt...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14.0M/14.0M [00:00<00:00, 76.4MB/s]

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]              
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  2    115712  models.common.C3                        [128, 128, 2]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  3    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]                 
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]                 
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 270 layers, 7235389 parameters, 7235389 gradients, 16.5 GFLOPs

Transferred 349/349 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight, 60 weight (no decay), 60 bias
train: Scanning '../datasets/coco128/labels/train2017' images and labels...128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00:00<00:00, 738.
train: New cache created: ../datasets/coco128/labels/train2017.cache
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00:00<?, ?
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00:00<?, ?it
Plotting labels... 
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00:00<?, ?it
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00:00<?, ?it
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00:00<?, ?it
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00:00<?, ?it
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00:00<?, ?it
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00:00<?, ?it
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00:00<?, ?it

autoanchor: Analyzing anchors... anchors/target = 4.26, Best Possible Recall (BPR) = 0.9946
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp
Starting training for 3 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       0/2     1.89G   0.04361   0.07621   0.02057       131       640:  12%|█████▊                                        | 1/8 [00:06<00:43,  6.21s/it]Reducer buckets have been rebuilt in this iteration.
       0/2     6.28G   0.04354   0.06285   0.02263        95       640: 100%|██████████████████████████████████████████████| 8/8 [00:07<00:00,  1.07it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████████████████████████| 8/8 [00:08<00:00,  1.09s/it]
                 all        128        929      0.679      0.535      0.621      0.407

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       1/2     6.43G   0.04465   0.06801   0.02394       117       640: 100%|██████████████████████████████████████████████| 8/8 [00:01<00:00,  5.84it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████████████████████████| 8/8 [00:03<00:00,  2.42it/s]
                 all        128        929      0.681      0.548      0.632      0.411

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       2/2     6.43G   0.04337   0.07029   0.02062        91       640: 100%|██████████████████████████████████████████████| 8/8 [00:01<00:00,  6.63it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00:29<?, ?it
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████████████████████████| 8/8 [00:03<00:00,  2.44it/s]
                 all        128        929      0.617      0.598      0.634      0.416

3 epochs completed in 0.008 hours.
Optimizer stripped from runs/train/exp/weights/last.pt, 14.8MB
Optimizer stripped from runs/train/exp/weights/best.pt, 14.8MB

Validating runs/train/exp/weights/best.pt...
Fusing layers... 
Model Summary: 213 layers, 7225885 parameters, 0 gradients, 16.5 GFLOPs
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████████████████████████| 8/8 [00:04<00:00,  1.69it/s]
                 all        128        929      0.616      0.598      0.633      0.416
              person        128        254      0.722       0.74      0.775      0.508
             bicycle        128          6      0.545      0.667      0.561      0.326
                 car        128         46      0.598       0.37      0.462        0.2
          motorcycle        128          5       0.66        0.6      0.812      0.637
            airplane        128          6          1      0.948      0.995      0.764
                 bus        128          7      0.535      0.714       0.71      0.605
               train        128          3      0.696          1      0.995      0.632
               truck        128         12      0.432      0.333      0.474      0.288
                boat        128          6      0.404      0.333      0.464       0.14
       traffic light        128         14      0.523      0.159      0.243      0.161
           stop sign        128          2      0.576        0.5      0.828      0.663
               bench        128          9      0.704      0.444      0.572      0.237
                bird        128         16        0.8          1      0.988      0.647
                 cat        128          4      0.941       0.75      0.828      0.714
                 dog        128          9      0.761      0.667      0.852      0.527
               horse        128          2       0.58          1      0.995      0.697
            elephant        128         17      0.942      0.882      0.943      0.693
                bear        128          1      0.386          1      0.995      0.995
               zebra        128          4      0.827          1      0.995      0.908
             giraffe        128          9      0.741      0.889      0.851      0.597
            backpack        128          6      0.661      0.333      0.496      0.211
            umbrella        128         18      0.615      0.611      0.731      0.395
             handbag        128         19      0.514      0.105       0.18      0.112
                 tie        128          7      0.606      0.571      0.683      0.463
            suitcase        128          4      0.709          1      0.995       0.54
             frisbee        128          5       0.54        0.8      0.798      0.705
                skis        128          1      0.516          1      0.995      0.497
           snowboard        128          7      0.868      0.714      0.767      0.555
         sports ball        128          6      0.531        0.5      0.581      0.325
                kite        128         10      0.584      0.563      0.564      0.208
        baseball bat        128          4      0.411        0.5      0.283     0.0876
      baseball glove        128          7      0.339      0.429      0.366      0.222
          skateboard        128          5      0.795      0.779      0.735        0.5
       tennis racket        128          7      0.442      0.571      0.551      0.314
              bottle        128         18      0.461        0.5      0.476      0.289
          wine glass        128         16      0.585      0.795       0.74      0.386
                 cup        128         36      0.822      0.361        0.5      0.319
                fork        128          6      0.571      0.237      0.341      0.226
               knife        128         16      0.523      0.625      0.674      0.452
               spoon        128         22      0.602        0.5      0.532       0.26
                bowl        128         28      0.668      0.571       0.63      0.448
              banana        128          1      0.147          1      0.166     0.0498
            sandwich        128          2          0          0      0.133      0.105
              orange        128          4          1          0      0.545      0.151
            broccoli        128         11      0.298       0.31      0.236      0.205
              carrot        128         24      0.481      0.583       0.63      0.425
             hot dog        128          2      0.462          1      0.497      0.497
               pizza        128          5      0.599          1      0.824      0.566
               donut        128         14      0.675          1      0.946      0.848
                cake        128          4      0.698          1      0.895      0.704
               chair        128         35      0.408      0.543       0.46      0.221
               couch        128          6          1      0.481      0.829      0.504
        potted plant        128         14      0.796      0.786       0.82      0.467
                 bed        128          3      0.992      0.333      0.753      0.269
        dining table        128         13      0.571      0.462      0.438      0.242
              toilet        128          2      0.388        0.5      0.554      0.487
                  tv        128          2      0.672          1      0.995      0.846
              laptop        128          3          1          0      0.415      0.193
               mouse        128          2          1          0     0.0375       0.03
              remote        128          8      0.596      0.625      0.636      0.506
          cell phone        128          8      0.579      0.375      0.389      0.182
           microwave        128          3      0.343          1      0.995      0.786
                oven        128          5      0.301        0.4      0.432      0.249
                sink        128          6      0.338      0.167      0.294      0.168
        refrigerator        128          5       0.69        0.8      0.815      0.506
                book        128         29        0.5      0.207      0.295      0.125
               clock        128          9      0.787      0.778      0.898      0.589
                vase        128          2      0.181          1      0.663      0.597
            scissors        128          1          1          0     0.0332    0.00663
          teddy bear        128         21      0.814      0.418      0.608       0.35
          toothbrush        128          5      0.704        0.6      0.739      0.191
Results saved to runs/train/exp
Destroying process group... 
root@6f70428df512:/usr/src/app# 
AyushExel commented 3 years ago

@glenn-jocher OK, I was able to reproduce. It occurs on manually choosing option 3. I think I know the source of the problem. I'll push a fix.

AyushExel commented 3 years ago

@glenn-jocher OK, I found the root cause of the problem. The import checks are happening in loggers/__init__.py, which makes the checks in wandb_utils.py redundant. I've moved the checks to __init__.py now. The PR should fix the problem.
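
A minimal sketch of the kind of import-time guard described here (illustrative only, not the exact PR): the loggers package decides once whether wandb is usable, so utility modules never fall through to the interactive login prompt under DDP:

try:
    import wandb

    assert hasattr(wandb, '__version__')  # verify it is the wandb package, not a local dir
except (ImportError, AssertionError):
    wandb = None  # downstream code checks 'if wandb is None' and skips W&B logging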

Also, it'd be nice to catch these problems early during CI checks, but that's currently limited because there's no backdoor to stop/resume runs during tests. I think setting up a revamped testing suite to test DDP, integrations, etc. might be worth it!
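
One possible shape for such a check is a pytest-style smoke test that launches a short DDP run in a subprocess (a hypothetical sketch, assuming a CI runner with 2 GPUs; the timeout turns a destroy_process_group() hang into a test failure rather than a stuck job):

import subprocess
import sys

def test_ddp_smoke():
    cmd = [
        sys.executable, '-m', 'torch.distributed.run', '--nproc_per_node', '2',
        'train.py', '--data', 'coco128.yaml', '--epochs', '1', '--device', '0,1',
    ]
    # fail CI on a non-zero exit code or on a hang that exceeds the timeout
    subprocess.run(cmd, check=True, timeout=1800)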

glenn-jocher commented 3 years ago

@Yoon5 good news 😃! Your original issue may now be fixed ✅ in PR #5163 by @AyushExel. To receive this update:

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

Zegorax commented 3 years ago

I have the same problem using Docker. However, I run the container with the env variable WANDB_API_KEY, which does not require wandb login. Using this method, the training hangs at the end.

glenn-jocher commented 3 years ago

@Zegorax @AyushExel this issue should be resolved in #5163, so please ensure you are using the very latest code or Docker image. To pull the latest docker image use sudo docker pull ultralytics/yolov5:latest. If you are still experiencing an issue with the very latest image please let us know and we will reopen, thanks!

Zegorax commented 3 years ago

@glenn-jocher I'm using the latest version of the code, the problem is still present even with fix #5163

AyushExel commented 3 years ago

@Zegorax are you reaching the end of training? I tried to reproduce this on the latest version of the repo and I'm getting this error:

Traceback (most recent call last):
  File "train.py", line 627, in <module>
    main(opt)
  File "train.py", line 524, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 249, in train
    nl = model.model[-1].nl  # number of detection layers (to scale hyps)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'model'
Traceback (most recent call last):
  File "train.py", line 627, in <module>
    main(opt)
  File "train.py", line 524, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 249, in train
    nl = model.model[-1].nl  # number of detection layers (to scale hyps)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'model'

This seems unrelated to W&B, as I've disabled it.
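
For context, the usual fix for this kind of error is to read attributes from the underlying module when the model is DP/DDP-wrapped, roughly like this (a hedged sketch, not the exact patch that landed):

import torch.nn as nn

def de_parallel(model):
    # return the underlying module if the model is DataParallel/DDP-wrapped, else the model itself
    if isinstance(model, (nn.DataParallel, nn.parallel.DistributedDataParallel)):
        return model.module
    return model

# nl = de_parallel(model).model[-1].nl  # number of detection layers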

glenn-jocher commented 3 years ago

@AyushExel the nl error is probably caused by my recent autobatch PR https://github.com/ultralytics/yolov5/pull/5092. I will investigate.

EDIT: this comes back to our general lack of DDP CI. It's an open issue; we still don't have a solution for it.

Zegorax commented 3 years ago

@glenn-jocher @AyushExel It's the exact same problem as the original issue. The training finishes, but the process hangs and never returns
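
For background, this hang happens at the final DDP teardown. A minimal sketch of that cleanup, assuming a standard torch.distributed setup (illustrative only):

import torch.distributed as dist

if dist.is_available() and dist.is_initialized():
    # every rank is expected to reach this call at the end of training; if one
    # rank is still blocked (e.g. on a wandb prompt or upload), the job appears
    # to hang here even though results have already been saved
    dist.destroy_process_group()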

AyushExel commented 3 years ago

@Zegorax you're probably on an older version of the repo, because the latest version has another bug which won't let training start. Try running git pull inside your yolov5 directory.

@glenn-jocher sure. Let me know once the issue is fixed and I'll try to confirm if the wandb issue still exists

Zegorax commented 3 years ago

@Zegorax you're probably on an older version of the repo, because the latest version has another bug which won't let training start. Try running git pull inside your yolov5 directory.

@glenn-jocher sure. Let me know once the issue is fixed and I'll try to confirm if the wandb issue still exists

As I said earlier, no I'm not. I'm using the latest YOLOv5 on the master branch.

glenn-jocher commented 3 years ago

@AyushExel nl bug fixed in #5332. Verified with:

python -m torch.distributed.run --nproc_per_node 2 --master_port 1 train.py --epochs 3 --device 0,1

Please wait 15 min for Docker Autobuild to complete and deploy this latest merge, then update your Docker image with:

t=ultralytics/yolov5:latest && sudo docker pull $t && sudo docker run -it --ipc=host --gpus all $t
AyushExel commented 3 years ago

@Zegorax @glenn-jocher I just tested using the latest master branch; I can run DDP with wandb disabled without any hang. I'll test it on the Docker image once that is available too. Traceback for the command python -m torch.distributed.launch --nproc_per_node 2 train.py --data coco128.yaml --epochs 3 --device 0,1:

wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) 3
wandb: You chose 'Don't visualize my results'
train: weights=yolov5s.pt, cfg=, data=coco128.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=3, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, patience=100, freeze=0, save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v6.0-35-ga4fece8 torch 1.9.0+cu102 CUDA:0 (Tesla V100-SXM2-16GB, 16160.5MB)
                                             CUDA:1 (Tesla V100-SXM2-16GB, 16160.5MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for 2 nodes.
hyperparameters: lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs (RECOMMENDED)
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
2021-10-25 14:05:44.253930: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]              
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  2    115712  models.common.C3                        [128, 128, 2]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  3    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]                 
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]                 
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 270 layers, 7235389 parameters, 7235389 gradients

Transferred 349/349 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight, 60 weight (no decay), 60 bias
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s][W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
Plotting labels... 
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]

autoanchor: Analyzing anchors... anchors/target = 4.26, Best Possible Recall (BPR) = 0.9946
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp2
Starting training for 3 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       0/2     1.87G   0.04361   0.07627   0.02057       131       640:  12%|███████████████                                                                                                         | 1/8 [00:04<00:32,  4.66s/it]Reducer buckets have been rebuilt in this iteration.
       0/2     6.27G   0.04354   0.06284   0.02263        95       640: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:05<00:00,  1.44it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:03<00:00,  2.02it/s]
                 all        128        929      0.679      0.536      0.623      0.407

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       1/2      6.3G   0.04465   0.06803   0.02394       117       640: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00,  9.52it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 12.00it/s]
                 all        128        929      0.683      0.549      0.632      0.412

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       2/2      6.3G   0.04337   0.07029   0.02061        91       640: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00,  9.46it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 11.84it/s]
                 all        128        929      0.616      0.599      0.634      0.415

3 epochs completed in 0.004 hours.
Optimizer stripped from runs/train/exp2/weights/last.pt, 14.8MB
Optimizer stripped from runs/train/exp2/weights/best.pt, 14.8MB

Validating runs/train/exp2/weights/best.pt...
Fusing layers... 
Model Summary: 213 layers, 7225885 parameters, 0 gradients
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00,  5.33it/s]
                 all        128        929      0.616      0.599      0.634      0.416
              person        128        254      0.715       0.74      0.774      0.508
             bicycle        128          6      0.542      0.667      0.561      0.326
                 car        128         46      0.598       0.37      0.461      0.201
          motorcycle        128          5       0.66        0.6      0.812      0.611
            airplane        128          6          1      0.949      0.995      0.764
                 bus        128          7      0.536      0.714       0.71      0.605
               train        128          3      0.695          1      0.995      0.632
               truck        128         12      0.415      0.333      0.473      0.288
                boat        128          6      0.402      0.333      0.465       0.14
       traffic light        128         14      0.525      0.161      0.243      0.161
           stop sign        128          2      0.575        0.5      0.828      0.663
               bench        128          9      0.704      0.444      0.572      0.237
                bird        128         16        0.8          1      0.988      0.647
                 cat        128          4      0.938       0.75      0.828      0.714
                 dog        128          9       0.76      0.667      0.852      0.527
               horse        128          2      0.577          1      0.995      0.697
            elephant        128         17      0.941      0.882      0.943      0.693
                bear        128          1      0.386          1      0.995      0.995
               zebra        128          4      0.827          1      0.995      0.908
             giraffe        128          9       0.74      0.889      0.851      0.613
            backpack        128          6      0.656      0.333      0.496      0.214
            umbrella        128         18      0.613      0.611      0.731      0.395
             handbag        128         19      0.513      0.105      0.179      0.112
                 tie        128          7      0.605      0.571      0.687      0.461
            suitcase        128          4      0.709          1      0.995       0.54
             frisbee        128          5      0.539        0.8      0.798      0.705
                skis        128          1      0.515          1      0.995      0.497
           snowboard        128          7      0.868      0.714      0.767      0.554
         sports ball        128          6      0.528        0.5      0.581      0.325
                kite        128         10      0.587       0.57      0.564      0.208
        baseball bat        128          4      0.409        0.5      0.282      0.087
      baseball glove        128          7      0.339      0.429      0.366      0.222
          skateboard        128          5      0.794      0.777      0.735        0.5
       tennis racket        128          7      0.442      0.571      0.571      0.324
              bottle        128         18       0.46        0.5      0.477       0.29
          wine glass        128         16      0.604      0.859       0.78      0.404
                 cup        128         36      0.823      0.361      0.504       0.32
                fork        128          6      0.573      0.239      0.341      0.226
               knife        128         16      0.521      0.625      0.674      0.452
               spoon        128         22      0.598        0.5      0.532       0.26
                bowl        128         28      0.654      0.571       0.63      0.448
              banana        128          1      0.146          1      0.166     0.0498
            sandwich        128          2          0          0      0.133      0.105
              orange        128          4          1          0      0.545      0.151
            broccoli        128         11      0.299      0.311      0.236      0.204
              carrot        128         24      0.479      0.583      0.631      0.426
             hot dog        128          2      0.463          1      0.497      0.497
               pizza        128          5      0.599          1      0.824      0.566
               donut        128         14      0.675          1      0.946       0.85
                cake        128          4      0.697          1      0.895      0.692
               chair        128         35      0.407      0.543       0.46      0.221
               couch        128          6          1      0.482      0.829      0.504
        potted plant        128         14      0.796      0.786      0.819       0.46
                 bed        128          3      0.982      0.333      0.753      0.269
        dining table        128         13      0.569      0.462      0.438      0.242
              toilet        128          2      0.387        0.5      0.557       0.49
                  tv        128          2      0.672          1      0.995      0.846
              laptop        128          3          1          0      0.415      0.193
               mouse        128          2          1          0     0.0357     0.0285
              remote        128          8      0.596      0.625      0.636      0.506
          cell phone        128          8      0.578      0.375      0.421      0.195
           microwave        128          3      0.343          1      0.995      0.786
                oven        128          5        0.3        0.4      0.432      0.249
                sink        128          6      0.338      0.167       0.29      0.167
        refrigerator        128          5       0.69        0.8      0.815      0.506
                book        128         29      0.535      0.239      0.296      0.125
               clock        128          9      0.787      0.778      0.895      0.584
                vase        128          2       0.18          1      0.663      0.597
            scissors        128          1          1          0     0.0332    0.00663
          teddy bear        128         21      0.814      0.419      0.608      0.349
          toothbrush        128          5      0.701        0.6      0.739      0.194
Results saved to runs/train/exp2
Destroying process group... 
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py:71: FutureWarning: This is an experimental API and will be changed in future.
  "This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0004150867462158203 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "6946", "role": "default", "hostname": "ac-vm2.c.playground-111.internal", "state": "SUCCEEDED", "total_run_time": 45, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [2]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "6947", "role": "default", "hostname": "ac-vm2.c.playground-111.internal", "state": "SUCCEEDED", "total_run_time": 45, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [1], \"role_rank\": [1], \"role_world_size\": [2]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "ac-vm2.c.playground-111.internal", "state": "SUCCEEDED", "total_run_time": 45, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}}
(base) jupyter@ac-vm2:~/yolov5$ 
Zegorax commented 3 years ago

@AyushExel Can you check again using Docker and the env variable I've mentioned earlier?

AyushExel commented 3 years ago

@Zegorax just tested the latest Docker image with the command - python -m torch.distributed.launch --nproc_per_node 1 train.py --data coco128.yaml --epochs 3

wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) 3
wandb: You chose 'Don't visualize my results'
train: weights=yolov5s.pt, cfg=, data=coco128.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=3, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, patience=100, freeze=0, save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v6.0-35-ga4fece8 torch 1.9.1+cu102 CUDA:0 (Tesla V100-SXM2-16GB, 16160.5MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for 1 nodes.
hyperparameters: lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs (RECOMMENDED)
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

WARNING: Dataset not found, nonexistent paths: ['/usr/src/datasets/coco128/images/train2017']
Downloading https://github.com/ultralytics/yolov5/releases/download/v1.0/coco128.zip to coco128.zip...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6.66M/6.66M [00:00<00:00, 114MB/s]
Dataset autodownload success, saved to ../datasets

Downloading https://github.com/ultralytics/yolov5/releases/download/v6.0/yolov5s.pt to yolov5s.pt...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14.0M/14.0M [00:00<00:00, 74.4MB/s]

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]              
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  2    115712  models.common.C3                        [128, 128, 2]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  3    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]                 
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]                 
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 270 layers, 7235389 parameters, 7235389 gradients, 16.5 GFLOPs

Transferred 349/349 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight, 60 weight (no decay), 60 bias
train: Scanning '../datasets/coco128/labels/train2017' images and labels...128 found, 0 missing, 2 empty, 0 corrupted: 100%|███████████████████████████████████████████████████| 128/128 [00:00<00:00, 3608.22it/s]
train: New cache created: ../datasets/coco128/labels/train2017.cache
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
Plotting labels... 
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]

autoanchor: Analyzing anchors... anchors/target = 4.26, Best Possible Recall (BPR) = 0.9946
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp
Starting training for 3 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       0/2     2.53G   0.04048   0.07428   0.02063       210       640:  12%|█████████████                                                                                           | 1/8 [00:04<00:29,  4.27s/it]Reducer buckets have been rebuilt in this iteration.
       0/2     6.65G   0.04252   0.06146   0.02085       232       640: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:05<00:00,  1.47it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.30it/s]
                 all        128        929       0.67      0.533      0.621      0.406

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       1/2     6.52G   0.04501    0.0647    0.0191       180       640: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00,  8.40it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  6.39it/s]
                 all        128        929      0.697      0.536       0.63      0.415

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       2/2     6.52G   0.04566   0.06475   0.02024       305       640: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00,  8.41it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  6.32it/s]
                 all        128        929      0.705      0.547      0.634      0.418

3 epochs completed in 0.004 hours.
Optimizer stripped from runs/train/exp/weights/last.pt, 14.9MB
Optimizer stripped from runs/train/exp/weights/best.pt, 14.9MB

Validating runs/train/exp/weights/best.pt...
Fusing layers... 
Model Summary: 213 layers, 7225885 parameters, 0 gradients, 16.5 GFLOPs
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.50it/s]
                 all        128        929      0.705      0.547      0.634      0.418
              person        128        254      0.807      0.681      0.774      0.509
             bicycle        128          6      0.748      0.496      0.545      0.318
                 car        128         46      0.801      0.351      0.484      0.213
          motorcycle        128          5      0.673        0.6        0.8      0.633
            airplane        128          6          1       0.83      0.995      0.721
                 bus        128          7      0.649      0.714      0.694       0.59
               train        128          3       0.81          1      0.995      0.632
               truck        128         12      0.605       0.25      0.445      0.251
                boat        128          6      0.737      0.333       0.47      0.139
       traffic light        128         14      0.588      0.143      0.238      0.149
           stop sign        128          2      0.628        0.5      0.828      0.663
               bench        128          9      0.907      0.444      0.572      0.244
                bird        128         16      0.888      0.991      0.988      0.652
                 cat        128          4          1      0.729      0.836      0.691
                 dog        128          9      0.918      0.667      0.887      0.547
               horse        128          2      0.706          1      0.995      0.697
            elephant        128         17          1      0.927      0.946       0.69
                bear        128          1      0.523          1      0.995      0.995
               zebra        128          4      0.849          1      0.995      0.952
             giraffe        128          9      0.813      0.778      0.869      0.576
            backpack        128          6          1      0.307      0.452      0.201
            umbrella        128         18      0.725      0.556      0.723      0.395
             handbag        128         19       0.66      0.103      0.169       0.11
                 tie        128          7        0.9      0.571      0.691      0.466
            suitcase        128          4          1      0.995      0.995      0.621
             frisbee        128          5      0.655        0.8      0.798      0.694
                skis        128          1      0.618          1      0.995      0.497
           snowboard        128          7          1      0.705      0.766      0.558
         sports ball        128          6       0.66        0.5      0.622      0.341
                kite        128         10      0.558        0.5      0.557      0.204
        baseball bat        128          4      0.392        0.5      0.275      0.136
      baseball glove        128          7      0.467      0.381      0.327      0.197
          skateboard        128          5      0.753      0.612      0.792      0.557
       tennis racket        128          7      0.537      0.571      0.538      0.299
              bottle        128         18       0.65      0.389      0.484      0.286
          wine glass        128         16      0.772      0.875      0.853      0.397
                 cup        128         36      0.853      0.361      0.493      0.294
                fork        128          6      0.379      0.167      0.252      0.194
               knife        128         16      0.893      0.625      0.667      0.447
               spoon        128         22      0.809      0.387      0.529      0.256
                bowl        128         28       0.75      0.571      0.617      0.462
              banana        128          1          0          0      0.142     0.0284
            sandwich        128          2          0          0     0.0957     0.0743
              orange        128          4          1          0      0.578      0.189
            broccoli        128         11      0.422      0.182      0.314      0.273
              carrot        128         24      0.717      0.542      0.635      0.383
             hot dog        128          2      0.399      0.698      0.497      0.465
               pizza        128          5      0.629          1      0.831      0.594
               donut        128         14      0.693          1      0.952      0.823
                cake        128          4       0.73          1      0.895      0.713
               chair        128         35      0.486      0.514      0.496      0.228
               couch        128          6      0.726      0.333      0.801      0.453
        potted plant        128         14      0.793      0.714      0.807      0.448
                 bed        128          3          1          0      0.746      0.275
        dining table        128         13      0.835      0.462      0.476      0.296
              toilet        128          2      0.456        0.5      0.566      0.496
                  tv        128          2      0.752          1      0.995      0.846
              laptop        128          3          1          0      0.426      0.185
               mouse        128          2          1          0     0.0268     0.0215
              remote        128          8      0.714      0.625      0.635      0.506
          cell phone        128          8      0.594      0.195      0.427      0.201
           microwave        128          3      0.403          1      0.995      0.721
                oven        128          5      0.365        0.4      0.427      0.248
                sink        128          6      0.344      0.167      0.265      0.161
        refrigerator        128          5      0.704        0.8      0.813      0.455
                book        128         29      0.603      0.138      0.294      0.132
               clock        128          9      0.871      0.778       0.91      0.599
                vase        128          2      0.287          1      0.663      0.597
            scissors        128          1          1          0     0.0302    0.00603
          teddy bear        128         21      0.851      0.381      0.613      0.344
          toothbrush        128          5          1      0.477      0.705      0.436
Results saved to runs/train/exp
Zegorax commented 3 years ago

@AyushExel Can you try to reproduce it using a zero-interaction method (DEBIAN_FRONTEND=noninteractive) and by using only predefined options when launching the script?
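
For reference, a fully non-interactive launch could look roughly like this (the image tag, key value and dataset are placeholders, and the exact flags depend on your setup):

docker run --rm --gpus all --ipc=host \
  -e DEBIAN_FRONTEND=noninteractive \
  -e WANDB_API_KEY=your-key-here \
  ultralytics/yolov5:latest \
  python -m torch.distributed.run --nproc_per_node 2 train.py --data coco128.yaml --epochs 3 --device 0,1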

AyushExel commented 3 years ago

@Zegorax I can't repro. Will you please paste your output?

Zegorax commented 3 years ago

@AyushExel The training happens normally. Only at the end, the process never returns and I have to Ctrl-C it manually (therefore, the Jenkins job runs forever).


wandb: - 91.69MB of 91.69MB uploaded (0.00MB deduped)
wandb:                                                                                
wandb: Run history:
wandb:        metrics/mAP_0.5 ▁▁▁▁▁▂▁▂▃▄▃▃▄▄▄▅▆▆▆▅▆▇▇▇▆▇▇▇▇██▇████████
wandb:   metrics/mAP_0.5:0.95 ▁▁▁▁▁▁▁▂▂▃▂▃▃▃▃▄▅▅▅▄▆▆▆▆▅▇▇▇▇▇▇▇████████
wandb:      metrics/precision ▁▁▁▂█▅▆▅▆▅▅▄▅▅▆▆▇▇▇▆▇▇█▇▇██▇██▇▇▇▇▇▇▇█▇█
wandb:         metrics/recall ▁▁▁▁▂▃▂▃▃▄▄▄▅▄▅▆▆▆▆▅▆▆▇▆▆▇▇▇▇▇▇▇███████▇
wandb:         train/box_loss ██▇▆▅▅▄▄▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:         train/cls_loss █▇▆▅▄▄▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:         train/obj_loss █▄▄▅▄▄▄▃▃▃▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▁▁▁▁▁▁▁▁▁▁
wandb:           val/box_loss ██▇█▆▅▇▄▄▄▄▄▃▃▃▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:           val/cls_loss █▇▇▇▄▄▃▃▃▃▄▄▃▂▃▂▂▁▁▂▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:           val/obj_loss ▇▇███▆▆▇▅▅▆▄▆▆▅▃▂▃▃▄▃▂▂▂▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁
wandb:                  x/lr0 ▁▂▂▃▄▄▅▆▆▇▇████▇▇▇▆▆▆▅▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▂▂▂
wandb:                  x/lr1 ▁▂▂▃▄▄▅▆▆▇▇████▇▇▇▆▆▆▅▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▂▂▂
wandb:                  x/lr2 ██▇▇▆▅▅▄▄▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: 
wandb: Run summary:
wandb:        metrics/mAP_0.5 0.63516
wandb:   metrics/mAP_0.5:0.95 0.36128
wandb:      metrics/precision 0.69888
wandb:         metrics/recall 0.59514
wandb:         train/box_loss 0.03237
wandb:         train/cls_loss 0.00519
wandb:         train/obj_loss 0.00732
wandb:           val/box_loss 0.03155
wandb:           val/cls_loss 0.00958
wandb:           val/obj_loss 0.00619
wandb:                  x/lr0 0.00101
wandb:                  x/lr1 0.00101
wandb:                  x/lr2 0.00101
wandb: 
wandb: Synced 5 W&B file(s), 337 media file(s), 1 artifact file(s) and 1 other file(s)
wandb: Synced model_25-10-2021_16-16-13: https://self-hosted-wandb-url-goes-here
wandb: Find logs at: ./wandb/run-20211025_161621-3az1nhlb/logs/debug.log
wandb: 
Results saved to model/model_25-10-2021_16-16-13
Destroying process group... 
Sending interrupt signal to process
Terminated

script returned exit code 143
AyushExel commented 3 years ago

@Zegorax that's very strange. With wandb disabled, you should not see any wandb terminal logs
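
For reference, wandb can also be switched off entirely before launching; this is a sketch using standard wandb mechanisms, not YOLOv5-specific flags:

wandb disabled   # persistently disable logging for this user
# or per run, via an environment variable:
WANDB_MODE=disabled python -m torch.distributed.run --nproc_per_node 2 train.py --data coco128.yaml --epochs 3 --device 0,1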

Zegorax commented 3 years ago

@AyushExel Should I create a new issue? Because I need to have W&B enabled

AyushExel commented 3 years ago

@Zegorax oh okay, I thought we were just talking about wandb disabled. I'll check with wandb enabled

AyushExel commented 3 years ago

@Zegorax it worked with wandb enabled

wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) 2
wandb: You chose 'Use an existing W&B account'
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter: 
wandb: Appending key for api.wandb.ai to your netrc file: /home/jupyter/.netrc
train: weights=yolov5s.pt, cfg=, data=coco128.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=3, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, patience=100, freeze=0, save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v6.0-35-ga4fece8 torch 1.9.0+cu102 CUDA:0 (Tesla V100-SXM2-16GB, 16160.5MB)
                                             CUDA:1 (Tesla V100-SXM2-16GB, 16160.5MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for 2 nodes.
hyperparameters: lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
2021-10-26 07:18:39.051059: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
wandb: Currently logged in as: cayush (use `wandb login --relogin` to force relogin)
2021-10-26 07:18:42.738633: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
wandb: Tracking run with wandb version 0.12.5
wandb: Syncing run iconic-eon-745
wandb: ⭐️ View project at https://wandb.ai/cayush/yoloV5
wandb: 🚀 View run at https://wandb.ai/cayush/yoloV5/runs/dvets0rr
wandb: Run data is saved locally in /home/jupyter/yolov5/wandb/run-20211026_071841-dvets0rr
wandb: Run `wandb offline` to turn off syncing.

InvalidVersionSpec: Invalid version '1.0<2': invalid character(s)

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]              
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  2    115712  models.common.C3                        [128, 128, 2]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  3    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]                 
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]                 
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 270 layers, 7235389 parameters, 7235389 gradients

Transferred 349/349 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight, 60 weight (no decay), 60 bias
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Plotting labels... 
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████| 128/128 [00:02<?, ?it/s]

autoanchor: Analyzing anchors... anchors/target = 4.26, Best Possible Recall (BPR) = 0.9946
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp3
Starting training for 3 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       0/2     1.89G   0.04361   0.07621   0.02057       131       640:  12%|███████████████                                                                                                         | 1/8 [00:04<00:32,  4.64s/it]Reducer buckets have been rebuilt in this iteration.
       0/2     6.28G   0.04354   0.06284   0.02263        95       640: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:05<00:00,  1.45it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:05<00:00,  1.53it/s]
                 all        128        929      0.678      0.535      0.622      0.407

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       1/2     6.43G   0.04465   0.06804   0.02394       117       640: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00,  8.01it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00,  4.01it/s]
                 all        128        929      0.682      0.548      0.632      0.411

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       2/2     6.43G   0.04337   0.07026   0.02062        91       640: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00,  9.37it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00,  3.88it/s]
                 all        128        929      0.619      0.598      0.634      0.416

3 epochs completed in 0.006 hours.
Optimizer stripped from runs/train/exp3/weights/last.pt, 14.8MB
Optimizer stripped from runs/train/exp3/weights/best.pt, 14.8MB

Validating runs/train/exp3/weights/best.pt...
Fusing layers... 
Model Summary: 213 layers, 7225885 parameters, 0 gradients
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00,  2.83it/s]
                 all        128        929      0.618        0.6      0.635      0.417
              person        128        254       0.72      0.736      0.774      0.508
             bicycle        128          6      0.546      0.667      0.561      0.326
                 car        128         46      0.598       0.37      0.461      0.201
          motorcycle        128          5       0.66        0.6      0.812      0.637
            airplane        128          6          1      0.947      0.995      0.764
                 bus        128          7      0.536      0.714       0.71      0.605
               train        128          3      0.696          1      0.995      0.632
               truck        128         12      0.427      0.333      0.473      0.288
                boat        128          6      0.403      0.333      0.465       0.14
       traffic light        128         14      0.521      0.158       0.24       0.16
           stop sign        128          2      0.576        0.5      0.828      0.663
               bench        128          9      0.702      0.444      0.571      0.236
                bird        128         16      0.801          1      0.988      0.647
                 cat        128          4      0.939       0.75      0.828      0.714
                 dog        128          9      0.761      0.667      0.852      0.527
               horse        128          2      0.579          1      0.995      0.697
            elephant        128         17      0.942      0.882      0.943      0.692
                bear        128          1      0.388          1      0.995      0.995
               zebra        128          4      0.827          1      0.995      0.908
             giraffe        128          9       0.74      0.889      0.851      0.613
            backpack        128          6      0.661      0.333      0.496      0.214
            umbrella        128         18      0.617      0.611      0.731      0.395
             handbag        128         19      0.514      0.105      0.179      0.112
                 tie        128          7      0.605      0.571      0.687      0.461
            suitcase        128          4      0.709          1      0.995       0.54
             frisbee        128          5       0.54        0.8      0.798      0.705
                skis        128          1      0.516          1      0.995      0.497
           snowboard        128          7      0.869      0.714      0.767      0.555
         sports ball        128          6      0.531        0.5      0.581      0.325
                kite        128         10      0.583      0.561      0.564      0.206
        baseball bat        128          4      0.411        0.5      0.282      0.087
      baseball glove        128          7      0.339      0.429      0.366      0.222
          skateboard        128          5      0.795      0.779      0.736        0.5
       tennis racket        128          7      0.442      0.571      0.571      0.324
              bottle        128         18      0.461        0.5      0.476       0.29
          wine glass        128         16      0.678       0.92      0.885      0.415
                 cup        128         36      0.824      0.361      0.504       0.32
                fork        128          6      0.567      0.234      0.341      0.226
               knife        128         16      0.523      0.625      0.674      0.452
               spoon        128         22      0.602        0.5      0.532       0.26
                bowl        128         28      0.668      0.571       0.63      0.448
              banana        128          1      0.147          1      0.166     0.0498
            sandwich        128          2          0          0      0.133      0.105
              orange        128          4          1          0      0.545      0.151
            broccoli        128         11      0.298      0.311      0.236      0.205
              carrot        128         24      0.481      0.583      0.631      0.425
             hot dog        128          2      0.463          1      0.497      0.497
               pizza        128          5      0.599          1      0.824      0.566
               donut        128         14      0.675          1      0.946       0.85
                cake        128          4      0.698          1      0.895      0.704
               chair        128         35      0.408      0.543       0.46      0.221
               couch        128          6          1      0.482      0.829      0.504
        potted plant        128         14      0.795      0.786      0.819      0.467
                 bed        128          3      0.992      0.333      0.753      0.269
        dining table        128         13      0.571      0.462      0.438      0.242
              toilet        128          2      0.388        0.5      0.557       0.49
                  tv        128          2      0.672          1      0.995      0.846
              laptop        128          3          1          0      0.415      0.193
               mouse        128          2          1          0     0.0375       0.03
              remote        128          8      0.596      0.625      0.636      0.506
          cell phone        128          8      0.579      0.375      0.392      0.184
           microwave        128          3      0.343          1      0.995      0.786
                oven        128          5      0.301        0.4      0.432      0.249
                sink        128          6      0.338      0.167      0.294      0.168
        refrigerator        128          5       0.69        0.8      0.815      0.506
                book        128         29      0.524      0.229      0.295      0.125
               clock        128          9      0.787      0.778      0.895      0.588
                vase        128          2      0.181          1      0.663      0.597
            scissors        128          1          1          0     0.0332    0.00663
          teddy bear        128         21      0.814      0.418      0.608      0.349
          toothbrush        128          5      0.703        0.6      0.739      0.191

wandb: Waiting for W&B process to finish, PID 4215... (success).
wandb:                                                                                
wandb: Run history:
wandb:        metrics/mAP_0.5 ▁▆█
wandb:   metrics/mAP_0.5:0.95 ▁▄█
wandb:      metrics/precision ▇█▁
wandb:         metrics/recall ▁▂█
wandb:         train/box_loss ▂█▁
wandb:         train/cls_loss ▅█▁
wandb:         train/obj_loss ▁▆█
wandb:           val/box_loss █▄▁
wandb:           val/cls_loss █▄▁
wandb:           val/obj_loss █▅▁
wandb:                  x/lr0 ▁█▂
wandb:                  x/lr1 ▁█▂
wandb:                  x/lr2 █▅▁
wandb: 
wandb: Run summary:
wandb:        metrics/mAP_0.5 0.63423
wandb:   metrics/mAP_0.5:0.95 0.4157
wandb:      metrics/precision 0.61857
wandb:         metrics/recall 0.59793
wandb:         train/box_loss 0.04337
wandb:         train/cls_loss 0.02062
wandb:         train/obj_loss 0.07026
wandb:           val/box_loss 0.04014
wandb:           val/cls_loss 0.01355
wandb:           val/obj_loss 0.0422
wandb:                  x/lr0 7e-05
wandb:                  x/lr1 7e-05
wandb:                  x/lr2 0.09777
wandb: 
wandb: Synced 6 W&B file(s), 113 media file(s), 1 artifact file(s) and 1 other file(s)
wandb: Synced iconic-eon-745: https://wandb.ai/cayush/yoloV5/runs/dvets0rr
wandb: Find logs at: ./wandb/run-20211026_071841-dvets0rr/logs/debug.log
wandb: 
Results saved to runs/train/exp3
Destroying process group... 
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py:71: FutureWarning: This is an experimental API and will be changed in future.
  "This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0005068778991699219 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "4116", "role": "default", "hostname": "ac-vm2.c.playground-111.internal", "state": "SUCCEEDED", "total_run_time": 100, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [2]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "4117", "role": "default", "hostname": "ac-vm2.c.playground-111.internal", "state": "SUCCEEDED", "total_run_time": 100, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [1], \"role_rank\": [1], \"role_world_size\": [2]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "ac-vm2.c.playground-111.internal", "state": "SUCCEEDED", "total_run_time": 100, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}}
(base) jupyter@ac-vm2:~/yolov5$ 

What version of the wandb client are you using? Please try updating it with `pip install --upgrade wandb` and let me know if you still see this problem.
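
For reference, a minimal way to check the installed client version and then upgrade it (standard pip/wandb commands, shown only as a sketch):

# print the currently installed wandb client version
python -c "import wandb; print(wandb.__version__)"
# upgrade to the latest release
pip install --upgrade wandb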

Zegorax commented 3 years ago

@AyushExel I'm also using the latest version of W&B. My system is based on a Jenkins job, so everything is reinstalled on each run and always uses the latest version of all repos.

Zegorax commented 3 years ago

@AyushExel Can you try to repro using a non-interactive environment? By setting `WANDB_API_KEY=your-key`, for example.
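
For example, a non-interactive DDP run could be launched roughly like this (a sketch only; the image tag, GPU indices and your-key are placeholders):

# pass the API key into the container so wandb never prompts for interactive input
docker run --ipc=host --gpus all -e WANDB_API_KEY=your-key ultralytics/yolov5:latest \
  python -m torch.distributed.run --nproc_per_node 2 train.py \
  --batch 64 --data coco128.yaml --weights yolov5s.pt --device 0,1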

Zegorax commented 3 years ago

@AyushExel Have you been able to reproduce the problem?

AyushExel commented 3 years ago

@Zegorax yes, I ran this in a non-interactive Docker environment and the process finished successfully.

wandb: Currently logged in as: cayush (use `wandb login --relogin` to force relogin)
train: weights=yolov5s.pt, cfg=, data=data/coco128.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=2, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, patience=100, freeze=0, save_period=-1, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
YOLOv5 🚀 v6.0-43-g19c8760 torch 1.10.0+cu102 CUDA:0 (Tesla V100-SXM2-16GB, 16160.5MB)

hyperparameters: lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
wandb: Tracking run with wandb version 0.12.6
wandb: Syncing run visionary-galaxy-746
wandb:  View project at https://wandb.ai/cayush/yoloV5
wandb:  View run at https://wandb.ai/cayush/yoloV5/runs/1m5xl3kf
wandb: Run data is saved locally in /usr/src/app/wandb/run-20211102_113152-1m5xl3kf
wandb: Run `wandb offline` to turn off syncing.
100% 6.66M/6.66M [00:00<00:00, 72.5MB/s]
100% 14.0M/14.0M [00:00<00:00, 89.0MB/s]

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]              
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  2    115712  models.common.C3                        [128, 128, 2]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  3    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]                 
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]                 
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 270 layers, 7235389 parameters, 7235389 gradients, 16.5 GFLOPs

Transferred 349/349 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight, 60 weight (no decay), 60 bias
DP not recommended, instead use torch.distributed.run for best DDP Multi-GPU results.
See Multi-GPU Tutorial at https://docs.ultralytics.com/yolov5/tutorials/multi_gpu_training to get started.

WARNING: Dataset not found, nonexistent paths: ['/usr/src/datasets/coco128/images/train2017']
Downloading https://github.com/ultralytics/yolov5/releases/download/v1.0/coco128.zip to coco128.zip...
Dataset autodownload success, saved to ../datasets

Downloading https://github.com/ultralytics/yolov5/releases/download/v6.0/yolov5s.pt to yolov5s.pt...

train: Scanning '../datasets/coco128/labels/train2017' images and labels...128 found, 0 missing, 2 empty, 0 corrupted: 100% 128/128 [00:00<00:00, 5023.82it/s]
train: New cache created: ../datasets/coco128/labels/train2017.cache
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100% 128/128 [00:00<?, ?it/s]
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp
Starting training for 2 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       0/1     2.33G   0.04581   0.06708   0.02386       226       640: 100% 8/8 [00:06<00:00,  1.17it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100% 4/4 [00:03<00:00,  1.30it/s]
                 all        128        929      0.682       0.54      0.624      0.415

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       1/1     4.02G   0.04509   0.07335   0.02124       223       640: 100% 8/8 [00:01<00:00,  7.09it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100% 4/4 [00:02<00:00,  1.35it/s]
                 all        128        929      0.695      0.544       0.63      0.417

2 epochs completed in 0.005 hours.

Validating runs/train/exp/weights/best.pt...
Fusing layers... 
Model Summary: 213 layers, 7225885 parameters, 0 gradients, 16.5 GFLOPs
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100% 4/4 [00:05<00:00,  1.27s/it]
                 all        128        929      0.695      0.543      0.629      0.417
              person        128        254      0.815      0.669      0.771      0.507
             bicycle        128          6      0.764      0.545      0.609      0.326
                 car        128         46      0.783      0.348      0.477      0.217
          motorcycle        128          5      0.675        0.6      0.762      0.569
            airplane        128          6          1      0.795      0.995      0.749
                 bus        128          7       0.63      0.714      0.715      0.605
               train        128          3      0.739          1      0.995      0.665
               truck        128         12      0.641      0.333      0.454       0.22
                boat        128          6      0.796      0.333      0.455      0.126
       traffic light        128         14      0.544      0.143      0.237      0.158
           stop sign        128          2       0.63        0.5      0.828      0.713
               bench        128          9      0.999      0.444      0.577      0.234
                bird        128         16      0.841          1      0.985      0.652
                 cat        128          4      0.892       0.75      0.836      0.691
                 dog        128          9      0.858      0.667      0.858      0.541
               horse        128          2      0.688          1      0.995      0.697
            elephant        128         17      0.957      0.882       0.94      0.685
                bear        128          1      0.575          1      0.995      0.995
               zebra        128          4      0.854          1      0.995      0.921
             giraffe        128          9      0.803      0.778      0.912      0.573
            backpack        128          6          1      0.314      0.479      0.204
            umbrella        128         18      0.733      0.556      0.723      0.405
             handbag        128         19      0.612      0.105      0.163      0.111
                 tie        128          7       0.79      0.571      0.701      0.436
            suitcase        128          4          1      0.876      0.995      0.621
             frisbee        128          5      0.624        0.8      0.798      0.723
                skis        128          1      0.608          1      0.995      0.497
           snowboard        128          7      0.957      0.714      0.764      0.557
         sports ball        128          6      0.682        0.5      0.576       0.32
                kite        128         10      0.634      0.522      0.574      0.222
        baseball bat        128          4      0.456      0.434      0.303      0.127
      baseball glove        128          7      0.372      0.429      0.327      0.176
          skateboard        128          5      0.705      0.492      0.734      0.544
       tennis racket        128          7      0.558      0.571      0.537      0.297
              bottle        128         18      0.657      0.426      0.488      0.289
          wine glass        128         16      0.684      0.812       0.79      0.379
                 cup        128         36       0.82      0.333      0.492      0.317
                fork        128          6      0.374      0.167      0.245      0.193
               knife        128         16      0.825      0.625      0.654      0.438
               spoon        128         22      0.832      0.364      0.551      0.276
                bowl        128         28      0.751      0.539      0.636      0.463
              banana        128          1          0          0      0.142     0.0284
            sandwich        128          2          0          0     0.0957      0.072
              orange        128          4          1          0       0.62      0.287
            broccoli        128         11      0.379      0.182      0.287      0.247
              carrot        128         24      0.696      0.478      0.611      0.361
             hot dog        128          2      0.398      0.694      0.497      0.465
               pizza        128          5      0.623          1      0.824      0.561
               donut        128         14      0.701          1      0.963      0.843
                cake        128          4      0.724          1      0.945      0.741
               chair        128         35        0.5      0.514      0.483      0.229
               couch        128          6      0.638      0.333      0.696      0.388
        potted plant        128         14      0.799      0.714      0.778      0.456
                 bed        128          3          1          0      0.641      0.245
        dining table        128         13      0.854      0.452      0.479      0.315
              toilet        128          2      0.511        0.5       0.54      0.528
                  tv        128          2      0.732          1      0.995      0.846
              laptop        128          3          1          0      0.426      0.165
               mouse        128          2          1          0     0.0277     0.0222
              remote        128          8       0.72      0.625      0.635      0.488
          cell phone        128          8       0.45      0.125      0.374      0.198
           microwave        128          3      0.428          1      0.995      0.764
                oven        128          5      0.362        0.4      0.432      0.242
                sink        128          6      0.347      0.167      0.268      0.156
        refrigerator        128          5      0.692        0.8      0.811      0.435
                book        128         29      0.686      0.152      0.293      0.131
               clock        128          9      0.831      0.778      0.885      0.571
                vase        128          2      0.241          1      0.663      0.622
            scissors        128          1          1          0     0.0243    0.00485
          teddy bear        128         21      0.864      0.381      0.618      0.341
          toothbrush        128          5          1      0.583      0.664      0.412
Plotting labels... 

autoanchor: Analyzing anchors... anchors/target = 4.27, Best Possible Recall (BPR) = 0.9935
Optimizer stripped from runs/train/exp/weights/last.pt, 14.9MB
Optimizer stripped from runs/train/exp/weights/best.pt, 14.9MB
wandb: Waiting for W&B process to finish, PID 97... (success).
wandb:                                                                                
wandb: Run history:
wandb:        metrics/mAP_0.5 ▁█
wandb:   metrics/mAP_0.5:0.95 ▁█
wandb:      metrics/precision ▁█
wandb:         metrics/recall ▁█
wandb:         train/box_loss █▁
wandb:         train/cls_loss █▁
wandb:         train/obj_loss ▁█
wandb:           val/box_loss █▁
wandb:           val/cls_loss █▁
wandb:           val/obj_loss █▁
wandb:                  x/lr0 ▁█
wandb:                  x/lr1 ▁█
wandb:                  x/lr2 █▁
wandb: 
wandb: Run summary:
wandb:        metrics/mAP_0.5 0.62979
wandb:   metrics/mAP_0.5:0.95 0.41725
wandb:      metrics/precision 0.69505
wandb:         metrics/recall 0.544
wandb:         train/box_loss 0.04509
wandb:         train/cls_loss 0.02124
wandb:         train/obj_loss 0.07335
wandb:           val/box_loss 0.04124
wandb:           val/cls_loss 0.01407
wandb:           val/obj_loss 0.03989
wandb:                  x/lr0 8e-05
wandb:                  x/lr1 8e-05
wandb:                  x/lr2 0.09858
wandb: 
wandb: Synced 6 W&B file(s), 81 media file(s), 1 artifact file(s) and 1 other file(s)
wandb: Synced visionary-galaxy-746: https://wandb.ai/cayush/yoloV5/runs/1m5xl3kf
wandb: Find logs at: ./wandb/run-20211102_113152-1m5xl3kf/logs/debug.log
wandb: 
Results saved to runs/train/exp

(base) jupyter@ac-vm2:~/yolov5$ 
Davidnet commented 3 years ago

I'm also seeing this behavior; I thought it was because I'm training on 2x A100s.

glenn-jocher commented 3 years ago

@Davidnet you should be able to train DDP on 8x A100 successfully in Docker. Can you verify that your error is reproducible with the latest Docker image and provide @AyushExel with steps to reproduce? Thanks!
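
For reference, pulling and entering the latest image looks roughly like this (a sketch; adjust --gpus and any volume mounts to your setup):

docker pull ultralytics/yolov5:latest
docker run --ipc=host --gpus all -it ultralytics/yolov5:latest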

AyushExel commented 3 years ago

@Davidnet yes, please. I'm curious to reproduce this so I can get someone to look into it ASAP. Please verify with wandb enabled and disabled. If the error is caused by wandb, it should only occur when wandb is enabled. Fixing all DDP problems is a very high priority for us.
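
For example, wandb logging can be toggled per run with the wandb CLI or the WANDB_MODE environment variable (both are standard wandb mechanisms; the launch command is a sketch):

# run once with wandb logging disabled
wandb disabled          # or: export WANDB_MODE=disabled
python -m torch.distributed.run --nproc_per_node 2 train.py --data coco128.yaml --weights yolov5s.pt --device 0,1
# re-enable and repeat to compare behaviour at destroy_process_group()
wandb enabled           # or: export WANDB_MODE=online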

github-actions[bot] commented 2 years ago

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!