Closed Yoon5 closed 2 years ago
@Yoon5 thanks for the bug report! If you uninstall wandb before training (pip uninstall wandb), or log in to wandb before training (wandb login API_KEY), does this resolve the issue for you?
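For non-interactive runs (Docker, DDP) the login prompt can possibly also be avoided through environment variables instead of an interactive login. The following is a minimal sketch, assuming wandb's WANDB_API_KEY and WANDB_MODE environment variables behave as documented; the helper name wandb_env is hypothetical and is not part of YOLOv5 or wandb.

```python
import os
from typing import Mapping, Optional


def wandb_env(base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment that should not trigger the wandb prompt.

    Hypothetical helper: when no API key is present, wandb is switched off
    via WANDB_MODE=disabled instead of letting it ask for a choice on stdin.
    """
    env = dict(os.environ if base is None else base)
    if not env.get("WANDB_API_KEY"):
        # No credentials available: disable wandb so nothing waits for input.
        env["WANDB_MODE"] = "disabled"
    return env
```

The returned mapping could be passed as env= to subprocess.run when launching the training command.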
@AyushExel I did some testing and it seems like wandb may be causing issues with DDP. I train all of my DDP models while already logged in, but if I am not logged in and am presented with the 1, 2, 3 options query, training may crash as above, or, if training completes, the process group is not destroyed and the system hangs. The steps I used to reproduce on a 2-GPU training are here. Can you try to reproduce on your end?
# Pull image
t=ultralytics/yolov5:latest && sudo docker pull $t && sudo docker run -it --ipc=host --gpus all $t
# Train 3 epochs COCO128 with DDP
python -m torch.distributed.launch --nproc_per_node 1 --master_port 2 train.py --data coco128.yaml --epochs 3
Hang looks like this, seems to occur with wandb installed and enabled:
EDIT1: Summary is here:
pip uninstall wandb: training runs correctly
wandb login API_KEY: training runs correctly
Thank you I will try)))))
I tried the lines above (# Pull image t=ultralytics/yolov5:latest && sudo docker pull $t && sudo docker run -it --ipc=host --gpus all $t
python -m torch.distributed.launch --nproc_per_node 1 --master_port 2 train.py --data coco128.yaml --epochs 3). I do not have wandb; I did not install it in my env. This is what I got afterwards:
~/Desktop/yolov5-master$ t=ultralytics/yolov5:latest && sudo docker pull $t && sudo docker run -it --ipc=host --gpus all $t
latest: Pulling from ultralytics/yolov5
Digest: sha256:eee5c66aa087376ab6b70b737b6825dcc59bc1059407a6875c8ef627e2e11f9c
Status: Image is up to date for ultralytics/yolov5:latest
docker.io/ultralytics/yolov5:latest
NVIDIA Release 21.05 (build 22595835) PyTorch Version 1.9.0a0+2ecb2c7
Container image Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
Copyright (c) 2014-2021 Facebook Inc. Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert) Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu) Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu) Copyright (c) 2011-2013 NYU (Clement Farabet) Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston) Copyright (c) 2006 Idiap Research Institute (Samy Bengio) Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz) Copyright (c) 2015 Google Inc. Copyright (c) 2015 Yangqing Jia Copyright (c) 2013-2016 The Caffe contributors All rights reserved.
NVIDIA Deep Learning Profiler (dlprof) Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
ERROR: This container was built for NVIDIA Driver Release 465.19 or later, but version 460.91.03 was detected and compatibility mode is UNAVAILABLE.
[[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]]
NOTE: MOFED driver for multi-node communication was not detected. Multi-node communication performance may be reduced.
root@1709fa266811:/usr/src/app# python -m torch.distributed.launch --nproc_per_node 1 --master_port 2 train.py --data coco128.yaml --epochs 3
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) wandb login API_KEY
wandb: WARNING Invalid choice
wandb: Enter your choice: (30 second timeout)
wandb: W&B disabled due to login timeout.
train: weights=yolov5s.pt, cfg=, data=coco128.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=3, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, patience=100, freeze=0, save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:106: UserWarning: GeForce RTX 3080 Ti with CUDA capability sm_86 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70. If you want to use the GeForce RTX 3080 Ti GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
YOLOv5 🚀 v6.0-3-g20a809d torch 1.9.1+cu102 CUDA:0 (GeForce RTX 3080 Ti, 12053.8125MB)
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for 1 nodes.
hyperparameters: lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
WARNING: Dataset not found, nonexistent paths: ['/usr/src/datasets/coco128/images/train2017']
Downloading https://github.com/ultralytics/yolov5/releases/download/v1.0/coco128.zip to coco128.zip...
100%|██████████████████████████████████████| 6.66M/6.66M [00:00<00:00, 8.76MB/s]
Dataset autodownload success, saved to ../datasets
Downloading https://github.com/ultralytics/yolov5/releases/download/v6.0/yolov5s.pt to yolov5s.pt...
100%|██████████████████████████████████████| 14.0M/14.0M [00:04<00:00, 3.64MB/s]
from n params module arguments
0 -1 1 3520 models.common.Conv [3, 32, 6, 2, 2]
1 -1 1 18560 models.common.Conv [32, 64, 3, 2]
2 -1 1 18816 models.common.C3 [64, 64, 1]
3 -1 1 73984 models.common.Conv [64, 128, 3, 2]
4 -1 2 115712 models.common.C3 [128, 128, 2]
5 -1 1 295424 models.common.Conv [128, 256, 3, 2]
6 -1 3 625152 models.common.C3 [256, 256, 3]
7 -1 1 1180672 models.common.Conv [256, 512, 3, 2]
8 -1 1 1182720 models.common.C3 [512, 512, 1]
9 -1 1 656896 models.common.SPPF [512, 512, 5]
10 -1 1 131584 models.common.Conv [512, 256, 1, 1]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 models.common.Concat [1]
13 -1 1 361984 models.common.C3 [512, 256, 1, False]
14 -1 1 33024 models.common.Conv [256, 128, 1, 1]
15 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
16 [-1, 4] 1 0 models.common.Concat [1]
17 -1 1 90880 models.common.C3 [256, 128, 1, False]
18 -1 1 147712 models.common.Conv [128, 128, 3, 2]
19 [-1, 14] 1 0 models.common.Concat [1]
20 -1 1 296448 models.common.C3 [256, 256, 1, False]
21 -1 1 590336 models.common.Conv [256, 256, 3, 2]
22 [-1, 10] 1 0 models.common.Concat [1]
23 -1 1 1182720 models.common.C3 [512, 512, 1, False]
24 [17, 20, 23] 1 229245 models.yolo.Detect [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 270 layers, 7235389 parameters, 7235389 gradients, 16.5 GFLOPs
Traceback (most recent call last):
File "train.py", line 620, in
train.py FAILED
Other Failures:
@Yoon5 before you do anything you need to update your nvidia drivers as your error message states:
ERROR: This container was built for NVIDIA Driver Release 465.19 or later, but
version 460.91.03 was detected and compatibility mode is UNAVAILABLE.
@AyushExel I'm manually pushing a new ultralytics/yolov5:latest image without wandb for now until we can sort this out, so if you pull the image after seeing this message you'll have to pip install wandb in the image to get started testing.
Thank you
@glenn-jocher I'm testing this now
@glenn-jocher The problem doesn't occur for me. I'm running on 2 T4 GPUs and the program exited fine. I've tried this 2 times.
Full trace:
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout)
wandb: W&B disabled due to login timeout.
train: weights=yolov5s.pt, cfg=, data=coco128.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=3, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, patience=100, freeze=0, save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v6.0-4-gb754525 torch 1.9.1+cu102 CUDA:0 (Tesla T4, 15109.75MB)
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for 1 nodes.
hyperparameters: lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
WARNING: Dataset not found, nonexistent paths: ['/usr/src/datasets/coco128/images/train2017']
Downloading https://github.com/ultralytics/yolov5/releases/download/v1.0/coco128.zip to coco128.zip...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 6.66M/6.66M [00:00<00:00, 123MB/s]
Dataset autodownload success, saved to ../datasets
Downloading https://github.com/ultralytics/yolov5/releases/download/v6.0/yolov5s.pt to yolov5s.pt...
100%|████████████████████████████████████████████████████████████████████████████████████████████████| 14.0M/14.0M [00:00<00:00, 31.5MB/s]
from n params module arguments
0 -1 1 3520 models.common.Conv [3, 32, 6, 2, 2]
1 -1 1 18560 models.common.Conv [32, 64, 3, 2]
2 -1 1 18816 models.common.C3 [64, 64, 1]
3 -1 1 73984 models.common.Conv [64, 128, 3, 2]
4 -1 2 115712 models.common.C3 [128, 128, 2]
5 -1 1 295424 models.common.Conv [128, 256, 3, 2]
6 -1 3 625152 models.common.C3 [256, 256, 3]
7 -1 1 1180672 models.common.Conv [256, 512, 3, 2]
8 -1 1 1182720 models.common.C3 [512, 512, 1]
9 -1 1 656896 models.common.SPPF [512, 512, 5]
10 -1 1 131584 models.common.Conv [512, 256, 1, 1]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 models.common.Concat [1]
13 -1 1 361984 models.common.C3 [512, 256, 1, False]
14 -1 1 33024 models.common.Conv [256, 128, 1, 1]
15 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
16 [-1, 4] 1 0 models.common.Concat [1]
17 -1 1 90880 models.common.C3 [256, 128, 1, False]
18 -1 1 147712 models.common.Conv [128, 128, 3, 2]
19 [-1, 14] 1 0 models.common.Concat [1]
20 -1 1 296448 models.common.C3 [256, 256, 1, False]
21 -1 1 590336 models.common.Conv [256, 256, 3, 2]
22 [-1, 10] 1 0 models.common.Concat [1]
23 -1 1 1182720 models.common.C3 [512, 512, 1, False]
24 [17, 20, 23] 1 229245 models.yolo.Detect [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 270 layers, 7235389 parameters, 7235389 gradients, 16.5 GFLOPs
Transferred 349/349 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight, 60 weight (no decay), 60 bias
train: Scanning '../datasets/coco128/labels/train2017' images and labels...128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00
train: New cache created: ../datasets/coco128/labels/train2017.cache
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/12
Plotting labels...
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/12
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/12
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/12
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/12
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/12
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/12
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/12
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/12
autoanchor: Analyzing anchors... anchors/target = 4.26, Best Possible Recall (BPR) = 0.9946
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp
Starting training for 3 epochs...
Epoch gpu_mem box obj cls labels img_size
0/2 2.53G 0.04048 0.07431 0.02063 210 640: 12%|███▉ | 1/8 [00:06<00:44, 6.40s/it]Reducer buckets have been rebuilt in this iteration.
0/2 6.65G 0.04252 0.06144 0.02085 232 640: 100%|███████████████████████████████| 8/8 [00:08<00:00, 1.02s/it]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|███████████████| 4/4 [00:09<00:00, 2.44s/it]
all 128 929 0.671 0.533 0.621 0.407
Epoch gpu_mem box obj cls labels img_size
1/2 7.05G 0.04501 0.06472 0.0191 180 640: 100%|███████████████████████████████| 8/8 [00:02<00:00, 4.00it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|███████████████| 4/4 [00:03<00:00, 1.22it/s]
all 128 929 0.697 0.536 0.631 0.416
Epoch gpu_mem box obj cls labels img_size
2/2 7.05G 0.04566 0.06474 0.02024 305 640: 100%|███████████████████████████████| 8/8 [00:01<00:00, 4.31it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|███████████████| 4/4 [00:03<00:00, 1.20it/s]
all 128 929 0.704 0.547 0.633 0.418
3 epochs completed in 0.009 hours.
Optimizer stripped from runs/train/exp/weights/last.pt, 14.9MB
Optimizer stripped from runs/train/exp/weights/best.pt, 14.9MB
Validating runs/train/exp/weights/best.pt...
Fusing layers...
Model Summary: 213 layers, 7225885 parameters, 0 gradients, 16.5 GFLOPs
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|███████████████| 4/4 [00:05<00:00, 1.28s/it]
all 128 929 0.704 0.547 0.634 0.419
person 128 254 0.807 0.681 0.774 0.508
bicycle 128 6 0.748 0.496 0.545 0.318
car 128 46 0.801 0.351 0.484 0.212
motorcycle 128 5 0.672 0.6 0.8 0.633
airplane 128 6 1 0.831 0.995 0.721
bus 128 7 0.648 0.714 0.694 0.59
train 128 3 0.809 1 0.995 0.632
truck 128 12 0.605 0.25 0.446 0.251
boat 128 6 0.735 0.333 0.489 0.141
traffic light 128 14 0.587 0.143 0.238 0.149
stop sign 128 2 0.627 0.5 0.828 0.663
bench 128 9 0.908 0.444 0.57 0.243
bird 128 16 0.888 0.991 0.988 0.652
cat 128 4 1 0.729 0.836 0.691
dog 128 9 0.916 0.667 0.887 0.547
horse 128 2 0.705 1 0.995 0.697
elephant 128 17 1 0.927 0.946 0.696
bear 128 1 0.52 1 0.995 0.995
zebra 128 4 0.849 1 0.995 0.952
giraffe 128 9 0.813 0.778 0.869 0.576
backpack 128 6 1 0.307 0.452 0.201
umbrella 128 18 0.724 0.556 0.722 0.394
handbag 128 19 0.662 0.104 0.167 0.11
tie 128 7 0.895 0.571 0.693 0.466
suitcase 128 4 1 0.992 0.995 0.621
frisbee 128 5 0.652 0.8 0.798 0.694
skis 128 1 0.616 1 0.995 0.497
snowboard 128 7 1 0.705 0.766 0.558
sports ball 128 6 0.659 0.5 0.622 0.341
kite 128 10 0.557 0.5 0.557 0.204
baseball bat 128 4 0.391 0.5 0.275 0.136
baseball glove 128 7 0.474 0.391 0.327 0.197
skateboard 128 5 0.754 0.614 0.792 0.557
tennis racket 128 7 0.536 0.571 0.538 0.299
bottle 128 18 0.649 0.389 0.484 0.286
wine glass 128 16 0.771 0.875 0.853 0.397
cup 128 36 0.852 0.361 0.493 0.294
fork 128 6 0.378 0.167 0.252 0.194
knife 128 16 0.888 0.625 0.667 0.449
spoon 128 22 0.811 0.391 0.531 0.257
bowl 128 28 0.75 0.571 0.617 0.461
banana 128 1 0 0 0.142 0.0142
sandwich 128 2 0 0 0.0957 0.0743
orange 128 4 1 0 0.578 0.199
broccoli 128 11 0.418 0.182 0.314 0.273
carrot 128 24 0.716 0.542 0.636 0.383
hot dog 128 2 0.4 0.699 0.497 0.465
pizza 128 5 0.629 1 0.831 0.603
donut 128 14 0.692 1 0.952 0.823
cake 128 4 0.73 1 0.895 0.713
chair 128 35 0.458 0.486 0.476 0.232
couch 128 6 0.723 0.333 0.801 0.453
potted plant 128 14 0.791 0.714 0.806 0.447
bed 128 3 1 0 0.746 0.275
dining table 128 13 0.83 0.462 0.476 0.299
toilet 128 2 0.456 0.5 0.566 0.496
tv 128 2 0.752 1 0.995 0.846
laptop 128 3 1 0 0.426 0.185
mouse 128 2 1 0 0.0268 0.0215
remote 128 8 0.71 0.625 0.635 0.506
cell phone 128 8 0.599 0.199 0.429 0.202
microwave 128 3 0.402 1 0.995 0.743
oven 128 5 0.364 0.4 0.427 0.248
sink 128 6 0.344 0.167 0.265 0.161
refrigerator 128 5 0.704 0.8 0.814 0.456
book 128 29 0.601 0.138 0.294 0.133
clock 128 9 0.87 0.778 0.91 0.599
vase 128 2 0.286 1 0.663 0.597
scissors 128 1 1 0 0.0302 0.00603
teddy bear 128 21 0.85 0.381 0.613 0.344
toothbrush 128 5 1 0.477 0.708 0.438
Results saved to runs/train/exp
root@903423287a25:/usr/src/app#
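The FutureWarning at the top of the trace above recommends reading the local rank from os.environ['LOCAL_RANK'] rather than from a --local_rank argument. A minimal sketch of that pattern follows; the function name resolve_local_rank and the -1 fallback for "not running under DDP" are assumptions made for illustration, not code taken from train.py.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence


def resolve_local_rank(argv: Optional[Sequence[str]] = None,
                       environ: Optional[Mapping[str, str]] = None) -> int:
    """Return the local rank, preferring LOCAL_RANK from the environment.

    Falls back to a --local_rank argument (the older torch.distributed.launch
    convention) and finally to -1, used here to mean "not running under DDP".
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--local_rank", type=int, default=-1)
    # Ignore every other training argument; only --local_rank matters here.
    args, _unknown = parser.parse_known_args(argv)
    return int(environ.get("LOCAL_RANK", args.local_rank))
```

Preferring the environment variable keeps the script working under both the deprecated launcher and torch.distributed.run.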
@AyushExel oh interesting. Can you try again and enter wandb: (3) Don't visualize my results?
@AyushExel also I just noticed in your output that your training is only using 1 GPU. When you use multiple devices they will be listed together. Ah sorry, I see my command to reproduce above was incorrect. This is the correct 2-GPU training command:
python -m torch.distributed.launch --nproc_per_node 2 --master_port 1 train.py --data coco128.yaml --epochs 3 --device 0,1
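The mismatch in the earlier command (one process requested for what was meant to be a 2-GPU run) is easy to make by hand. A small sketch of deriving --nproc_per_node from the device list, so the two cannot drift apart, is shown here; the helper ddp_launch_command is hypothetical and uses only the flags that appear in the command above.

```python
from typing import List


def ddp_launch_command(devices: str, master_port: int,
                       data: str = "coco128.yaml", epochs: int = 3) -> List[str]:
    """Build the launch command so --nproc_per_node always matches --device."""
    device_ids = [d.strip() for d in devices.split(",") if d.strip()]
    if not device_ids:
        raise ValueError("at least one device id is required")
    return [
        "python", "-m", "torch.distributed.launch",
        "--nproc_per_node", str(len(device_ids)),  # one process per listed GPU
        "--master_port", str(master_port),
        "train.py", "--data", data, "--epochs", str(epochs),
        "--device", ",".join(device_ids),
    ]
```

Joining the result of ddp_launch_command("0,1", 1) with spaces gives the 2-GPU command above.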
@glenn-jocher thanks. I tried it again. It's not getting stuck. Here's the traceback:
root@6f70428df512:/usr/src/app# python -m torch.distributed.launch --nproc_per_node 2 --master_port 1 train.py --data coco128.yaml --epochs 3 --device 0,1
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) wandb: Enter your choice: (30 second timeout)
wandb: W&B disabled due to login timeout.
train: weights=yolov5s.pt, cfg=, data=coco128.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=3, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, patience=100, freeze=0, save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v6.0-4-gb754525 torch 1.9.1+cu102 CUDA:0 (Tesla T4, 15109.75MB)
CUDA:1 (Tesla T4, 15109.75MB)
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for 2 nodes.
hyperparameters: lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
WARNING: Dataset not found, nonexistent paths: ['/usr/src/datasets/coco128/images/train2017']
Downloading https://github.com/ultralytics/yolov5/releases/download/v1.0/coco128.zip to coco128.zip...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6.66M/6.66M [00:00<00:00, 92.7MB/s]
Dataset autodownload success, saved to ../datasets
Downloading https://github.com/ultralytics/yolov5/releases/download/v6.0/yolov5s.pt to yolov5s.pt...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14.0M/14.0M [00:00<00:00, 76.4MB/s]
from n params module arguments
0 -1 1 3520 models.common.Conv [3, 32, 6, 2, 2]
1 -1 1 18560 models.common.Conv [32, 64, 3, 2]
2 -1 1 18816 models.common.C3 [64, 64, 1]
3 -1 1 73984 models.common.Conv [64, 128, 3, 2]
4 -1 2 115712 models.common.C3 [128, 128, 2]
5 -1 1 295424 models.common.Conv [128, 256, 3, 2]
6 -1 3 625152 models.common.C3 [256, 256, 3]
7 -1 1 1180672 models.common.Conv [256, 512, 3, 2]
8 -1 1 1182720 models.common.C3 [512, 512, 1]
9 -1 1 656896 models.common.SPPF [512, 512, 5]
10 -1 1 131584 models.common.Conv [512, 256, 1, 1]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 models.common.Concat [1]
13 -1 1 361984 models.common.C3 [512, 256, 1, False]
14 -1 1 33024 models.common.Conv [256, 128, 1, 1]
15 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
16 [-1, 4] 1 0 models.common.Concat [1]
17 -1 1 90880 models.common.C3 [256, 128, 1, False]
18 -1 1 147712 models.common.Conv [128, 128, 3, 2]
19 [-1, 14] 1 0 models.common.Concat [1]
20 -1 1 296448 models.common.C3 [256, 256, 1, False]
21 -1 1 590336 models.common.Conv [256, 256, 3, 2]
22 [-1, 10] 1 0 models.common.Concat [1]
23 -1 1 1182720 models.common.C3 [512, 512, 1, False]
24 [17, 20, 23] 1 229245 models.yolo.Detect [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 270 layers, 7235389 parameters, 7235389 gradients, 16.5 GFLOPs
Transferred 349/349 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight, 60 weight (no decay), 60 bias
train: Scanning '../datasets/coco128/labels/train2017' images and labels...128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00:00<00:00, 738.
train: New cache created: ../datasets/coco128/labels/train2017.cache
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00:00<?, ?
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00:00<?, ?it
Plotting labels...
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00:00<?, ?it
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00:00<?, ?it
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00:00<?, ?it
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00:00<?, ?it
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00:00<?, ?it
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00:00<?, ?it
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00:00<?, ?it
autoanchor: Analyzing anchors... anchors/target = 4.26, Best Possible Recall (BPR) = 0.9946
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp
Starting training for 3 epochs...
Epoch gpu_mem box obj cls labels img_size
0/2 1.89G 0.04361 0.07621 0.02057 131 640: 12%|█████▊ | 1/8 [00:06<00:43, 6.21s/it]Reducer buckets have been rebuilt in this iteration.
0/2 6.28G 0.04354 0.06285 0.02263 95 640: 100%|██████████████████████████████████████████████| 8/8 [00:07<00:00, 1.07it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████████████████████████| 8/8 [00:08<00:00, 1.09s/it]
all 128 929 0.679 0.535 0.621 0.407
Epoch gpu_mem box obj cls labels img_size
1/2 6.43G 0.04465 0.06801 0.02394 117 640: 100%|██████████████████████████████████████████████| 8/8 [00:01<00:00, 5.84it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████████████████████████| 8/8 [00:03<00:00, 2.42it/s]
all 128 929 0.681 0.548 0.632 0.411
Epoch gpu_mem box obj cls labels img_size
2/2 6.43G 0.04337 0.07029 0.02062 91 640: 100%|██████████████████████████████████████████████| 8/8 [00:01<00:00, 6.63it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|█| 128/128 [00:29<?, ?it
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████████████████████████| 8/8 [00:03<00:00, 2.44it/s]
all 128 929 0.617 0.598 0.634 0.416
3 epochs completed in 0.008 hours.
Optimizer stripped from runs/train/exp/weights/last.pt, 14.8MB
Optimizer stripped from runs/train/exp/weights/best.pt, 14.8MB
Validating runs/train/exp/weights/best.pt...
Fusing layers...
Model Summary: 213 layers, 7225885 parameters, 0 gradients, 16.5 GFLOPs
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████████████████████████| 8/8 [00:04<00:00, 1.69it/s]
all 128 929 0.616 0.598 0.633 0.416
person 128 254 0.722 0.74 0.775 0.508
bicycle 128 6 0.545 0.667 0.561 0.326
car 128 46 0.598 0.37 0.462 0.2
motorcycle 128 5 0.66 0.6 0.812 0.637
airplane 128 6 1 0.948 0.995 0.764
bus 128 7 0.535 0.714 0.71 0.605
train 128 3 0.696 1 0.995 0.632
truck 128 12 0.432 0.333 0.474 0.288
boat 128 6 0.404 0.333 0.464 0.14
traffic light 128 14 0.523 0.159 0.243 0.161
stop sign 128 2 0.576 0.5 0.828 0.663
bench 128 9 0.704 0.444 0.572 0.237
bird 128 16 0.8 1 0.988 0.647
cat 128 4 0.941 0.75 0.828 0.714
dog 128 9 0.761 0.667 0.852 0.527
horse 128 2 0.58 1 0.995 0.697
elephant 128 17 0.942 0.882 0.943 0.693
bear 128 1 0.386 1 0.995 0.995
zebra 128 4 0.827 1 0.995 0.908
giraffe 128 9 0.741 0.889 0.851 0.597
backpack 128 6 0.661 0.333 0.496 0.211
umbrella 128 18 0.615 0.611 0.731 0.395
handbag 128 19 0.514 0.105 0.18 0.112
tie 128 7 0.606 0.571 0.683 0.463
suitcase 128 4 0.709 1 0.995 0.54
frisbee 128 5 0.54 0.8 0.798 0.705
skis 128 1 0.516 1 0.995 0.497
snowboard 128 7 0.868 0.714 0.767 0.555
sports ball 128 6 0.531 0.5 0.581 0.325
kite 128 10 0.584 0.563 0.564 0.208
baseball bat 128 4 0.411 0.5 0.283 0.0876
baseball glove 128 7 0.339 0.429 0.366 0.222
skateboard 128 5 0.795 0.779 0.735 0.5
tennis racket 128 7 0.442 0.571 0.551 0.314
bottle 128 18 0.461 0.5 0.476 0.289
wine glass 128 16 0.585 0.795 0.74 0.386
cup 128 36 0.822 0.361 0.5 0.319
fork 128 6 0.571 0.237 0.341 0.226
knife 128 16 0.523 0.625 0.674 0.452
spoon 128 22 0.602 0.5 0.532 0.26
bowl 128 28 0.668 0.571 0.63 0.448
banana 128 1 0.147 1 0.166 0.0498
sandwich 128 2 0 0 0.133 0.105
orange 128 4 1 0 0.545 0.151
broccoli 128 11 0.298 0.31 0.236 0.205
carrot 128 24 0.481 0.583 0.63 0.425
hot dog 128 2 0.462 1 0.497 0.497
pizza 128 5 0.599 1 0.824 0.566
donut 128 14 0.675 1 0.946 0.848
cake 128 4 0.698 1 0.895 0.704
chair 128 35 0.408 0.543 0.46 0.221
couch 128 6 1 0.481 0.829 0.504
potted plant 128 14 0.796 0.786 0.82 0.467
bed 128 3 0.992 0.333 0.753 0.269
dining table 128 13 0.571 0.462 0.438 0.242
toilet 128 2 0.388 0.5 0.554 0.487
tv 128 2 0.672 1 0.995 0.846
laptop 128 3 1 0 0.415 0.193
mouse 128 2 1 0 0.0375 0.03
remote 128 8 0.596 0.625 0.636 0.506
cell phone 128 8 0.579 0.375 0.389 0.182
microwave 128 3 0.343 1 0.995 0.786
oven 128 5 0.301 0.4 0.432 0.249
sink 128 6 0.338 0.167 0.294 0.168
refrigerator 128 5 0.69 0.8 0.815 0.506
book 128 29 0.5 0.207 0.295 0.125
clock 128 9 0.787 0.778 0.898 0.589
vase 128 2 0.181 1 0.663 0.597
scissors 128 1 1 0 0.0332 0.00663
teddy bear 128 21 0.814 0.418 0.608 0.35
toothbrush 128 5 0.704 0.6 0.739 0.191
Results saved to runs/train/exp
Destroying process group...
root@6f70428df512:/usr/src/app#
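The "Destroying process group..." line above is the teardown step that was reported missing when the run hangs. One general way to make sure teardown is at least attempted is a try/finally wrapper, sketched below. This is an illustration only: the context manager process_group is hypothetical, it takes the setup and teardown functions as parameters so it runs without PyTorch, and it would not by itself resolve a hang that originates elsewhere.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def process_group(init_fn: Callable[[], None],
                  destroy_fn: Callable[[], None]) -> Iterator[None]:
    """Run a block between init_fn() and destroy_fn(), always calling destroy_fn.

    Assumption: in a real DDP script init_fn and destroy_fn would wrap
    torch.distributed.init_process_group and
    torch.distributed.destroy_process_group.
    """
    init_fn()
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions, so teardown is never skipped.
        destroy_fn()
```

Training code would then run inside a with process_group(...) block.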
@glenn-jocher Ok I was able to reproduce. It occurs when manually choosing option 3. I think I know the source of the problem. I'll push a fix.
@glenn-jocher ok I found the root cause of the problem. The import checks are happening in loggers/__init__.py, which makes the checks in wandb_utils.py redundant. I've moved the checks to __init__.py now. The PR should fix the problem.
Also, it'd be nice to catch these problems early on during CI checks but it's mostly limited because there's no backdoor to stop/resume runs during tests. I think setting up a revamped testing suite to test DDP, integrations etc. might be worth it!
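The idea of checking the optional wandb import in a single place can be sketched as follows. This is not the code from the PR; the helper optional_import is hypothetical and only illustrates the general pattern of resolving an optional dependency once and sharing the result.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Import an optional dependency once; return None if it is unusable."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    # A local directory that merely shares the package's name can import as a
    # namespace package without __version__; treat that as "not installed".
    if not hasattr(module, "__version__"):
        return None
    return module
```

A logger package could call optional_import("wandb") once in its __init__.py and let other modules reuse that value instead of repeating the check.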
@Yoon5 good news 😃! Your original issue may now be fixed ✅ in PR #5163 by @AyushExel. To receive this update:
git pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
sudo docker pull ultralytics/yolov5:latest to update your image
Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
I have the same problem using Docker. However, I run the container using the env variable WANDB_API_KEY, which does not require wandb login. Using this method, the training hangs at the end.
@Zegorax @AyushExel this issue should be resolved in #5163, so please ensure you are using the very latest code or Docker image. To pull the latest docker image use sudo docker pull ultralytics/yolov5:latest. If you are still experiencing an issue with the very latest image please let us know and we will reopen, thanks!
@glenn-jocher I'm using the latest version of the code, the problem is still present even with fix #5163
@Zegorax are you reaching the end of training? I tried to reproduce this in the latest version of the repo and I'm getting this error:
Traceback (most recent call last):
File "train.py", line 627, in <module>
main(opt)
File "train.py", line 524, in main
train(opt.hyp, opt, device, callbacks)
File "train.py", line 249, in train
nl = model.model[-1].nl # number of detection layers (to scale hyps)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'model'
Traceback (most recent call last):
File "train.py", line 627, in <module>
main(opt)
File "train.py", line 524, in main
train(opt.hyp, opt, device, callbacks)
File "train.py", line 249, in train
nl = model.model[-1].nl # number of detection layers (to scale hyps)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'model'
This seems unrelated to W&B, as I've disabled it.
@AyushExel the nl error is probably caused by my recent autobatch PR https://github.com/ultralytics/yolov5/pull/5092. I will investigate.
EDIT: this comes back to our general lack of DDP CI. It's an open issue, still don't have a solution for this.
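The AttributeError above arises because a DistributedDataParallel wrapper stores the real network under its .module attribute, so model.model no longer resolves on the wrapper. A minimal sketch of the usual unwrap pattern follows; it is not necessarily the change made in the eventual fix. Wrapper and Net are hypothetical stand-ins so the example runs without PyTorch, and unwrap is an illustrative name.

```python
class Net:
    """Stand-in model exposing the `model` attribute that train.py reads."""

    def __init__(self):
        self.model = ["backbone", "neck", "detect"]


class Wrapper:
    """Stand-in for a DDP-style wrapper that keeps the real model in `.module`."""

    def __init__(self, module):
        self.module = module


def unwrap(model):
    """Return the wrapped model if `model` has a `.module`, else `model` itself."""
    return getattr(model, "module", model)


plain = Net()
wrapped = Wrapper(plain)
last_layer = unwrap(wrapped).model[-1]  # works for wrapped and plain models alike
```

One caveat of this duck-typed check: a model that legitimately defines its own module attribute would be unwrapped by mistake, so an isinstance test against the wrapper classes is the stricter choice when PyTorch is available.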
@glenn-jocher @AyushExel It's the exact same problem as the original issue. The training finishes, but the process hangs and never returns
@Zegorax you're probably on an older version of the repo, because the latest version has another bug which won't let the training start. Try running git pull inside your yolov5 directory.
@glenn-jocher sure. Let me know once the issue is fixed and I'll try to confirm if the wandb issue still exists
As I said earlier, no I'm not. I'm using the latest YOLOv5 on master branch.
@AyushExel the nl bug is fixed in #5332. Verified with:
python -m torch.distributed.run --nproc_per_node 2 --master_port 1 train.py --epochs 3 --device 0,1
Please wait 15 min for Docker Autobuild to complete and deploy this latest merge, then update your Docker image with
t=ultralytics/yolov5:latest && sudo docker pull $t && sudo docker run -it --ipc=host --gpus all $t
@Zegorax @glenn-jocher I just tested using the latest master branch. I can run DDP with wandb disabled without any hang.
I'll test it on docker image once that is available too.
Output for command - python -m torch.distributed.launch --nproc_per_node 2 train.py --data coco128.yaml --epochs 3 --device 0,1
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) 3
wandb: You chose 'Don't visualize my results'
train: weights=yolov5s.pt, cfg=, data=coco128.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=3, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, patience=100, freeze=0, save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v6.0-35-ga4fece8 torch 1.9.0+cu102 CUDA:0 (Tesla V100-SXM2-16GB, 16160.5MB)
CUDA:1 (Tesla V100-SXM2-16GB, 16160.5MB)
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for 2 nodes.
hyperparameters: lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs (RECOMMENDED)
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
2021-10-25 14:05:44.253930: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
from n params module arguments
0 -1 1 3520 models.common.Conv [3, 32, 6, 2, 2]
1 -1 1 18560 models.common.Conv [32, 64, 3, 2]
2 -1 1 18816 models.common.C3 [64, 64, 1]
3 -1 1 73984 models.common.Conv [64, 128, 3, 2]
4 -1 2 115712 models.common.C3 [128, 128, 2]
5 -1 1 295424 models.common.Conv [128, 256, 3, 2]
6 -1 3 625152 models.common.C3 [256, 256, 3]
7 -1 1 1180672 models.common.Conv [256, 512, 3, 2]
8 -1 1 1182720 models.common.C3 [512, 512, 1]
9 -1 1 656896 models.common.SPPF [512, 512, 5]
10 -1 1 131584 models.common.Conv [512, 256, 1, 1]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 models.common.Concat [1]
13 -1 1 361984 models.common.C3 [512, 256, 1, False]
14 -1 1 33024 models.common.Conv [256, 128, 1, 1]
15 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
16 [-1, 4] 1 0 models.common.Concat [1]
17 -1 1 90880 models.common.C3 [256, 128, 1, False]
18 -1 1 147712 models.common.Conv [128, 128, 3, 2]
19 [-1, 14] 1 0 models.common.Concat [1]
20 -1 1 296448 models.common.C3 [256, 256, 1, False]
21 -1 1 590336 models.common.Conv [256, 256, 3, 2]
22 [-1, 10] 1 0 models.common.Concat [1]
23 -1 1 1182720 models.common.C3 [512, 512, 1, False]
24 [17, 20, 23] 1 229245 models.yolo.Detect [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 270 layers, 7235389 parameters, 7235389 gradients
Transferred 349/349 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight, 60 weight (no decay), 60 bias
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s][W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
Plotting labels...
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
autoanchor: Analyzing anchors... anchors/target = 4.26, Best Possible Recall (BPR) = 0.9946
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp2
Starting training for 3 epochs...
Epoch gpu_mem box obj cls labels img_size
0/2 1.87G 0.04361 0.07627 0.02057 131 640: 12%|███████████████ | 1/8 [00:04<00:32, 4.66s/it]Reducer buckets have been rebuilt in this iteration.
0/2 6.27G 0.04354 0.06284 0.02263 95 640: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:05<00:00, 1.44it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:03<00:00, 2.02it/s]
all 128 929 0.679 0.536 0.623 0.407
Epoch gpu_mem box obj cls labels img_size
1/2 6.3G 0.04465 0.06803 0.02394 117 640: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 9.52it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 12.00it/s]
all 128 929 0.683 0.549 0.632 0.412
Epoch gpu_mem box obj cls labels img_size
2/2 6.3G 0.04337 0.07029 0.02061 91 640: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 9.46it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 11.84it/s]
all 128 929 0.616 0.599 0.634 0.415
3 epochs completed in 0.004 hours.
Optimizer stripped from runs/train/exp2/weights/last.pt, 14.8MB
Optimizer stripped from runs/train/exp2/weights/best.pt, 14.8MB
Validating runs/train/exp2/weights/best.pt...
Fusing layers...
Model Summary: 213 layers, 7225885 parameters, 0 gradients
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00, 5.33it/s]
all 128 929 0.616 0.599 0.634 0.416
person 128 254 0.715 0.74 0.774 0.508
bicycle 128 6 0.542 0.667 0.561 0.326
car 128 46 0.598 0.37 0.461 0.201
motorcycle 128 5 0.66 0.6 0.812 0.611
airplane 128 6 1 0.949 0.995 0.764
bus 128 7 0.536 0.714 0.71 0.605
train 128 3 0.695 1 0.995 0.632
truck 128 12 0.415 0.333 0.473 0.288
boat 128 6 0.402 0.333 0.465 0.14
traffic light 128 14 0.525 0.161 0.243 0.161
stop sign 128 2 0.575 0.5 0.828 0.663
bench 128 9 0.704 0.444 0.572 0.237
bird 128 16 0.8 1 0.988 0.647
cat 128 4 0.938 0.75 0.828 0.714
dog 128 9 0.76 0.667 0.852 0.527
horse 128 2 0.577 1 0.995 0.697
elephant 128 17 0.941 0.882 0.943 0.693
bear 128 1 0.386 1 0.995 0.995
zebra 128 4 0.827 1 0.995 0.908
giraffe 128 9 0.74 0.889 0.851 0.613
backpack 128 6 0.656 0.333 0.496 0.214
umbrella 128 18 0.613 0.611 0.731 0.395
handbag 128 19 0.513 0.105 0.179 0.112
tie 128 7 0.605 0.571 0.687 0.461
suitcase 128 4 0.709 1 0.995 0.54
frisbee 128 5 0.539 0.8 0.798 0.705
skis 128 1 0.515 1 0.995 0.497
snowboard 128 7 0.868 0.714 0.767 0.554
sports ball 128 6 0.528 0.5 0.581 0.325
kite 128 10 0.587 0.57 0.564 0.208
baseball bat 128 4 0.409 0.5 0.282 0.087
baseball glove 128 7 0.339 0.429 0.366 0.222
skateboard 128 5 0.794 0.777 0.735 0.5
tennis racket 128 7 0.442 0.571 0.571 0.324
bottle 128 18 0.46 0.5 0.477 0.29
wine glass 128 16 0.604 0.859 0.78 0.404
cup 128 36 0.823 0.361 0.504 0.32
fork 128 6 0.573 0.239 0.341 0.226
knife 128 16 0.521 0.625 0.674 0.452
spoon 128 22 0.598 0.5 0.532 0.26
bowl 128 28 0.654 0.571 0.63 0.448
banana 128 1 0.146 1 0.166 0.0498
sandwich 128 2 0 0 0.133 0.105
orange 128 4 1 0 0.545 0.151
broccoli 128 11 0.299 0.311 0.236 0.204
carrot 128 24 0.479 0.583 0.631 0.426
hot dog 128 2 0.463 1 0.497 0.497
pizza 128 5 0.599 1 0.824 0.566
donut 128 14 0.675 1 0.946 0.85
cake 128 4 0.697 1 0.895 0.692
chair 128 35 0.407 0.543 0.46 0.221
couch 128 6 1 0.482 0.829 0.504
potted plant 128 14 0.796 0.786 0.819 0.46
bed 128 3 0.982 0.333 0.753 0.269
dining table 128 13 0.569 0.462 0.438 0.242
toilet 128 2 0.387 0.5 0.557 0.49
tv 128 2 0.672 1 0.995 0.846
laptop 128 3 1 0 0.415 0.193
mouse 128 2 1 0 0.0357 0.0285
remote 128 8 0.596 0.625 0.636 0.506
cell phone 128 8 0.578 0.375 0.421 0.195
microwave 128 3 0.343 1 0.995 0.786
oven 128 5 0.3 0.4 0.432 0.249
sink 128 6 0.338 0.167 0.29 0.167
refrigerator 128 5 0.69 0.8 0.815 0.506
book 128 29 0.535 0.239 0.296 0.125
clock 128 9 0.787 0.778 0.895 0.584
vase 128 2 0.18 1 0.663 0.597
scissors 128 1 1 0 0.0332 0.00663
teddy bear 128 21 0.814 0.419 0.608 0.349
toothbrush 128 5 0.701 0.6 0.739 0.194
Results saved to runs/train/exp2
Destroying process group...
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py:71: FutureWarning: This is an experimental API and will be changed in future.
"This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0004150867462158203 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "6946", "role": "default", "hostname": "ac-vm2.c.playground-111.internal", "state": "SUCCEEDED", "total_run_time": 45, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [2]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "6947", "role": "default", "hostname": "ac-vm2.c.playground-111.internal", "state": "SUCCEEDED", "total_run_time": 45, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [1], \"role_rank\": [1], \"role_world_size\": [2]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "ac-vm2.c.playground-111.internal", "state": "SUCCEEDED", "total_run_time": 45, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}}
(base) jupyter@ac-vm2:~/yolov5$
@AyushExel Can you check again using Docker and the env variable I mentioned earlier?
@Zegorax just tested the latest docker image
command - python -m torch.distributed.launch --nproc_per_node 1 train.py --data coco128.yaml --epochs 3
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) 3
wandb: You chose 'Don't visualize my results'
train: weights=yolov5s.pt, cfg=, data=coco128.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=3, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, patience=100, freeze=0, save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v6.0-35-ga4fece8 torch 1.9.1+cu102 CUDA:0 (Tesla V100-SXM2-16GB, 16160.5MB)
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for 1 nodes.
hyperparameters: lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs (RECOMMENDED)
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
WARNING: Dataset not found, nonexistent paths: ['/usr/src/datasets/coco128/images/train2017']
Downloading https://github.com/ultralytics/yolov5/releases/download/v1.0/coco128.zip to coco128.zip...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6.66M/6.66M [00:00<00:00, 114MB/s]
Dataset autodownload success, saved to ../datasets
Downloading https://github.com/ultralytics/yolov5/releases/download/v6.0/yolov5s.pt to yolov5s.pt...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14.0M/14.0M [00:00<00:00, 74.4MB/s]
from n params module arguments
0 -1 1 3520 models.common.Conv [3, 32, 6, 2, 2]
1 -1 1 18560 models.common.Conv [32, 64, 3, 2]
2 -1 1 18816 models.common.C3 [64, 64, 1]
3 -1 1 73984 models.common.Conv [64, 128, 3, 2]
4 -1 2 115712 models.common.C3 [128, 128, 2]
5 -1 1 295424 models.common.Conv [128, 256, 3, 2]
6 -1 3 625152 models.common.C3 [256, 256, 3]
7 -1 1 1180672 models.common.Conv [256, 512, 3, 2]
8 -1 1 1182720 models.common.C3 [512, 512, 1]
9 -1 1 656896 models.common.SPPF [512, 512, 5]
10 -1 1 131584 models.common.Conv [512, 256, 1, 1]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 models.common.Concat [1]
13 -1 1 361984 models.common.C3 [512, 256, 1, False]
14 -1 1 33024 models.common.Conv [256, 128, 1, 1]
15 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
16 [-1, 4] 1 0 models.common.Concat [1]
17 -1 1 90880 models.common.C3 [256, 128, 1, False]
18 -1 1 147712 models.common.Conv [128, 128, 3, 2]
19 [-1, 14] 1 0 models.common.Concat [1]
20 -1 1 296448 models.common.C3 [256, 256, 1, False]
21 -1 1 590336 models.common.Conv [256, 256, 3, 2]
22 [-1, 10] 1 0 models.common.Concat [1]
23 -1 1 1182720 models.common.C3 [512, 512, 1, False]
24 [17, 20, 23] 1 229245 models.yolo.Detect [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 270 layers, 7235389 parameters, 7235389 gradients, 16.5 GFLOPs
Transferred 349/349 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight, 60 weight (no decay), 60 bias
train: Scanning '../datasets/coco128/labels/train2017' images and labels...128 found, 0 missing, 2 empty, 0 corrupted: 100%|███████████████████████████████████████████████████| 128/128 [00:00<00:00, 3608.22it/s]
train: New cache created: ../datasets/coco128/labels/train2017.cache
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
Plotting labels...
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
autoanchor: Analyzing anchors... anchors/target = 4.26, Best Possible Recall (BPR) = 0.9946
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp
Starting training for 3 epochs...
Epoch gpu_mem box obj cls labels img_size
0/2 2.53G 0.04048 0.07428 0.02063 210 640: 12%|█████████████ | 1/8 [00:04<00:29, 4.27s/it]Reducer buckets have been rebuilt in this iteration.
0/2 6.65G 0.04252 0.06146 0.02085 232 640: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:05<00:00, 1.47it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00, 1.30it/s]
all 128 929 0.67 0.533 0.621 0.406
Epoch gpu_mem box obj cls labels img_size
1/2 6.52G 0.04501 0.0647 0.0191 180 640: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 8.40it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 6.39it/s]
all 128 929 0.697 0.536 0.63 0.415
Epoch gpu_mem box obj cls labels img_size
2/2 6.52G 0.04566 0.06475 0.02024 305 640: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 8.41it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 6.32it/s]
all 128 929 0.705 0.547 0.634 0.418
3 epochs completed in 0.004 hours.
Optimizer stripped from runs/train/exp/weights/last.pt, 14.9MB
Optimizer stripped from runs/train/exp/weights/best.pt, 14.9MB
Validating runs/train/exp/weights/best.pt...
Fusing layers...
Model Summary: 213 layers, 7225885 parameters, 0 gradients, 16.5 GFLOPs
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00, 1.50it/s]
all 128 929 0.705 0.547 0.634 0.418
person 128 254 0.807 0.681 0.774 0.509
bicycle 128 6 0.748 0.496 0.545 0.318
car 128 46 0.801 0.351 0.484 0.213
motorcycle 128 5 0.673 0.6 0.8 0.633
airplane 128 6 1 0.83 0.995 0.721
bus 128 7 0.649 0.714 0.694 0.59
train 128 3 0.81 1 0.995 0.632
truck 128 12 0.605 0.25 0.445 0.251
boat 128 6 0.737 0.333 0.47 0.139
traffic light 128 14 0.588 0.143 0.238 0.149
stop sign 128 2 0.628 0.5 0.828 0.663
bench 128 9 0.907 0.444 0.572 0.244
bird 128 16 0.888 0.991 0.988 0.652
cat 128 4 1 0.729 0.836 0.691
dog 128 9 0.918 0.667 0.887 0.547
horse 128 2 0.706 1 0.995 0.697
elephant 128 17 1 0.927 0.946 0.69
bear 128 1 0.523 1 0.995 0.995
zebra 128 4 0.849 1 0.995 0.952
giraffe 128 9 0.813 0.778 0.869 0.576
backpack 128 6 1 0.307 0.452 0.201
umbrella 128 18 0.725 0.556 0.723 0.395
handbag 128 19 0.66 0.103 0.169 0.11
tie 128 7 0.9 0.571 0.691 0.466
suitcase 128 4 1 0.995 0.995 0.621
frisbee 128 5 0.655 0.8 0.798 0.694
skis 128 1 0.618 1 0.995 0.497
snowboard 128 7 1 0.705 0.766 0.558
sports ball 128 6 0.66 0.5 0.622 0.341
kite 128 10 0.558 0.5 0.557 0.204
baseball bat 128 4 0.392 0.5 0.275 0.136
baseball glove 128 7 0.467 0.381 0.327 0.197
skateboard 128 5 0.753 0.612 0.792 0.557
tennis racket 128 7 0.537 0.571 0.538 0.299
bottle 128 18 0.65 0.389 0.484 0.286
wine glass 128 16 0.772 0.875 0.853 0.397
cup 128 36 0.853 0.361 0.493 0.294
fork 128 6 0.379 0.167 0.252 0.194
knife 128 16 0.893 0.625 0.667 0.447
spoon 128 22 0.809 0.387 0.529 0.256
bowl 128 28 0.75 0.571 0.617 0.462
banana 128 1 0 0 0.142 0.0284
sandwich 128 2 0 0 0.0957 0.0743
orange 128 4 1 0 0.578 0.189
broccoli 128 11 0.422 0.182 0.314 0.273
carrot 128 24 0.717 0.542 0.635 0.383
hot dog 128 2 0.399 0.698 0.497 0.465
pizza 128 5 0.629 1 0.831 0.594
donut 128 14 0.693 1 0.952 0.823
cake 128 4 0.73 1 0.895 0.713
chair 128 35 0.486 0.514 0.496 0.228
couch 128 6 0.726 0.333 0.801 0.453
potted plant 128 14 0.793 0.714 0.807 0.448
bed 128 3 1 0 0.746 0.275
dining table 128 13 0.835 0.462 0.476 0.296
toilet 128 2 0.456 0.5 0.566 0.496
tv 128 2 0.752 1 0.995 0.846
laptop 128 3 1 0 0.426 0.185
mouse 128 2 1 0 0.0268 0.0215
remote 128 8 0.714 0.625 0.635 0.506
cell phone 128 8 0.594 0.195 0.427 0.201
microwave 128 3 0.403 1 0.995 0.721
oven 128 5 0.365 0.4 0.427 0.248
sink 128 6 0.344 0.167 0.265 0.161
refrigerator 128 5 0.704 0.8 0.813 0.455
book 128 29 0.603 0.138 0.294 0.132
clock 128 9 0.871 0.778 0.91 0.599
vase 128 2 0.287 1 0.663 0.597
scissors 128 1 1 0 0.0302 0.00603
teddy bear 128 21 0.851 0.381 0.613 0.344
toothbrush 128 5 1 0.477 0.705 0.436
Results saved to runs/train/exp
@AyushExel Can you try to reproduce it using a zero-interaction method (DEBIAN_FRONTEND=noninteractive), passing only predefined options when launching the script?
@Zegorax I can't repro. Will you please paste your output?
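A minimal sketch of preparing a zero-interaction environment for such a run. WANDB_API_KEY and WANDB_MODE are environment variables that wandb reads; whether setting them suppresses the 1/2/3 prompt in this particular training script is an assumption here, not something confirmed in this thread. The helper name noninteractive_env is hypothetical.

```python
import os


def noninteractive_env(api_key=None, base=None):
    """Build an environment dict for launching training without prompts.

    With an API key, wandb can authenticate without `wandb login`.
    Without one, wandb is switched off via WANDB_MODE=disabled.
    """
    env = dict(os.environ if base is None else base)
    env["DEBIAN_FRONTEND"] = "noninteractive"
    if api_key:
        env["WANDB_API_KEY"] = api_key
    else:
        env["WANDB_MODE"] = "disabled"
    return env
```

The resulting dict can be passed as the env argument of subprocess.run when launching train.py, so the same script covers both the logged-in and the disabled case.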
@AyushExel The training happens normally. Only at the end, the process never returns and I have to ctrl-c manually (Therefore, the Jenkins job runs forever)
wandb: - 91.69MB of 91.69MB uploaded (0.00MB deduped)
wandb:
wandb: Run history:
wandb: metrics/mAP_0.5 ▁▁▁▁▁▂▁▂▃▄▃▃▄▄▄▅▆▆▆▅▆▇▇▇▆▇▇▇▇██▇████████
wandb: metrics/mAP_0.5:0.95 ▁▁▁▁▁▁▁▂▂▃▂▃▃▃▃▄▅▅▅▄▆▆▆▆▅▇▇▇▇▇▇▇████████
wandb: metrics/precision ▁▁▁▂█▅▆▅▆▅▅▄▅▅▆▆▇▇▇▆▇▇█▇▇██▇██▇▇▇▇▇▇▇█▇█
wandb: metrics/recall ▁▁▁▁▂▃▂▃▃▄▄▄▅▄▅▆▆▆▆▅▆▆▇▆▆▇▇▇▇▇▇▇███████▇
wandb: train/box_loss ██▇▆▅▅▄▄▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: train/cls_loss █▇▆▅▄▄▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: train/obj_loss █▄▄▅▄▄▄▃▃▃▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▁▁▁▁▁▁▁▁▁▁
wandb: val/box_loss ██▇█▆▅▇▄▄▄▄▄▃▃▃▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: val/cls_loss █▇▇▇▄▄▃▃▃▃▄▄▃▂▃▂▂▁▁▂▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: val/obj_loss ▇▇███▆▆▇▅▅▆▄▆▆▅▃▂▃▃▄▃▂▂▂▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁
wandb: x/lr0 ▁▂▂▃▄▄▅▆▆▇▇████▇▇▇▆▆▆▅▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▂▂▂
wandb: x/lr1 ▁▂▂▃▄▄▅▆▆▇▇████▇▇▇▆▆▆▅▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▂▂▂
wandb: x/lr2 ██▇▇▆▅▅▄▄▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:
wandb: Run summary:
wandb: metrics/mAP_0.5 0.63516
wandb: metrics/mAP_0.5:0.95 0.36128
wandb: metrics/precision 0.69888
wandb: metrics/recall 0.59514
wandb: train/box_loss 0.03237
wandb: train/cls_loss 0.00519
wandb: train/obj_loss 0.00732
wandb: val/box_loss 0.03155
wandb: val/cls_loss 0.00958
wandb: val/obj_loss 0.00619
wandb: x/lr0 0.00101
wandb: x/lr1 0.00101
wandb: x/lr2 0.00101
wandb:
wandb: Synced 5 W&B file(s), 337 media file(s), 1 artifact file(s) and 1 other file(s)
wandb: Synced model_25-10-2021_16-16-13: https://self-hosted-wandb-url-goes-here
wandb: Find logs at: ./wandb/run-20211025_161621-3az1nhlb/logs/debug.log
wandb:
Results saved to model/model_25-10-2021_16-16-13
Destroying process group...
Sending interrupt signal to process
Terminated
script returned exit code 143
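Exit code 143 is 128 + 15, meaning the job was ended by SIGTERM after the manual interrupt. Until the hang itself is resolved, a CI job can at least be bounded with a timeout. The sketch below is a workaround under stated assumptions, not a fix for the root cause; run_with_timeout is a hypothetical helper name.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run `cmd`; return its exit code, or None if it ran longer than `timeout_s`.

    subprocess.run kills the direct child when the timeout expires and then
    raises TimeoutExpired, which is translated to None here.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


if __name__ == "__main__":
    # A command that finishes immediately returns its exit code (0 here).
    print(run_with_timeout([sys.executable, "-c", "pass"], 30))
```

Note that only the direct child is killed on timeout. A distributed launcher that spawns worker processes may leave those workers running, so a CI setup would likely also need process-group cleanup.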
@Zegorax that's very strange. On disabling wandb, you should not see wandb termlogs
@AyushExel Should I create a new issue? I need to have W&B enabled.
@Zegorax oh okay.. I thought we were just talking about wandb disabled. I'll check with wandb enabled
@Zegorax it worked with wandb enabled
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) 2
wandb: You chose 'Use an existing W&B account'
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:
wandb: Appending key for api.wandb.ai to your netrc file: /home/jupyter/.netrc
train: weights=yolov5s.pt, cfg=, data=coco128.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=3, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, patience=100, freeze=0, save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v6.0-35-ga4fece8 torch 1.9.0+cu102 CUDA:0 (Tesla V100-SXM2-16GB, 16160.5MB)
CUDA:1 (Tesla V100-SXM2-16GB, 16160.5MB)
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for 2 nodes.
hyperparameters: lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
2021-10-26 07:18:39.051059: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
wandb: Currently logged in as: cayush (use `wandb login --relogin` to force relogin)
2021-10-26 07:18:42.738633: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
wandb: Tracking run with wandb version 0.12.5
wandb: Syncing run iconic-eon-745
wandb: ⭐️ View project at https://wandb.ai/cayush/yoloV5
wandb: 🚀 View run at https://wandb.ai/cayush/yoloV5/runs/dvets0rr
wandb: Run data is saved locally in /home/jupyter/yolov5/wandb/run-20211026_071841-dvets0rr
wandb: Run `wandb offline` to turn off syncing.
InvalidVersionSpec: Invalid version '1.0<2': invalid character(s)
from n params module arguments
0 -1 1 3520 models.common.Conv [3, 32, 6, 2, 2]
1 -1 1 18560 models.common.Conv [32, 64, 3, 2]
2 -1 1 18816 models.common.C3 [64, 64, 1]
3 -1 1 73984 models.common.Conv [64, 128, 3, 2]
4 -1 2 115712 models.common.C3 [128, 128, 2]
5 -1 1 295424 models.common.Conv [128, 256, 3, 2]
6 -1 3 625152 models.common.C3 [256, 256, 3]
7 -1 1 1180672 models.common.Conv [256, 512, 3, 2]
8 -1 1 1182720 models.common.C3 [512, 512, 1]
9 -1 1 656896 models.common.SPPF [512, 512, 5]
10 -1 1 131584 models.common.Conv [512, 256, 1, 1]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 models.common.Concat [1]
13 -1 1 361984 models.common.C3 [512, 256, 1, False]
14 -1 1 33024 models.common.Conv [256, 128, 1, 1]
15 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
16 [-1, 4] 1 0 models.common.Concat [1]
17 -1 1 90880 models.common.C3 [256, 128, 1, False]
18 -1 1 147712 models.common.Conv [128, 128, 3, 2]
19 [-1, 14] 1 0 models.common.Concat [1]
20 -1 1 296448 models.common.C3 [256, 256, 1, False]
21 -1 1 590336 models.common.Conv [256, 256, 3, 2]
22 [-1, 10] 1 0 models.common.Concat [1]
23 -1 1 1182720 models.common.C3 [512, 512, 1, False]
24 [17, 20, 23] 1 229245 models.yolo.Detect [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 270 layers, 7235389 parameters, 7235389 gradients
Transferred 349/349 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight, 60 weight (no decay), 60 bias
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Plotting labels...
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████| 128/128 [00:01<?, ?it/s]
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████| 128/128 [00:02<?, ?it/s]
autoanchor: Analyzing anchors... anchors/target = 4.26, Best Possible Recall (BPR) = 0.9946
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp3
Starting training for 3 epochs...
Epoch gpu_mem box obj cls labels img_size
0/2 1.89G 0.04361 0.07621 0.02057 131 640: 12%|███████████████ | 1/8 [00:04<00:32, 4.64s/it]Reducer buckets have been rebuilt in this iteration.
0/2 6.28G 0.04354 0.06284 0.02263 95 640: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:05<00:00, 1.45it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:05<00:00, 1.53it/s]
all 128 929 0.678 0.535 0.622 0.407
Epoch gpu_mem box obj cls labels img_size
1/2 6.43G 0.04465 0.06804 0.02394 117 640: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 8.01it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00, 4.01it/s]
all 128 929 0.682 0.548 0.632 0.411
Epoch gpu_mem box obj cls labels img_size
2/2 6.43G 0.04337 0.07026 0.02062 91 640: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 9.37it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00, 3.88it/s]
all 128 929 0.619 0.598 0.634 0.416
3 epochs completed in 0.006 hours.
Optimizer stripped from runs/train/exp3/weights/last.pt, 14.8MB
Optimizer stripped from runs/train/exp3/weights/best.pt, 14.8MB
Validating runs/train/exp3/weights/best.pt...
Fusing layers...
Model Summary: 213 layers, 7225885 parameters, 0 gradients
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00, 2.83it/s]
all 128 929 0.618 0.6 0.635 0.417
person 128 254 0.72 0.736 0.774 0.508
bicycle 128 6 0.546 0.667 0.561 0.326
car 128 46 0.598 0.37 0.461 0.201
motorcycle 128 5 0.66 0.6 0.812 0.637
airplane 128 6 1 0.947 0.995 0.764
bus 128 7 0.536 0.714 0.71 0.605
train 128 3 0.696 1 0.995 0.632
truck 128 12 0.427 0.333 0.473 0.288
boat 128 6 0.403 0.333 0.465 0.14
traffic light 128 14 0.521 0.158 0.24 0.16
stop sign 128 2 0.576 0.5 0.828 0.663
bench 128 9 0.702 0.444 0.571 0.236
bird 128 16 0.801 1 0.988 0.647
cat 128 4 0.939 0.75 0.828 0.714
dog 128 9 0.761 0.667 0.852 0.527
horse 128 2 0.579 1 0.995 0.697
elephant 128 17 0.942 0.882 0.943 0.692
bear 128 1 0.388 1 0.995 0.995
zebra 128 4 0.827 1 0.995 0.908
giraffe 128 9 0.74 0.889 0.851 0.613
backpack 128 6 0.661 0.333 0.496 0.214
umbrella 128 18 0.617 0.611 0.731 0.395
handbag 128 19 0.514 0.105 0.179 0.112
tie 128 7 0.605 0.571 0.687 0.461
suitcase 128 4 0.709 1 0.995 0.54
frisbee 128 5 0.54 0.8 0.798 0.705
skis 128 1 0.516 1 0.995 0.497
snowboard 128 7 0.869 0.714 0.767 0.555
sports ball 128 6 0.531 0.5 0.581 0.325
kite 128 10 0.583 0.561 0.564 0.206
baseball bat 128 4 0.411 0.5 0.282 0.087
baseball glove 128 7 0.339 0.429 0.366 0.222
skateboard 128 5 0.795 0.779 0.736 0.5
tennis racket 128 7 0.442 0.571 0.571 0.324
bottle 128 18 0.461 0.5 0.476 0.29
wine glass 128 16 0.678 0.92 0.885 0.415
cup 128 36 0.824 0.361 0.504 0.32
fork 128 6 0.567 0.234 0.341 0.226
knife 128 16 0.523 0.625 0.674 0.452
spoon 128 22 0.602 0.5 0.532 0.26
bowl 128 28 0.668 0.571 0.63 0.448
banana 128 1 0.147 1 0.166 0.0498
sandwich 128 2 0 0 0.133 0.105
orange 128 4 1 0 0.545 0.151
broccoli 128 11 0.298 0.311 0.236 0.205
carrot 128 24 0.481 0.583 0.631 0.425
hot dog 128 2 0.463 1 0.497 0.497
pizza 128 5 0.599 1 0.824 0.566
donut 128 14 0.675 1 0.946 0.85
cake 128 4 0.698 1 0.895 0.704
chair 128 35 0.408 0.543 0.46 0.221
couch 128 6 1 0.482 0.829 0.504
potted plant 128 14 0.795 0.786 0.819 0.467
bed 128 3 0.992 0.333 0.753 0.269
dining table 128 13 0.571 0.462 0.438 0.242
toilet 128 2 0.388 0.5 0.557 0.49
tv 128 2 0.672 1 0.995 0.846
laptop 128 3 1 0 0.415 0.193
mouse 128 2 1 0 0.0375 0.03
remote 128 8 0.596 0.625 0.636 0.506
cell phone 128 8 0.579 0.375 0.392 0.184
microwave 128 3 0.343 1 0.995 0.786
oven 128 5 0.301 0.4 0.432 0.249
sink 128 6 0.338 0.167 0.294 0.168
refrigerator 128 5 0.69 0.8 0.815 0.506
book 128 29 0.524 0.229 0.295 0.125
clock 128 9 0.787 0.778 0.895 0.588
vase 128 2 0.181 1 0.663 0.597
scissors 128 1 1 0 0.0332 0.00663
teddy bear 128 21 0.814 0.418 0.608 0.349
toothbrush 128 5 0.703 0.6 0.739 0.191
wandb: Waiting for W&B process to finish, PID 4215... (success).
wandb:
wandb: Run history:
wandb: metrics/mAP_0.5 ▁▆█
wandb: metrics/mAP_0.5:0.95 ▁▄█
wandb: metrics/precision ▇█▁
wandb: metrics/recall ▁▂█
wandb: train/box_loss ▂█▁
wandb: train/cls_loss ▅█▁
wandb: train/obj_loss ▁▆█
wandb: val/box_loss █▄▁
wandb: val/cls_loss █▄▁
wandb: val/obj_loss █▅▁
wandb: x/lr0 ▁█▂
wandb: x/lr1 ▁█▂
wandb: x/lr2 █▅▁
wandb:
wandb: Run summary:
wandb: metrics/mAP_0.5 0.63423
wandb: metrics/mAP_0.5:0.95 0.4157
wandb: metrics/precision 0.61857
wandb: metrics/recall 0.59793
wandb: train/box_loss 0.04337
wandb: train/cls_loss 0.02062
wandb: train/obj_loss 0.07026
wandb: val/box_loss 0.04014
wandb: val/cls_loss 0.01355
wandb: val/obj_loss 0.0422
wandb: x/lr0 7e-05
wandb: x/lr1 7e-05
wandb: x/lr2 0.09777
wandb:
wandb: Synced 6 W&B file(s), 113 media file(s), 1 artifact file(s) and 1 other file(s)
wandb: Synced iconic-eon-745: https://wandb.ai/cayush/yoloV5/runs/dvets0rr
wandb: Find logs at: ./wandb/run-20211026_071841-dvets0rr/logs/debug.log
wandb:
Results saved to runs/train/exp3
Destroying process group...
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py:71: FutureWarning: This is an experimental API and will be changed in future.
"This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0005068778991699219 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "4116", "role": "default", "hostname": "ac-vm2.c.playground-111.internal", "state": "SUCCEEDED", "total_run_time": 100, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [2]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "4117", "role": "default", "hostname": "ac-vm2.c.playground-111.internal", "state": "SUCCEEDED", "total_run_time": 100, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [1], \"role_rank\": [1], \"role_world_size\": [2]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "ac-vm2.c.playground-111.internal", "state": "SUCCEEDED", "total_run_time": 100, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}}
(base) jupyter@ac-vm2:~/yolov5$
What version of the wandb client are you using? Please try updating it with `pip install --upgrade wandb` and let me know if you still see this problem.
@AyushExel I'm also using the latest version of W&B. My system runs as a Jenkins job, so everything is reinstalled on each run and always uses the latest version of all repos.
@AyushExel Can you try to reproduce this in a non-interactive environment, for example by setting WANDB_API_KEY=your-key?
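A minimal sketch of the non-interactive check suggested above, assuming the login prompt can only appear when no credentials or mode are configured. `wandb_may_prompt` is a hypothetical helper, not part of wandb or YOLOv5. It inspects only the `WANDB_API_KEY` and `WANDB_MODE` environment variables and ignores credentials saved by an earlier `wandb login`.

```python
import os


def wandb_may_prompt(env=None):
    """Return True if wandb might ask for a login choice interactively.

    Hypothetical helper: it looks only at environment variables and does
    not check credentials stored by a previous `wandb login`.
    """
    env = os.environ if env is None else env
    if env.get("WANDB_API_KEY"):
        return False  # an API key is supplied, so no prompt is expected
    # Assumption: offline and disabled modes do not require a login.
    if env.get("WANDB_MODE", "").lower() in {"offline", "disabled"}:
        return False
    return True


if __name__ == "__main__":
    if wandb_may_prompt():
        print("wandb may prompt for login; set WANDB_API_KEY before DDP training")
```

Calling this at the start of a CI job (for example the Jenkins setup mentioned above) would flag a missing key before the DDP workers start.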
@AyushExel Have you been able to reproduce the problem?
@Zegorax yes, I ran this in a non-interactive Docker environment and the process finished successfully.
wandb: Currently logged in as: cayush (use `wandb login --relogin` to force relogin)
train: weights=yolov5s.pt, cfg=, data=data/coco128.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=2, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, patience=100, freeze=0, save_period=-1, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
YOLOv5 🚀 v6.0-43-g19c8760 torch 1.10.0+cu102 CUDA:0 (Tesla V100-SXM2-16GB, 16160.5MB)
hyperparameters: lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
wandb: Tracking run with wandb version 0.12.6
wandb: Syncing run visionary-galaxy-746
wandb: View project at https://wandb.ai/cayush/yoloV5
wandb: View run at https://wandb.ai/cayush/yoloV5/runs/1m5xl3kf
wandb: Run data is saved locally in /usr/src/app/wandb/run-20211102_113152-1m5xl3kf
wandb: Run `wandb offline` to turn off syncing.
100% 6.66M/6.66M [00:00<00:00, 72.5MB/s]
100% 14.0M/14.0M [00:00<00:00, 89.0MB/s]
from n params module arguments
0 -1 1 3520 models.common.Conv [3, 32, 6, 2, 2]
1 -1 1 18560 models.common.Conv [32, 64, 3, 2]
2 -1 1 18816 models.common.C3 [64, 64, 1]
3 -1 1 73984 models.common.Conv [64, 128, 3, 2]
4 -1 2 115712 models.common.C3 [128, 128, 2]
5 -1 1 295424 models.common.Conv [128, 256, 3, 2]
6 -1 3 625152 models.common.C3 [256, 256, 3]
7 -1 1 1180672 models.common.Conv [256, 512, 3, 2]
8 -1 1 1182720 models.common.C3 [512, 512, 1]
9 -1 1 656896 models.common.SPPF [512, 512, 5]
10 -1 1 131584 models.common.Conv [512, 256, 1, 1]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 models.common.Concat [1]
13 -1 1 361984 models.common.C3 [512, 256, 1, False]
14 -1 1 33024 models.common.Conv [256, 128, 1, 1]
15 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
16 [-1, 4] 1 0 models.common.Concat [1]
17 -1 1 90880 models.common.C3 [256, 128, 1, False]
18 -1 1 147712 models.common.Conv [128, 128, 3, 2]
19 [-1, 14] 1 0 models.common.Concat [1]
20 -1 1 296448 models.common.C3 [256, 256, 1, False]
21 -1 1 590336 models.common.Conv [256, 256, 3, 2]
22 [-1, 10] 1 0 models.common.Concat [1]
23 -1 1 1182720 models.common.C3 [512, 512, 1, False]
24 [17, 20, 23] 1 229245 models.yolo.Detect [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 270 layers, 7235389 parameters, 7235389 gradients, 16.5 GFLOPs
Transferred 349/349 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight, 60 weight (no decay), 60 bias
DP not recommended, instead use torch.distributed.run for best DDP Multi-GPU results.
See Multi-GPU Tutorial at https://docs.ultralytics.com/yolov5/tutorials/multi_gpu_training to get started.
WARNING: Dataset not found, nonexistent paths: ['/usr/src/datasets/coco128/images/train2017']
Downloading https://github.com/ultralytics/yolov5/releases/download/v1.0/coco128.zip to coco128.zip...
Dataset autodownload success, saved to ../datasets
Downloading https://github.com/ultralytics/yolov5/releases/download/v6.0/yolov5s.pt to yolov5s.pt...
train: Scanning '../datasets/coco128/labels/train2017' images and labels...128 found, 0 missing, 2 empty, 0 corrupted: 100% 128/128 [00:00<00:00, 5023.82it/s]
train: New cache created: ../datasets/coco128/labels/train2017.cache
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100% 128/128 [00:00<?, ?it/s]
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp
Starting training for 2 epochs...
Epoch gpu_mem box obj cls labels img_size
0/1 2.33G 0.04581 0.06708 0.02386 226 640: 100% 8/8 [00:06<00:00, 1.17it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100% 4/4 [00:03<00:00, 1.30it/s]
all 128 929 0.682 0.54 0.624 0.415
Epoch gpu_mem box obj cls labels img_size
1/1 4.02G 0.04509 0.07335 0.02124 223 640: 100% 8/8 [00:01<00:00, 7.09it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100% 4/4 [00:02<00:00, 1.35it/s]
all 128 929 0.695 0.544 0.63 0.417
2 epochs completed in 0.005 hours.
Validating runs/train/exp/weights/best.pt...
Fusing layers...
Model Summary: 213 layers, 7225885 parameters, 0 gradients, 16.5 GFLOPs
Class Images Labels P R mAP@.5 mAP@.5:.95: 100% 4/4 [00:05<00:00, 1.27s/it]
all 128 929 0.695 0.543 0.629 0.417
person 128 254 0.815 0.669 0.771 0.507
bicycle 128 6 0.764 0.545 0.609 0.326
car 128 46 0.783 0.348 0.477 0.217
motorcycle 128 5 0.675 0.6 0.762 0.569
airplane 128 6 1 0.795 0.995 0.749
bus 128 7 0.63 0.714 0.715 0.605
train 128 3 0.739 1 0.995 0.665
truck 128 12 0.641 0.333 0.454 0.22
boat 128 6 0.796 0.333 0.455 0.126
traffic light 128 14 0.544 0.143 0.237 0.158
stop sign 128 2 0.63 0.5 0.828 0.713
bench 128 9 0.999 0.444 0.577 0.234
bird 128 16 0.841 1 0.985 0.652
cat 128 4 0.892 0.75 0.836 0.691
dog 128 9 0.858 0.667 0.858 0.541
horse 128 2 0.688 1 0.995 0.697
elephant 128 17 0.957 0.882 0.94 0.685
bear 128 1 0.575 1 0.995 0.995
zebra 128 4 0.854 1 0.995 0.921
giraffe 128 9 0.803 0.778 0.912 0.573
backpack 128 6 1 0.314 0.479 0.204
umbrella 128 18 0.733 0.556 0.723 0.405
handbag 128 19 0.612 0.105 0.163 0.111
tie 128 7 0.79 0.571 0.701 0.436
suitcase 128 4 1 0.876 0.995 0.621
frisbee 128 5 0.624 0.8 0.798 0.723
skis 128 1 0.608 1 0.995 0.497
snowboard 128 7 0.957 0.714 0.764 0.557
sports ball 128 6 0.682 0.5 0.576 0.32
kite 128 10 0.634 0.522 0.574 0.222
baseball bat 128 4 0.456 0.434 0.303 0.127
baseball glove 128 7 0.372 0.429 0.327 0.176
skateboard 128 5 0.705 0.492 0.734 0.544
tennis racket 128 7 0.558 0.571 0.537 0.297
bottle 128 18 0.657 0.426 0.488 0.289
wine glass 128 16 0.684 0.812 0.79 0.379
cup 128 36 0.82 0.333 0.492 0.317
fork 128 6 0.374 0.167 0.245 0.193
knife 128 16 0.825 0.625 0.654 0.438
spoon 128 22 0.832 0.364 0.551 0.276
bowl 128 28 0.751 0.539 0.636 0.463
banana 128 1 0 0 0.142 0.0284
sandwich 128 2 0 0 0.0957 0.072
orange 128 4 1 0 0.62 0.287
broccoli 128 11 0.379 0.182 0.287 0.247
carrot 128 24 0.696 0.478 0.611 0.361
hot dog 128 2 0.398 0.694 0.497 0.465
pizza 128 5 0.623 1 0.824 0.561
donut 128 14 0.701 1 0.963 0.843
cake 128 4 0.724 1 0.945 0.741
chair 128 35 0.5 0.514 0.483 0.229
couch 128 6 0.638 0.333 0.696 0.388
potted plant 128 14 0.799 0.714 0.778 0.456
bed 128 3 1 0 0.641 0.245
dining table 128 13 0.854 0.452 0.479 0.315
toilet 128 2 0.511 0.5 0.54 0.528
tv 128 2 0.732 1 0.995 0.846
laptop 128 3 1 0 0.426 0.165
mouse 128 2 1 0 0.0277 0.0222
remote 128 8 0.72 0.625 0.635 0.488
cell phone 128 8 0.45 0.125 0.374 0.198
microwave 128 3 0.428 1 0.995 0.764
oven 128 5 0.362 0.4 0.432 0.242
sink 128 6 0.347 0.167 0.268 0.156
refrigerator 128 5 0.692 0.8 0.811 0.435
book 128 29 0.686 0.152 0.293 0.131
clock 128 9 0.831 0.778 0.885 0.571
vase 128 2 0.241 1 0.663 0.622
scissors 128 1 1 0 0.0243 0.00485
teddy bear 128 21 0.864 0.381 0.618 0.341
toothbrush 128 5 1 0.583 0.664 0.412
Plotting labels...
autoanchor: Analyzing anchors... anchors/target = 4.27, Best Possible Recall (BPR) = 0.9935
Optimizer stripped from runs/train/exp/weights/last.pt, 14.9MB
Optimizer stripped from runs/train/exp/weights/best.pt, 14.9MB
wandb: Waiting for W&B process to finish, PID 97... (success).
wandb:
wandb: Run history:
wandb: metrics/mAP_0.5 ▁█
wandb: metrics/mAP_0.5:0.95 ▁█
wandb: metrics/precision ▁█
wandb: metrics/recall ▁█
wandb: train/box_loss █▁
wandb: train/cls_loss █▁
wandb: train/obj_loss ▁█
wandb: val/box_loss █▁
wandb: val/cls_loss █▁
wandb: val/obj_loss █▁
wandb: x/lr0 ▁█
wandb: x/lr1 ▁█
wandb: x/lr2 █▁
wandb:
wandb: Run summary:
wandb: metrics/mAP_0.5 0.62979
wandb: metrics/mAP_0.5:0.95 0.41725
wandb: metrics/precision 0.69505
wandb: metrics/recall 0.544
wandb: train/box_loss 0.04509
wandb: train/cls_loss 0.02124
wandb: train/obj_loss 0.07335
wandb: val/box_loss 0.04124
wandb: val/cls_loss 0.01407
wandb: val/obj_loss 0.03989
wandb: x/lr0 8e-05
wandb: x/lr1 8e-05
wandb: x/lr2 0.09858
wandb:
wandb: Synced 6 W&B file(s), 81 media file(s), 1 artifact file(s) and 1 other file(s)
wandb: Synced visionary-galaxy-746: https://wandb.ai/cayush/yoloV5/runs/1m5xl3kf
wandb: Find logs at: ./wandb/run-20211102_113152-1m5xl3kf/logs/debug.log
wandb:
Results saved to runs/train/exp
(base) jupyter@ac-vm2:~/yolov5$
I'm also seeing this behavior. I thought it was because I'm training on 2x A100.
@Davidnet you should be able to train DDP 8x A100 successfully in Docker. Can you verify your error is reproducible with the latest Docker image and provide @AyushExel steps to reproduce please? Thanks!
@Davidnet yes, please. I'd like to reproduce this so I can get someone to look into it asap. Please verify with wandb enabled and disabled. If the error is caused by wandb, it should only occur when wandb is enabled. Fixing all DDP problems is a very high priority for us.
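The enabled/disabled comparison requested above could be scripted roughly as follows. This is a sketch under assumptions: `run_once` and `compare_wandb` are hypothetical helpers, the disabled case relies on wandb honoring `WANDB_MODE=disabled`, and a run that exceeds the timeout is treated as a hang.

```python
import os
import subprocess
import sys


def run_once(cmd, env_overrides, timeout_s):
    """Run cmd with extra environment variables and classify the outcome."""
    env = {**os.environ, **env_overrides}
    try:
        result = subprocess.run(cmd, env=env, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hang"  # the child was killed after timeout_s seconds
    return "ok" if result.returncode == 0 else "failed"


def compare_wandb(cmd, timeout_s):
    """Run the same command with wandb left as-is and with it disabled."""
    return {
        "wandb enabled": run_once(cmd, {}, timeout_s),
        "wandb disabled": run_once(cmd, {"WANDB_MODE": "disabled"}, timeout_s),
    }


# Example invocation (not executed here); --nproc_per_node 2 is an assumed
# value for a 2-GPU machine, the other arguments come from this thread:
# compare_wandb([sys.executable, "-m", "torch.distributed.launch",
#                "--nproc_per_node", "2", "train.py",
#                "--data", "coco128.yaml", "--epochs", "3"], timeout_s=1800)
```

If only the "wandb enabled" entry comes back as "hang" or "failed", that points at wandb. Note that the timeout kills only the launcher process, so worker processes spawned by it may need to be cleaned up manually after a hang.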
👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!
Hello, when I try to train on multiple GPUs using the Docker image, I get the error below. I use Ubuntu 18.04 and Python 3.8. <<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>