Open urban-drummer opened 1 month ago
Did you look at and follow YOLOv5 Multi-GPU Tutorial?
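Hello, I followed the multi-GPU training procedure strictly and set breakpoints to confirm that DDP mode was enabled. I also tried training on a single specified GPU. In both cases I get the same max_pool3d_with_indices_backward_cuda error.
Thank you for your help. For now I have added torch.use_deterministic_algorithms(True, warn_only=True) before scaler.scale(loss).backward(); with that change the model runs, but with warnings. I am still investigating the error. It is most likely a version compatibility problem.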
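For anyone who wants to reproduce the workaround outside the training script, here is a minimal sketch. It is not the actual segment/train.py code: the tensor shape and the direct F.max_pool3d call are stand-ins I am assuming for whatever layer in the model uses 3D max pooling. Only the torch.use_deterministic_algorithms(True, warn_only=True) call is taken from the post. On a CUDA device the backward pass should warn instead of raising; on CPU it is expected to run silently, because the error names the CUDA kernel.

```python
import torch
import torch.nn.functional as F

# Use the GPU when available; the error in this issue is specific to the CUDA kernel.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-in input (N, C, D, H, W); the real model's shapes are not known here.
x = torch.randn(1, 2, 4, 8, 8, device=device, requires_grad=True)

# Keep deterministic mode on, but downgrade "no deterministic implementation"
# errors to warnings. This must run before backward() is called.
torch.use_deterministic_algorithms(True, warn_only=True)

loss = F.max_pool3d(x, kernel_size=2).sum()
loss.backward()  # on CUDA this is expected to warn rather than raise

print(x.grad.shape)
```

Note that with warn_only=True the affected operation still runs non-deterministically, so results may differ slightly between runs even with a fixed seed.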
segment/train: weights=/media/dell/lhx/yolo/ASF-YOLO/yolov5l-seg.pt, cfg=/media/dell/lhx/yolo/ASF-YOLO/models/segment/asf-yolo.yaml, data=/media/dell/lhx/yolo/ASF-YOLO/data/bcc.yaml, hyp=/media/dell/lhx/yolo/ASF-YOLO/data/hyps/hyp.scratch-low.yaml, epochs=100, batch_size=8, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1,2,3, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=../runs_2/train-seg, name=improve, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, mask_ratio=4, no_overlap=False
YOLOv5 2024-5-30 Python-3.8.0 torch-2.3.1+cu121 CUDA:0 (NVIDIA GeForce RTX 3090, 24260MiB)
CUDA:1 (NVIDIA GeForce RTX 3090, 24260MiB)
CUDA:2 (NVIDIA GeForce RTX 3090, 24260MiB)
CUDA:3 (NVIDIA GeForce RTX 3090, 24260MiB)
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir ../runs_2/train-seg', view at http://localhost:6006/
Overriding model.yaml nc=80 with nc=1
0 -1 1 7040 models.common.Conv [3, 64, 6, 2, 2]
1 -1 1 73984 models.common.Conv [64, 128, 3, 2]
2 -1 3 156928 models.common.C3 [128, 128, 3]
3 -1 1 295424 models.common.Conv [128, 256, 3, 2]
4 -1 6 1118208 models.common.C3 [256, 256, 6]
5 -1 1 1180672 models.common.Conv [256, 512, 3, 2]
6 -1 9 6433792 models.common.C3 [512, 512, 9]
7 -1 1 4720640 models.common.Conv [512, 1024, 3, 2]
8 -1 3 9971712 models.common.C3 [1024, 1024, 3]
9 -1 1 2624512 models.common.SPPF [1024, 1024, 5]
10 -1 1 525312 models.common.Conv [1024, 512, 1, 1]
11 4 1 132096 models.common.Conv [256, 512, 1, 1]
12 [-1, 6, -2] 1 0 models.common.Zoom_cat [512]
13 -1 3 3019776 models.common.C3 [1536, 512, 3, False]
14 -1 1 131584 models.common.Conv [512, 256, 1, 1]
15 2 1 33280 models.common.Conv [128, 256, 1, 1]
16 [-1, 4, -2] 1 0 models.common.Zoom_cat [256]
17 -1 3 756224 models.common.C3 [768, 256, 3, False]
18 -1 1 590336 models.common.Conv [256, 256, 3, 2]
19 [-1, 14] 1 0 models.common.Concat [1]
20 -1 3 2495488 models.common.C3 [512, 512, 3, False]
21 -1 1 2360320 models.common.Conv [512, 512, 3, 2]
22 [-1, 10] 1 0 models.common.Concat [1]
23 -1 3 9971712 models.common.C3 [1024, 1024, 3, False]
24 [4, 6, 8] 1 460544 models.common.ScalSeq [256]
25 [17, -1] 1 12325 models.common.attention_model [256]
26 [-1, 20, 23] 1 1393558 models.yolo.Segment [1, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], 32, 256, [256, 512, 1024]]
asf-yolo summary: 407 layers, 48465467 parameters, 48465467 gradients, 155.4 GFLOPs
Transferred 602/671 items from /media/dell/lhx/yolo/ASF-YOLO/yolov5l-seg.pt
AMP: checks passed ✅
optimizer: SGD(lr=0.01) with parameter groups 110 weight(decay=0.0), 116 weight(decay=0.0005), 114 bias
WARNING ⚠️ DP not recommended, use torch.distributed.run for best DDP Multi-GPU results. See Multi-GPU Tutorial at https://github.com/ultralytics/yolov5/issues/475 to get started.
train: Scanning /media/dell/lhx/yolo/ASF-YOLO/datasets/BCC/labels/train.cache... 128 images, 0 backgrounds, 0 corrupt: 100%|██████████| 128/128 00:00
val: Scanning /media/dell/lhx/yolo/ASF-YOLO/datasets/BCC/labels/val.cache... 32 images, 0 backgrounds, 0 corrupt: 100%|██████████| 32/32 00:00
AutoAnchor: 4.36 anchors/target, 0.970 Best Possible Recall (BPR). Anchors are a poor fit to dataset ⚠️, attempting to improve...
AutoAnchor: WARNING ⚠️ Extremely small objects found: 47 of 1235 labels are <3 pixels in size
AutoAnchor: Running kmeans for 9 anchors on 1235 points...
AutoAnchor: Evolving anchors with Genetic Algorithm: fitness = 0.7403: 100%|██████████| 1000/1000 00:00
AutoAnchor: thr=0.25: 0.9571 best possible recall, 6.31 anchors past thr
AutoAnchor: n=9, img_size=640, metric_all=0.391/0.743-mean/best, past_thr=0.495-mean: 25,43, 88,52, 51,155, 92,121, 163,129, 116,183, 236,232, 160,418, 350,452
AutoAnchor: Done ⚠️ (original anchors better than new anchors, proceeding with original anchors)
Plotting labels to ../runs_2/train-seg/improve3/labels.jpg...
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to ../runs_2/train-seg/improve3
Starting training for 100 epochs...
0%| | 0/16 00:03
Traceback (most recent call last):
  File "/media/dell/lhx/yolo/ASF-YOLO/segment/train.py", line 658, in <module>
    main(opt)
  File "/media/dell/lhx/yolo/ASF-YOLO/segment/train.py", line 554, in main
    train(opt.hyp, opt, device, callbacks)
  File "/media/dell/lhx/yolo/ASF-YOLO/segment/train.py", line 317, in train
    scaler.scale(loss).backward()
  File "/home/leihaoxiang/.conda/envs/yolo/lib/python3.8/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/home/leihaoxiang/.conda/envs/yolo/lib/python3.8/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/home/leihaoxiang/.conda/envs/yolo/lib/python3.8/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: max_pool3d_with_indices_backward_cuda does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation, or you can use the 'warn_only=True' option, if that's acceptable for your application. You can also file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation.
Process finished with exit code 1
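The error message also suggests turning determinism off locally rather than globally. One possible way to do that is sketched below, under some assumptions: nondeterministic_region is a hypothetical helper name, not part of PyTorch or YOLOv5, and the input tensor is a made-up stand-in for the real model. Because the failing kernel runs during the backward pass, the sketch wraps the backward() call, which relaxes determinism for the whole backward pass rather than for max_pool3d alone.

```python
import contextlib

import torch
import torch.nn.functional as F


@contextlib.contextmanager
def nondeterministic_region():
    """Hypothetical helper: disable deterministic mode inside the block, then restore it."""
    prev_mode = torch.are_deterministic_algorithms_enabled()
    prev_warn_only = torch.is_deterministic_algorithms_warn_only_enabled()
    torch.use_deterministic_algorithms(False)
    try:
        yield
    finally:
        # Restore whatever setting was active before entering the block.
        torch.use_deterministic_algorithms(prev_mode, warn_only=prev_warn_only)


device = "cuda" if torch.cuda.is_available() else "cpu"

# Mimic a training script that enables strict deterministic mode.
torch.use_deterministic_algorithms(True)

# Stand-in input (N, C, D, H, W); shapes are assumptions for illustration only.
x = torch.randn(1, 2, 4, 8, 8, device=device, requires_grad=True)
loss = F.max_pool3d(x, kernel_size=2).sum()

# The max_pool3d backward kernel runs here, so this is the call that gets wrapped.
with nondeterministic_region():
    loss.backward()

# Strict deterministic mode is active again after the block.
print(torch.are_deterministic_algorithms_enabled())
```

In the training loop the equivalent change would be to wrap scaler.scale(loss).backward() in the same context manager. The trade-off is the same as with warn_only=True: the backward pass is no longer guaranteed to be reproducible.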