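Hello, I have also run into the deterministic-algorithms warning that stops the model from running. As with the problem reported in the comments, I am training on four RTX 3090 GPUs with the device set to 0,1,2,3. Specifying a single GPU immediately gives a CUDA error. I would appreciate any advice.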

urban-drummer commented 1 month ago

segment/train: weights=/media/dell/lhx/yolo/ASF-YOLO/yolov5l-seg.pt, cfg=/media/dell/lhx/yolo/ASF-YOLO/models/segment/asf-yolo.yaml, data=/media/dell/lhx/yolo/ASF-YOLO/data/bcc.yaml, hyp=/media/dell/lhx/yolo/ASF-YOLO/data/hyps/hyp.scratch-low.yaml, epochs=100, batch_size=8, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1,2,3, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=../runs_2/train-seg, name=improve, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, mask_ratio=4, no_overlap=False
YOLOv5  2024-5-30 Python-3.8.0 torch-2.3.1+cu121
CUDA:0 (NVIDIA GeForce RTX 3090, 24260MiB)
CUDA:1 (NVIDIA GeForce RTX 3090, 24260MiB)
CUDA:2 (NVIDIA GeForce RTX 3090, 24260MiB)
CUDA:3 (NVIDIA GeForce RTX 3090, 24260MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir ../runs_2/train-seg', view at http://localhost:6006/
Overriding model.yaml nc=80 with nc=1

                 from  n    params  module                                  arguments
  0                -1  1      7040  models.common.Conv                      [3, 64, 6, 2, 2]
  1                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  2                -1  3    156928  models.common.C3                        [128, 128, 3]
  3                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  4                -1  6   1118208  models.common.C3                        [256, 256, 6]
  5                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  6                -1  9   6433792  models.common.C3                        [512, 512, 9]
  7                -1  1   4720640  models.common.Conv                      [512, 1024, 3, 2]
  8                -1  3   9971712  models.common.C3                        [1024, 1024, 3]
  9                -1  1   2624512  models.common.SPPF                      [1024, 1024, 5]
 10                -1  1    525312  models.common.Conv                      [1024, 512, 1, 1]
 11                 4  1    132096  models.common.Conv                      [256, 512, 1, 1]
 12       [-1, 6, -2]  1         0  models.common.Zoom_cat                  [512]
 13                -1  3   3019776  models.common.C3                        [1536, 512, 3, False]
 14                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 15                 2  1     33280  models.common.Conv                      [128, 256, 1, 1]
 16       [-1, 4, -2]  1         0  models.common.Zoom_cat                  [256]
 17                -1  3    756224  models.common.C3                        [768, 256, 3, False]
 18                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  3   2495488  models.common.C3                        [512, 512, 3, False]
 21                -1  1   2360320  models.common.Conv                      [512, 512, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  3   9971712  models.common.C3                        [1024, 1024, 3, False]
 24         [4, 6, 8]  1    460544  models.common.ScalSeq                   [256]
 25          [17, -1]  1     12325  models.common.attention_model           [256]
 26      [-1, 20, 23]  1   1393558  models.yolo.Segment                     [1, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], 32, 256, [256, 512, 1024]]
asf-yolo summary: 407 layers, 48465467 parameters, 48465467 gradients, 155.4 GFLOPs

Transferred 602/671 items from /media/dell/lhx/yolo/ASF-YOLO/yolov5l-seg.pt
AMP: checks passed ✅
optimizer: SGD(lr=0.01) with parameter groups 110 weight(decay=0.0), 116 weight(decay=0.0005), 114 bias
WARNING ⚠️ DP not recommended, use torch.distributed.run for best DDP Multi-GPU results. See Multi-GPU Tutorial at https://github.com/ultralytics/yolov5/issues/475 to get started.
train: Scanning /media/dell/lhx/yolo/ASF-YOLO/datasets/BCC/labels/train.cache... 128 images, 0 backgrounds, 0 corrupt: 100%|██████████| 128/128 00:00
val: Scanning /media/dell/lhx/yolo/ASF-YOLO/datasets/BCC/labels/val.cache... 32 images, 0 backgrounds, 0 corrupt: 100%|██████████| 32/32 00:00

AutoAnchor: 4.36 anchors/target, 0.970 Best Possible Recall (BPR). Anchors are a poor fit to dataset ⚠️, attempting to improve...
AutoAnchor: WARNING ⚠️ Extremely small objects found: 47 of 1235 labels are <3 pixels in size
AutoAnchor: Running kmeans for 9 anchors on 1235 points...
AutoAnchor: Evolving anchors with Genetic Algorithm: fitness = 0.7403: 100%|██████████| 1000/1000 00:00
AutoAnchor: thr=0.25: 0.9571 best possible recall, 6.31 anchors past thr
AutoAnchor: n=9, img_size=640, metric_all=0.391/0.743-mean/best, past_thr=0.495-mean: 25,43, 88,52, 51,155, 92,121, 163,129, 116,183, 236,232, 160,418, 350,452
AutoAnchor: Done ⚠️ (original anchors better than new anchors, proceeding with original anchors)
Plotting labels to ../runs_2/train-seg/improve3/labels.jpg...
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to ../runs_2/train-seg/improve3
Starting training for 100 epochs...

  Epoch    GPU_mem   box_loss   seg_loss   obj_loss   cls_loss  Instances       Size

0%| | 0/16 00:03
Traceback (most recent call last):
  File "/media/dell/lhx/yolo/ASF-YOLO/segment/train.py", line 658, in <module>
    main(opt)
  File "/media/dell/lhx/yolo/ASF-YOLO/segment/train.py", line 554, in main
    train(opt.hyp, opt, device, callbacks)
  File "/media/dell/lhx/yolo/ASF-YOLO/segment/train.py", line 317, in train
    scaler.scale(loss).backward()
  File "/home/leihaoxiang/.conda/envs/yolo/lib/python3.8/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/home/leihaoxiang/.conda/envs/yolo/lib/python3.8/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/home/leihaoxiang/.conda/envs/yolo/lib/python3.8/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: max_pool3d_with_indices_backward_cuda does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation, or you can use the 'warn_only=True' option, if that's acceptable for your application. You can also file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation.

Process finished with exit code 1
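The RuntimeError above suggests turning off determinism "just for this operation". A minimal sketch of that idea follows, assuming the check fires when the backward kernel executes (the traceback points at `scaler.scale(loss).backward()`), so the relaxed region has to wrap the backward call and not only the forward pooling. The helper name `allow_nondeterministic` and the toy tensor are hypothetical and are not part of ASF-YOLO or YOLOv5.

```python
import contextlib

import torch
import torch.nn.functional as F


@contextlib.contextmanager
def allow_nondeterministic():
    """Hypothetical helper: lift the global determinism requirement, then restore it."""
    was_enabled = torch.are_deterministic_algorithms_enabled()
    was_warn_only = torch.is_deterministic_algorithms_warn_only_enabled()
    torch.use_deterministic_algorithms(False)
    try:
        yield
    finally:
        # Restore whatever mode was active before entering the region.
        torch.use_deterministic_algorithms(was_enabled, warn_only=was_warn_only)


device = "cuda" if torch.cuda.is_available() else "cpu"
torch.use_deterministic_algorithms(True)  # strict mode, as in the failing run

# Toy 5-D input (batch, channels, depth, height, width) standing in for real features.
x = torch.randn(1, 2, 4, 8, 8, device=device, requires_grad=True)
loss = F.max_pool3d(x, kernel_size=2).sum()

# Assumption: the error is raised by the backward kernel, so backward() is wrapped.
with allow_nondeterministic():
    loss.backward()

# Strict mode is active again once the region exits.
print(torch.are_deterministic_algorithms_enabled())
```

In a real training loop the same `with` block would surround the existing backward call. The gradients of the 3D max pooling may then differ slightly between runs on CUDA, even with a fixed seed.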

mkang315 commented 1 month ago

Did you read and follow the YOLOv5 Multi-GPU Tutorial?

urban-drummer commented 1 month ago

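"Did you look at and follow the YOLOv5 Multi-GPU Tutorial?" Hello, I followed the multi-GPU training procedure strictly and confirmed with a breakpoint that DDP mode is enabled. I also tried training on a single specified GPU. In both cases I get the same error as above: max_pool3d_with_indices_backward_cuda.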
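The log posted earlier shows `local_rank=-1` together with the "DP not recommended" warning, which in upstream YOLOv5 is printed when several GPUs are visible but the script was not started through `torch.distributed.run`. The sketch below is one way to check which mode a process is in, assuming the launcher exports `LOCAL_RANK` and `WORLD_SIZE` as `torch.distributed.run` does. The function `describe_launch_mode` is hypothetical and only mirrors that logic; it is not taken from the repository.

```python
import os


def describe_launch_mode(env=None, device_arg="0,1,2,3"):
    """Hypothetical helper: guess the multi-GPU mode from launcher environment variables."""
    env = os.environ if env is None else env
    # torch.distributed.run sets LOCAL_RANK per worker; -1 here means "not set".
    local_rank = int(env.get("LOCAL_RANK", -1))
    world_size = int(env.get("WORLD_SIZE", 1))
    n_devices = len([d for d in device_arg.split(",") if d.strip()])

    if local_rank != -1:
        return f"DDP (local_rank={local_rank}, world_size={world_size})"
    if n_devices > 1:
        return "DataParallel (DP)"
    return "single device"


# Plain `python segment/train.py --device 0,1,2,3` leaves LOCAL_RANK unset.
print(describe_launch_mode({}, "0,1,2,3"))
# A worker started by the distributed launcher sees LOCAL_RANK and WORLD_SIZE.
print(describe_launch_mode({"LOCAL_RANK": "0", "WORLD_SIZE": "4"}, "0,1,2,3"))
```

For a DDP run, the upstream tutorial launches training along the lines of `python -m torch.distributed.run --nproc_per_node 4 segment/train.py --device 0,1,2,3` plus the usual arguments. The determinism error appears to be independent of this choice, which matches the report that a single GPU fails in the same way.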

urban-drummer commented 1 month ago

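Thank you for your help. For now I have added torch.use_deterministic_algorithms(True, warn_only=True) just before scaler.scale(loss).backward(), and the model runs, although with warnings. I am still investigating the error; it is most likely a version-compatibility issue.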
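The workaround described above can be sketched as follows, assuming it is acceptable for the unsupported operation to run non-deterministically. The helper `configure_determinism` is hypothetical. Calling it once at startup, and not on every iteration, is a design suggestion and not something the thread confirms.

```python
import torch


def configure_determinism(strict: bool) -> None:
    """Hypothetical helper: request deterministic kernels, strictly or with warnings only."""
    # warn_only=True downgrades the RuntimeError for unsupported ops to a warning.
    torch.use_deterministic_algorithms(True, warn_only=not strict)


# Call once before the training loop; the setting is global and persists.
configure_determinism(strict=False)

print(torch.are_deterministic_algorithms_enabled())           # True
print(torch.is_deterministic_algorithms_warn_only_enabled())  # True
```

In upstream YOLOv5, strict mode is, as far as I know, enabled inside `init_seeds` in `utils/general.py` when training calls it with `deterministic=True`. Whether this repository matches upstream there is an assumption. With `warn_only=True`, operations that do have deterministic implementations still use them, while the 3D max-pooling backward pass may vary slightly from run to run.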

mkang315 commented 1 month ago

Sorry for the inconvenience. If the error persists, you may try copying the contents of our 'models' folder into the 'models' folder of YOLOv5 and adding the import dependencies they need from our code. Our code is based on that of YOLOv5. Thanks!

mkang315 / ASF-YOLO

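Hello, I have also run into the deterministic-algorithms warning that stops the model from running. As with the problem reported in the comments, I am training on four RTX 3090 GPUs with the device set to 0,1,2,3. Specifying a single GPU immediately gives a CUDA error. I would appreciate any advice. #16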