the training is so lowly

Linranran commented 3 years ago

❔Question

Training is very slow, and GPU utilization stays at 0 for long periods.

Additional context

train: weights=weights/yolov5s.pt, cfg=, data=train_dataset/phone_screen.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=300, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=1, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, entity=None, name=parts4top5, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias=latest, local_rank=-1, freeze=0
github: skipping check (not a git repository), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 2021-8-5 torch 1.8.1+cu102 CUDA:1 (Tesla T4, 15109.75MB)

hyperparameters: lr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs (RECOMMENDED)
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
Overriding model.yaml nc=80 with nc=5

                 from  n    params  module                                  arguments
  0                -1  1      3520  models.common.Focus                     [3, 32, 3]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  2                -1  1     18816  models.common.C3                        [64, 64, 1]
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  4                -1  1    156928  models.common.C3                        [128, 128, 3]
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  6                -1  1    625152  models.common.C3                        [256, 256, 3]
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  8                -1  1    656896  models.common.SPP                       [512, 512, [5, 9, 13]]
  9                -1  1   1182720  models.common.C3                        [512, 512, 1, False]
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                    [1]
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]
 24      [17, 20, 23]  1     26970  models.yolo.Detect                      [5, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 283 layers, 7074330 parameters, 7074330 gradients, 16.4 GFLOPs
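
As a quick sanity check on the model summary, the per-layer `params` values can be summed and compared with the reported total of 7074330. The snippet below is only an illustrative sketch: the list is copied by hand from the summary (layers 0 to 24), and the Detect-head formula assumes 3 anchors per scale and one 1x1 convolution with bias per scale, which is consistent with the anchor lists shown but is not stated in the log itself.

```python
# Per-layer parameter counts copied from the model summary (layers 0-24).
layer_params = [
    3520, 18560, 18816, 73984, 156928, 295424, 625152, 1180672,
    656896, 1182720, 131584, 0, 0, 361984, 33024, 0, 0, 90880,
    147712, 0, 296448, 590336, 0, 1182720, 26970,
]
total = sum(layer_params)

# Detect head (assumption: one 1x1 conv with bias per scale, 3 anchors per scale).
num_classes = 5                              # "Overriding model.yaml nc=80 with nc=5"
num_anchors = 3                              # each anchor list holds 3 (w, h) pairs
outputs = num_anchors * (num_classes + 5)    # 4 box coords + 1 objectness + classes
in_channels = [128, 256, 512]                # last argument of the Detect row
detect_params = sum(c * outputs + outputs for c in in_channels)

print("sum of layer params:", total)
print("Detect head params:", detect_params)
```

Both figures agree with the log (7074330 in total and 26970 for the Detect row), so the model itself is the standard YOLOv5s with a 5-class head and is unlikely to explain the slowdown.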

Transferred 356/362 items from weights/yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 59 weight, 62 weight (no decay), 62 bias
train: Scanning '/home/algo/linranran/CODE/phone_screen/train_dataset/part4_top5_train.cache' images and labels... 1563 found, 2236 missing, 0 empty, 1 corrupted: 100%|██████████| 3800/3800 [00:00<?, ?it/s]
train: WARNING: Ignoring corrupted image and/or label /home/algo/linranran/datasets/phone_screen_aug/parts/1/images/white_353052090117114-16-20-52.jpg: cannot identify image file '/home/algo/linranran/datasets/phone_screen_aug/parts/1/images/white_353052090117114-16-20-52.jpg'
train: Scanning '/home/algo/linranran/CODE/phone_screen/train_dataset/part4_top5_train.cache' images and labels... 1563 found, 2236 missing, 0 empty, 1 corrupted: 100%|██████████| 3800/3800 [00:00<?, ?it/s]
OMP: Info #155: KMP_AFFINITY: Initial OS proc set respected: 0-39
OMP: Info #216: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #157: KMP_AFFINITY: 40 available OS procs
OMP: Info #158: KMP_AFFINITY: Uniform topology
OMP: Info #287: KMP_AFFINITY: topology layer "LL cache" is equivalent to "socket".
OMP: Info #287: KMP_AFFINITY: topology layer "L3 cache" is equivalent to "socket".
OMP: Info #287: KMP_AFFINITY: topology layer "L2 cache" is equivalent to "core".
OMP: Info #287: KMP_AFFINITY: topology layer "L1 cache" is equivalent to "core".
OMP: Info #192: KMP_AFFINITY: 2 sockets x 10 cores/socket x 2 threads/core (20 total cores)
OMP: Info #218: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #172: KMP_AFFINITY: OS proc 0 maps to socket 0 core 0 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 20 maps to socket 0 core 0 thread 1
OMP: Info #172: KMP_AFFINITY: OS proc 1 maps to socket 0 core 1 thread 0

OMP: Info #172: KMP_AFFINITY: OS proc 37 maps to socket 1 core 10 thread 1
OMP: Info #172: KMP_AFFINITY: OS proc 18 maps to socket 1 core 11 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 38 maps to socket 1 core 11 thread 1
OMP: Info #172: KMP_AFFINITY: OS proc 19 maps to socket 1 core 12 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 39 maps to socket 1 core 12 thread 1
OMP: Info #254: KMP_AFFINITY: pid 113594 tid 113594 thread 0 bound to OS proc set 0

autoanchor: Analyzing anchors... anchors/target = 3.69, Best Possible Recall (BPR) = 0.9985
Image sizes 640 train, 640 val
Using 16 dataloader workers
Logging results to runs/train/parts4top513
Starting training for 300 epochs...

 Epoch   gpu_mem       box       obj       cls    labels  img_size

val: Scanning '/home/algo/linranran/CODE/phone_screen/train_dataset/part4_top5_val.cache' images and labels... 77 found, 122 missing, 0 empty, 0 corrupted: 100%|██████████| 199/199 [00:47<?, ?it/s]

[Attached images: 1628330554(1), 1628330598(1)]
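
To put a number on "GPU utilization stays at 0", one option is to poll `nvidia-smi` while training runs. The sketch below is not part of YOLOv5: the helper names `parse_gpu_utilization` and `sample_gpu_utilization` are made up for illustration, and it assumes `nvidia-smi` is on the PATH and reports numeric values for every field. The sample string at the bottom is invented, not taken from this run.

```python
import subprocess

# Query index, GPU utilization (%) and used memory (MiB) as bare CSV rows.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_utilization(csv_text):
    """Parse nvidia-smi CSV rows into {gpu_index: (util_percent, mem_used_mib)}."""
    stats = {}
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats[int(index)] = (int(util), int(mem))
    return stats


def sample_gpu_utilization():
    """Run nvidia-smi once and return the parsed statistics (needs an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_utilization(out)


# Hypothetical sample output for a two-GPU machine; not measured on this system.
sample = "0, 97, 10240\n1, 0, 1530\n"
print(parse_gpu_utilization(sample))
```

Calling `sample_gpu_utilization()` every second or so during an epoch would show whether the GPU selected with `device=1` is idle throughout or only between batches. Long idle stretches while CPU cores are busy would point toward data loading rather than the GPU as the limiting factor.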

glenn-jocher commented 3 years ago

@Linranran 👋 Hello! Thanks for asking about training speed issues. YOLOv5 🚀 can be trained on CPU (slowest), single-GPU, or multi-GPU (fastest). If you would like to increase your training speed, some options are:

Increase --batch-size
Reduce --img-size
Reduce model size, e.g. from YOLOv5x -> YOLOv5l -> YOLOv5m -> YOLOv5s
Train with multi-GPU DDP at larger --batch-size
Train with a cached dataset: python train.py --cache
Train on faster GPUs, e.g. P100 -> V100 -> A100
Train on free GPU backends with up to 16GB of CUDA memory:
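
Several of the options listed above map directly onto `train.py` arguments that also appear in the log from the original post (`batch_size`, `imgsz`, `cache`, `workers`, `device`). The sketch below assembles such a command line. The helper `build_train_command` and the specific values are illustrative assumptions, not settings recommended in this thread; in particular, the largest batch size that fits depends on the GPU's memory.

```python
import shlex


def build_train_command(data, weights="yolov5s.pt", batch_size=32, img_size=640,
                        cache=True, device="0", workers=8):
    """Return a YOLOv5 train.py command as an argument list (hypothetical helper)."""
    cmd = [
        "python", "train.py",
        "--data", data,
        "--weights", weights,
        "--batch-size", str(batch_size),  # larger batches usually keep the GPU busier
        "--img-size", str(img_size),      # smaller images are cheaper to load and process
        "--device", device,
        "--workers", str(workers),        # dataloader worker processes
    ]
    if cache:
        cmd.append("--cache")             # cache images instead of re-reading them each epoch
    return cmd


# Values below reuse the paths from the original post; the batch size is an assumption.
cmd = build_train_command(
    "train_dataset/phone_screen.yaml",
    weights="weights/yolov5s.pt",
    device="1",
)
print(shlex.join(cmd))
```

The original run used `cache=None`, so every epoch re-reads and decodes images from disk. Adding `--cache` is probably the cheapest of these options to try first when the GPU sits idle.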

github-actions[bot] commented 3 years ago

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Access additional YOLOv5 🚀 resources:

Wiki – https://github.com/ultralytics/yolov5/wiki
Tutorials – https://docs.ultralytics.com/yolov5
Docs – https://docs.ultralytics.com

Access additional Ultralytics ⚡ resources:

Ultralytics HUB – https://ultralytics.com
Vision API – https://ultralytics.com/yolov5
About Us – https://ultralytics.com/about
Join Our Team – https://ultralytics.com/work
Contact Us – https://ultralytics.com/contact

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

LegendSun0 commented 2 years ago

❔Question

Training is very slow, and GPU utilization stays at 0 for long periods.

Additional context

[Quoted training log, machine-translated into Chinese; its content is identical to the log in the original post above.]

Have you solved the problem?

ultralytics / yolov5

the training is so lowly #4333
