Closed: JinHisAndy closed this issue 1 year ago
👋 Hello @JinHisAndy, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.
If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.
If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.
Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:
git clone https://github.com/ultralytics/yolov5 # clone
cd yolov5
pip install -r requirements.txt # install
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.
We're excited to announce the launch of our latest state-of-the-art (SOTA) object detection model for 2023 - YOLOv8 🚀!
Designed to be fast, accurate, and easy to use, YOLOv8 is an ideal choice for a wide range of object detection, image segmentation and image classification tasks. With YOLOv8, you'll be able to quickly and accurately detect objects in real-time, streamline your workflows, and achieve new levels of accuracy in your projects.
Check out our YOLOv8 Docs for details and get started with:
pip install ultralytics
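For anyone landing here, a minimal Python sketch of the ultralytics package API (the model file, dataset YAML, and image URL below are illustrative placeholders, not part of this issue):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")                                 # load a pretrained YOLOv8n checkpoint
model.train(data="coco128.yaml", epochs=3, imgsz=640)      # quick fine-tune on the small example dataset
results = model("https://ultralytics.com/images/bus.jpg")  # run inference on a sample image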
I also encountered the same problem as you. When I was training YOLOv5 on a 3090, the loss became NaN after a certain number of epochs. The old version of YOLOv5 that I downloaded a year ago did not have this issue.
Hello @szxysdt,
I'm sorry to hear that you're experiencing this issue while training your custom dataset on the latest version of YOLOv5. Could you please provide more details on how you're setting up the training? It would be helpful to know the dataset size, batch size, image size, and learning rate you are using. Additionally, have you tried lowering the learning rate or adjusting the batch size?
Please also note that the latest implementation of YOLOv5 is vastly different from the earlier implementation in terms of architecture and features. It is possible that the issue you are experiencing may not be related to the implementation itself.
Thank you for your patience and I look forward to your response!
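To make those two suggestions concrete, here is a rough sketch (run from the yolov5 repo root) of lowering lr0 and pinning the batch size; the copied hyperparameter file name and the exact values are illustrative assumptions, not a prescribed fix:

import yaml
import train  # yolov5/train.py exposes a run() helper

# Copy the default low-augmentation hyperparameters and reduce the initial learning rate.
with open("data/hyps/hyp.scratch-low.yaml") as f:
    hyp = yaml.safe_load(f)
hyp["lr0"] = 0.001  # down from the 0.01 default
with open("data/hyps/hyp.low-lr.yaml", "w") as f:  # hypothetical copy for this experiment
    yaml.safe_dump(hyp, f)

# Train with a fixed, smaller batch size instead of AutoBatch (batch_size=-1).
train.run(
    data="data/coco.yaml",
    weights="yolov5s.pt",
    imgsz=640,
    epochs=100,
    batch_size=16,
    hyp="data/hyps/hyp.low-lr.yaml",
    device="0",
)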
@glenn-jocher Thank you for taking the time to reply during your busy schedule! Here is some of the console output from my YOLO training run:
train: weights=./runs/train/exp21/weights/best.pt, cfg=, data=data/coco.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=100, batch_size=-1, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=ram, image_weights=False, device=0, multi_scale=False, single_cls=False, optimizer=AdamW, sync_bn=False, workers=16, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=2, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
^C
YOLOv5 π v7.0-166-g54e9515 Python-3.7.0 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce RTX 3090, 24260MiB)
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 🚀 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
from n params module arguments
0 -1 1 5280 models.common.Conv [3, 48, 6, 2, 2]
1 -1 1 41664 models.common.Conv [48, 96, 3, 2]
2 -1 2 65280 models.common.C3 [96, 96, 2]
3 -1 1 166272 models.common.Conv [96, 192, 3, 2]
4 -1 4 444672 models.common.C3 [192, 192, 4]
5 -1 1 664320 models.common.Conv [192, 384, 3, 2]
6 -1 6 2512896 models.common.C3 [384, 384, 6]
7 -1 1 1991808 models.common.Conv [384, 576, 3, 2]
8 -1 2 2327040 models.common.C3 [576, 576, 2]
9 -1 1 3982848 models.common.Conv [576, 768, 3, 2]
10 -1 2 4134912 models.common.C3 [768, 768, 2]
11 -1 1 1476864 models.common.SPPF [768, 768, 5]
12 -1 1 443520 models.common.Conv [768, 576, 1, 1]
13 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
14 [-1, 8] 1 0 models.common.Concat [1]
15 -1 2 2658816 models.common.C3 [1152, 576, 2, False]
16 -1 1 221952 models.common.Conv [576, 384, 1, 1]
17 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
18 [-1, 6] 1 0 models.common.Concat [1]
19 -1 2 1182720 models.common.C3 [768, 384, 2, False]
20 -1 1 74112 models.common.Conv [384, 192, 1, 1]
21 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
22 [-1, 4] 1 0 models.common.Concat [1]
23 -1 2 296448 models.common.C3 [384, 192, 2, False]
24 -1 1 332160 models.common.Conv [192, 192, 3, 2]
25 [-1, 20] 1 0 models.common.Concat [1]
26 -1 2 1035264 models.common.C3 [384, 384, 2, False]
27 -1 1 1327872 models.common.Conv [384, 384, 3, 2]
28 [-1, 16] 1 0 models.common.Concat [1]
29 -1 2 2437632 models.common.C3 [768, 576, 2, False]
30 -1 1 2987136 models.common.Conv [576, 576, 3, 2]
31 [-1, 12] 1 0 models.common.Concat [1]
32 -1 2 4429824 models.common.C3 [1152, 768, 2, False]
33 [23, 26, 29, 32] 1 490620 models.yolo.Detect [80, [[19, 27, 44, 40, 38, 94], [96, 68, 86, 152, 180, 137], [140, 301, 303, 264, 238, 542], [436, 615, 739, 380, 925, 792]], [192, 384, 576, 768]]
Model summary: 379 layers, 35731932 parameters, 35731932 gradients, 50.3 GFLOPs
Transferred 627/627 items from runs/train/exp21/weights/best.pt
AMP: checks passed ✅
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3090) 23.69G total, 0.51G reserved, 0.27G allocated, 22.91G free
Params GFLOPs GPU_mem (GB) forward (ms) backward (ms) input output
35731932 50.28 0.889 165.3 69.7 (1, 3, 640, 640) list
35731932 100.6 1.179 99.13 51.84 (2, 3, 640, 640) list
35731932 201.1 1.801 92.85 55.26 (4, 3, 640, 640) list
35731932 402.3 3.213 92.01 61.32 (8, 3, 640, 640) list
35731932 804.5 5.815 96.23 66.06 (16, 3, 640, 640) list
AutoBatch: Using batch-size 53 for CUDA:0 18.85G/23.69G (80%) ✅
optimizer: AdamW(lr=0.01) with parameter groups 103 weight(decay=0.0), 107 weight(decay=0.0004140625), 107 bias
train: Scanning /szxy-workspace/datasets/coco/train2017.cache... 117266 images, 1021 backgrounds, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]
train: WARNING ⚠️ /szxy-workspace/datasets/coco/images/train2017/000000099844.jpg: 2 duplicate labels removed
train: WARNING ⚠️ /szxy-workspace/datasets/coco/images/train2017/000000201706.jpg: 1 duplicate labels removed
train: WARNING ⚠️ /szxy-workspace/datasets/coco/images/train2017/000000214087.jpg: 1 duplicate labels removed
train: WARNING ⚠️ /szxy-workspace/datasets/coco/images/train2017/000000522365.jpg: 1 duplicate labels removed
train: 95.1GB RAM required, 57.2/62.8GB available, not caching images ⚠️
val: Scanning /szxy-workspace/datasets/coco/val2017.cache... 4952 images, 48 backgrounds, 0 corrupt: 100%|██████████| 5000/5000 [00:00<?, ?it/s]
val: Caching images (4.1GB ram): 100%|██████████| 5000/5000 [00:05<00:00, 942.74it/s]
AutoAnchor: 5.57 anchors/target, 0.996 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Plotting labels to runs/train/exp/labels.jpg...
Image sizes 640 train, 640 val
Using 16 dataloader workers
Logging results to runs/train/exp
Starting training for 100 epochs...
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
0/99 19.6G 0.04325 0.04784 0.02086 524 640: 100%|██████████| 2232/2232 [19:54<00:00, 1.87it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 48/48 [00:47<00:00, 1.02it/s]
all 5000 36335 0.631 0.469 0.513 0.336
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
1/99 17.7G 0.04249 0.04701 0.01966 589 640: 100%|██████████| 2232/2232 [19:41<00:00, 1.89it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 48/48 [00:44<00:00, 1.09it/s]
all 5000 36335 0.63 0.481 0.523 0.347
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
2/99 17.8G 0.04261 0.04698 0.01949 573 640: 100%|██████████| 2232/2232 [19:46<00:00, 1.88it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 48/48 [00:43<00:00, 1.09it/s]
all 5000 36335 0.658 0.486 0.54 0.357
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
3/99 17.8G 0.04274 0.04718 0.01952 658 640: 100%|██████████| 2232/2232 [19:43<00:00, 1.89it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 48/48 [00:42<00:00, 1.12it/s]
all 5000 36335 0.647 0.502 0.547 0.366
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
4/99 17.8G 0.04264 0.04698 0.01929 595 640: 100%|██████████| 2232/2232 [19:36<00:00, 1.90it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 48/48 [00:43<00:00, 1.10it/s]
all 5000 36335 0.647 0.502 0.55 0.368
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
5/99 17.8G nan nan nan 572 640: 100%|██████████| 2232/2232 [19:20<00:00, 1.92it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 48/48 [00:13<00:00, 3.60it/s]
all 5000 36335 0 0 0 0
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
6/99 17.8G nan nan nan 517 640: 100%|██████████| 2232/2232 [18:16<00:00, 2.04it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 48/48 [00:13<00:00, 3.65it/s]
all 5000 36335 0 0 0 0
I am using COCO 2017 for training. Later I will process the dataset and clean out some unnecessary classes before retraining; this run is only meant to verify the training speed and mAP of the YOLO model. However, after training for a few epochs, the loss becomes NaN and the mAP suddenly drops to zero.
After cleaning the dataset, the probability of the loss becoming NaN decreases (the loss may still become NaN, but it returns to normal in the next epoch). My guess is that this was caused by a faulty dataset (since the original faulty dataset was lost, the cause cannot be determined for certain).
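In case it helps others with similar symptoms, here is a rough label sanity-check sketch for YOLO-format .txt labels; the labels directory and class count are assumptions for illustration, not details taken from this thread:

import math
from pathlib import Path

labels_dir = Path("/szxy-workspace/datasets/coco/labels/train2017")  # assumed label dir (images/ -> labels/)
nc = 80  # number of classes expected by the data YAML

for label_file in labels_dir.glob("*.txt"):
    for i, line in enumerate(label_file.read_text().splitlines(), start=1):
        parts = line.split()
        if len(parts) != 5:
            print(f"{label_file.name}:{i} unexpected column count {len(parts)}")
            continue
        cls, x, y, w, h = map(float, parts)
        if not all(math.isfinite(v) for v in (cls, x, y, w, h)):
            print(f"{label_file.name}:{i} non-finite value")
        elif not cls.is_integer() or not 0 <= cls < nc:
            print(f"{label_file.name}:{i} invalid class index {cls}")
        elif not all(0.0 <= v <= 1.0 for v in (x, y, w, h)):
            print(f"{label_file.name}:{i} normalized coordinate outside [0, 1]")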
@szxysdt thank you for sharing your finding that the probability of the loss becoming NaN decreases after cleaning the dataset. It's great to see that the issue can potentially be resolved through dataset cleaning. It's unfortunate that the root cause cannot be confirmed since the original faulty dataset was lost. Nonetheless, we appreciate your efforts in troubleshooting the issue and sharing your insights. If you have any further findings or questions, don't hesitate to let us know.
👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.
For additional resources and information, please see the links below:
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLO 🚀 and Vision AI ⭐
@glenn-jocher Hi, I get NaN during training after a few iterations; it seems something is wrong with DP. At first I set device 0,1 without realizing that this uses DP rather than DDP (v8 will automatically use DDP). I then get a normal training process using a single GPU or DDP.
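For reference, a quick sketch of how one can tell which multi-GPU mode a YOLOv5 run is in, assuming the standard train.py behaviour where DDP is only active when the script is launched via python -m torch.distributed.run (which sets LOCAL_RANK per process):

import os
import torch

# train.py treats LOCAL_RANK == -1 as a single-process run; with --device 0,1 it then
# falls back to nn.DataParallel (DP), the mode suspected above of producing the NaNs.
local_rank = int(os.getenv("LOCAL_RANK", -1))
if local_rank != -1:
    print(f"DDP process, local rank {local_rank}, {torch.cuda.device_count()} visible GPU(s)")
else:
    print("Single-process run: passing multiple --device entries here would use DataParallel (DP)")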
Thank you for sharing your experience, @wcyjerry. It's great to hear that you were able to resolve the issue by setting the appropriate device configuration and using DDP for training. If you encounter any further challenges or have additional feedback, please feel free to share. We appreciate your contributions to the YOLOv5 community!
Search before asking
Question
YOLOv5 v7.0: when training on custom data, the box_loss, obj_loss and cls_loss are "nan"
Additional
No response