ultralytics / yolov5

YOLOv5 πŸš€ in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

YOLOv5 v7.0: when training on custom data, the box_loss, obj_loss and cls_loss are "nan" #11585

Closed JinHisAndy closed 1 year ago

JinHisAndy commented 1 year ago

Search before asking

Question

YOLOv5 v7.0: when training on custom data, the box_loss, obj_loss and cls_loss are "nan".

Additional

No response

github-actions[bot] commented 1 year ago

πŸ‘‹ Hello @JinHisAndy, thank you for your interest in YOLOv5 πŸš€! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a πŸ› Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

YOLOv5 CI

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

Introducing YOLOv8 πŸš€

We're excited to announce the launch of our latest state-of-the-art (SOTA) object detection model for 2023 - YOLOv8 πŸš€!

Designed to be fast, accurate, and easy to use, YOLOv8 is an ideal choice for a wide range of object detection, image segmentation and image classification tasks. With YOLOv8, you'll be able to quickly and accurately detect objects in real-time, streamline your workflows, and achieve new levels of accuracy in your projects.

Check out our YOLOv8 Docs for details and get started with:

pip install ultralytics
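
Once installed, a purely illustrative first run with the yolo CLI might look like the following (the model, dataset, and image path are placeholders rather than a prescribed setup):

yolo detect train data=coco128.yaml model=yolov8n.pt epochs=3 imgsz=640  # briefly fine-tune on a small sample dataset
yolo predict model=yolov8n.pt source=path/to/image.jpg  # run inference on a single image
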
szxysdt commented 1 year ago

I also encountered the same problem. When I was training YOLOv5 on an RTX 3090, the loss became NaN after a certain number of epochs. The old version of YOLOv5 that I downloaded a year ago did not have this issue.

glenn-jocher commented 1 year ago

Hello @szxysdt,

I'm sorry to hear that you're experiencing this issue while training your custom dataset with the latest version of YOLOv5. Could you provide more details on how you're setting up the training? It would be helpful to know the dataset size, batch size, image size, and learning rate you are using. Additionally, have you tried lowering the learning rate or adjusting the batch size?
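
For reference, one purely illustrative way to try this from the command line (the dataset, weights, and hyperparameter file names below are placeholders):

# hyp.custom.yaml would be a copy of data/hyps/hyp.scratch-low.yaml with lr0 lowered, e.g. 0.01 -> 0.001
python train.py --data data/coco.yaml --weights yolov5s.pt --img 640 --batch-size 16 --hyp data/hyps/hyp.custom.yaml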

Please also note that the latest implementation of YOLOv5 is vastly different from the earlier implementation in terms of architecture and features. It is possible that the issue you are experiencing may not be related to the implementation itself.

Thank you for your patience and I look forward to your response!

szxysdt commented 1 year ago

@glenn-jocher Thank you for taking the time to reply during your busy schedule πŸ˜€ Here is some of the console output from when I train the YOLO model:

train: weights=./runs/train/exp21/weights/best.pt, cfg=, data=data/coco.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=100, batch_size=-1, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=ram, image_weights=False, device=0, multi_scale=False, single_cls=False, optimizer=AdamW, sync_bn=False, workers=16, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=2, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
^C
YOLOv5 πŸš€ v7.0-166-g54e9515 Python-3.7.0 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce RTX 3090, 24260MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 πŸš€ in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 πŸš€ runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

                 from  n    params  module                                  arguments                     
  0                -1  1      5280  models.common.Conv                      [3, 48, 6, 2, 2]              
  1                -1  1     41664  models.common.Conv                      [48, 96, 3, 2]                
  2                -1  2     65280  models.common.C3                        [96, 96, 2]                   
  3                -1  1    166272  models.common.Conv                      [96, 192, 3, 2]               
  4                -1  4    444672  models.common.C3                        [192, 192, 4]                 
  5                -1  1    664320  models.common.Conv                      [192, 384, 3, 2]              
  6                -1  6   2512896  models.common.C3                        [384, 384, 6]                 
  7                -1  1   1991808  models.common.Conv                      [384, 576, 3, 2]              
  8                -1  2   2327040  models.common.C3                        [576, 576, 2]                 
  9                -1  1   3982848  models.common.Conv                      [576, 768, 3, 2]              
 10                -1  2   4134912  models.common.C3                        [768, 768, 2]                 
 11                -1  1   1476864  models.common.SPPF                      [768, 768, 5]                 
 12                -1  1    443520  models.common.Conv                      [768, 576, 1, 1]              
 13                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 14           [-1, 8]  1         0  models.common.Concat                    [1]                           
 15                -1  2   2658816  models.common.C3                        [1152, 576, 2, False]         
 16                -1  1    221952  models.common.Conv                      [576, 384, 1, 1]              
 17                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 18           [-1, 6]  1         0  models.common.Concat                    [1]                           
 19                -1  2   1182720  models.common.C3                        [768, 384, 2, False]          
 20                -1  1     74112  models.common.Conv                      [384, 192, 1, 1]              
 21                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 22           [-1, 4]  1         0  models.common.Concat                    [1]                           
 23                -1  2    296448  models.common.C3                        [384, 192, 2, False]          
 24                -1  1    332160  models.common.Conv                      [192, 192, 3, 2]              
 25          [-1, 20]  1         0  models.common.Concat                    [1]                           
 26                -1  2   1035264  models.common.C3                        [384, 384, 2, False]          
 27                -1  1   1327872  models.common.Conv                      [384, 384, 3, 2]              
 28          [-1, 16]  1         0  models.common.Concat                    [1]                           
 29                -1  2   2437632  models.common.C3                        [768, 576, 2, False]          
 30                -1  1   2987136  models.common.Conv                      [576, 576, 3, 2]              
 31          [-1, 12]  1         0  models.common.Concat                    [1]                           
 32                -1  2   4429824  models.common.C3                        [1152, 768, 2, False]         
 33  [23, 26, 29, 32]  1    490620  models.yolo.Detect                      [80, [[19, 27, 44, 40, 38, 94], [96, 68, 86, 152, 180, 137], [140, 301, 303, 264, 238, 542], [436, 615, 739, 380, 925, 792]], [192, 384, 576, 768]]
Model summary: 379 layers, 35731932 parameters, 35731932 gradients, 50.3 GFLOPs

Transferred 627/627 items from runs/train/exp21/weights/best.pt
AMP: checks passed βœ…
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3090) 23.69G total, 0.51G reserved, 0.27G allocated, 22.91G free
      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
    35731932       50.28         0.889         165.3          69.7        (1, 3, 640, 640)                    list
    35731932       100.6         1.179         99.13         51.84        (2, 3, 640, 640)                    list
    35731932       201.1         1.801         92.85         55.26        (4, 3, 640, 640)                    list
    35731932       402.3         3.213         92.01         61.32        (8, 3, 640, 640)                    list
    35731932       804.5         5.815         96.23         66.06       (16, 3, 640, 640)                    list
AutoBatch: Using batch-size 53 for CUDA:0 18.85G/23.69G (80%) βœ…
optimizer: AdamW(lr=0.01) with parameter groups 103 weight(decay=0.0), 107 weight(decay=0.0004140625), 107 bias
train: Scanning /szxy-workspace/datasets/coco/train2017.cache... 117266 images, 1021 backgrounds, 0 corrupt: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 118287/118287 [00:00<?, ?it/s]
train: WARNING ⚠️ /szxy-workspace/datasets/coco/images/train2017/000000099844.jpg: 2 duplicate labels removed
train: WARNING ⚠️ /szxy-workspace/datasets/coco/images/train2017/000000201706.jpg: 1 duplicate labels removed
train: WARNING ⚠️ /szxy-workspace/datasets/coco/images/train2017/000000214087.jpg: 1 duplicate labels removed
train: WARNING ⚠️ /szxy-workspace/datasets/coco/images/train2017/000000522365.jpg: 1 duplicate labels removed
train: 95.1GB RAM required, 57.2/62.8GB available, not caching images ⚠️ 
val: Scanning /szxy-workspace/datasets/coco/val2017.cache... 4952 images, 48 backgrounds, 0 corrupt: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 5000/5000 [00:00<?, ?it/s]
val: Caching images (4.1GB ram): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 5000/5000 [00:05<00:00, 942.74it/s] 

AutoAnchor: 5.57 anchors/target, 0.996 Best Possible Recall (BPR). Current anchors are a good fit to dataset βœ…
Plotting labels to runs/train/exp/labels.jpg... 
Image sizes 640 train, 640 val
Using 16 dataloader workers
Logging results to runs/train/exp
Starting training for 100 epochs...

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       0/99      19.6G    0.04325    0.04784    0.02086        524        640: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2232/2232 [19:54<00:00,  1.87it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 48/48 [00:47<00:00,  1.02it/s]
                   all       5000      36335      0.631      0.469      0.513      0.336

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       1/99      17.7G    0.04249    0.04701    0.01966        589        640: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2232/2232 [19:41<00:00,  1.89it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 48/48 [00:44<00:00,  1.09it/s]
                   all       5000      36335       0.63      0.481      0.523      0.347

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       2/99      17.8G    0.04261    0.04698    0.01949        573        640: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2232/2232 [19:46<00:00,  1.88it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 48/48 [00:43<00:00,  1.09it/s]
                   all       5000      36335      0.658      0.486       0.54      0.357

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       3/99      17.8G    0.04274    0.04718    0.01952        658        640: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2232/2232 [19:43<00:00,  1.89it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 48/48 [00:42<00:00,  1.12it/s]
                   all       5000      36335      0.647      0.502      0.547      0.366

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       4/99      17.8G    0.04264    0.04698    0.01929        595        640: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2232/2232 [19:36<00:00,  1.90it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 48/48 [00:43<00:00,  1.10it/s]
                   all       5000      36335      0.647      0.502       0.55      0.368

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       5/99      17.8G        nan        nan        nan        572        640: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2232/2232 [19:20<00:00,  1.92it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 48/48 [00:13<00:00,  3.60it/s]
                   all       5000      36335          0          0          0          0

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       6/99      17.8G        nan        nan        nan        517        640: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2232/2232 [18:16<00:00,  2.04it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 48/48 [00:13<00:00,  3.65it/s]
                   all       5000      36335          0          0          0          0

I am using COCO 2017 for training. Later I will process the dataset and clean out some unnecessary classes before retraining; this run is only meant to verify the training speed and mAP of the YOLO model. However, after training for a few epochs, the loss becomes NaN and the mAP suddenly drops to zero.

szxysdt commented 1 year ago

After cleaning the dataset, the probability of the loss becoming NaN decreased (the loss may still become NaN occasionally, but it recovers in the next epoch). My guess is that the problem was caused by faulty entries in the dataset; since the original faulty dataset has been lost, the exact cause cannot be determined.
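
For anyone hitting the same thing, a rough first-pass scan of YOLO-format label files for malformed lines (wrong field count, or normalized coordinates outside [0, 1]) can be done with standard shell tools; the labels path below is a placeholder:

find datasets/coco/labels/train2017 -name '*.txt' -print0 | xargs -0 awk '
  NF != 5 {print FILENAME ": bad field count: " $0; next}
  $2 < 0 || $2 > 1 || $3 < 0 || $3 > 1 || $4 < 0 || $4 > 1 || $5 < 0 || $5 > 1 {print FILENAME ": coords out of range: " $0}'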

glenn-jocher commented 1 year ago

@szxysdt thank you for sharing your finding that the probability of the loss becoming NaN decreases after cleaning the dataset. It's great to see that the issue can potentially be resolved through dataset cleaning. It's unfortunate that the exact cause cannot be confirmed since the original faulty dataset was lost, but we appreciate your efforts in troubleshooting the issue and sharing your insights. If you have any further findings or questions, don't hesitate to let us know.

github-actions[bot] commented 1 year ago

πŸ‘‹ Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO πŸš€ and Vision AI ⭐

wcyjerry commented 1 year ago

@glenn-jocher Hi, I get NaN during training after a few iterations; it seems something goes wrong with DP (DataParallel). At first I set --device 0,1 without realizing that this runs DP rather than DDP (YOLOv8 automatically uses DDP). Training proceeds normally when I use a single GPU or DDP.
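
For reference, YOLOv5 multi-GPU DDP training is launched through torch.distributed.run rather than by passing --device 0,1 alone; an illustrative two-GPU invocation (dataset, weights, and batch size are placeholders):

python -m torch.distributed.run --nproc_per_node 2 train.py --batch-size 64 --data coco.yaml --weights yolov5s.pt --device 0,1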

glenn-jocher commented 11 months ago

Thank you for sharing your experience, @wcyjerry. It's great to hear that you were able to resolve the issue by setting the appropriate device configuration and using DDP for training. If you encounter any further challenges or have additional feedback, please feel free to share. We appreciate your contributions to the YOLOv5 community!