ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

WARNING: non-finite loss, ending training tensor([nan, nan, nan, nan], device='cuda:0') #842

Closed · alontrais closed this issue 4 years ago

alontrais commented 4 years ago

I cloned the newest version. When I run the train script I get this warning: WARNING: non-finite loss, ending training tensor([nan, nan, nan, nan], device='cuda:0')

Caching labels (20 found, 0 missing, 0 empty, 0 duplicate, for 20 images): 100%|█████████████████████| 20/20 [00:00<00:00, 465.89it/s]
Caching labels (10 found, 0 missing, 0 empty, 0 duplicate, for 10 images): 100%|█████████████████████| 10/10 [00:00<00:00, 468.26it/s]
Model Summary: 225 layers, 6.25787e+07 parameters, 6.25787e+07 gradients
Using 0 dataloader workers
Starting training for 2730 epochs...

 Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
0/2729     14.9G      8.15   1.4e+03       111  1.52e+03       982   2.5e+03: 100%|███████████████| 20/20 [02:39<00:00,  7.95s/it]
           Class    Images   Targets         P         R   mAP@0.5        F1: 100%|█████████████████| 5/5 [00:32<00:00,  6.52s/it]
             all        10  1.08e+04         0         0         0         0

 Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
1/2729     14.9G      7.21  1.23e+03      82.3  1.32e+03  1.42e+03   2.5e+03: 100%|███████████████| 20/20 [02:14<00:00,  6.71s/it]
           Class    Images   Targets         P         R   mAP@0.5        F1: 100%|█████████████████| 5/5 [00:11<00:00,  2.24s/it]
             all        10  1.08e+04         0         0         0         0

 Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
2/2729     14.9G       7.7  5.06e+04      27.3  5.06e+04  1.31e+03   2.5e+03: 100%|███████████████| 20/20 [02:14<00:00,  6.73s/it]
           Class    Images   Targets         P         R   mAP@0.5        F1: 100%|█████████████████| 5/5 [00:18<00:00,  3.66s/it]
             all        10  1.08e+04   0.00031    0.0185  1.09e-05  0.000611

Model Bias Summary:
    layer    regression       objectness        classification
       89    0.00+/-0.02      -5.43+/-0.01      -5.00+/-0.01
      101   -0.01+/-0.02      -5.83+/-0.05      -4.97+/-0.06
      113   -0.02+/-0.10     -17.67+/-22.01     -2.75+/-2.99

 Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
3/2729     14.9G      8.58  5.33e+04      24.8  5.33e+04  1.63e+03   2.5e+03: 100%|███████████████| 20/20 [02:14<00:00,  6.72s/it]
           Class    Images   Targets         P         R   mAP@0.5        F1: 100%|█████████████████| 5/5 [00:11<00:00,  2.21s/it]
             all        10  1.08e+04         0         0         0         0

 Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
4/2729     14.9G      9.05 -7.83e+03      31.4 -7.79e+03  1.24e+03   2.5e+03: 100%|███████████████| 20/20 [02:15<00:00,  6.79s/it]
           Class    Images   Targets         P         R   mAP@0.5        F1: 100%|█████████████████| 5/5 [00:11<00:00,  2.22s/it]
             all        10  1.08e+04         0         0         0         0

 Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
5/2729     14.9G      15.1  -3.9e+06  1.19e+03  -3.9e+06  1.74e+03   2.5e+03:  85%|████████████▊  | 17/20 [01:53<00:20,  6.81s/it]WARNING: non-finite loss, ending training  tensor([nan, nan, nan, nan], device='cuda:0')
5/2729     14.9G      15.1  -3.9e+06  1.19e+03  -3.9e+06  1.74e+03   2.5e+03:  85%|████████████▊  | 17/20 [01:57<00:20,  6.92s/it]

When I ran the previous version a month ago, I didn't get this warning.

What could be the problem?

glenn-jocher commented 4 years ago

Hello, thank you for your interest in our work! This issue seems to lack the minimum requirements for a proper response, or is insufficiently detailed for us to help you. In general you can use https://github.com/ultralytics/yolov3/compare to display differences between repository versions.

Please note that most technical problems are due to:

If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!

glenn-jocher commented 4 years ago

@alontrais ah one thing pops out from your terminal output. Your images are very large, which is not a use case we have tested. The current objectness loss construction is probably causing this. You can try red = 'mean' on L372 when building your loss, which may help. https://github.com/ultralytics/yolov3/blob/0958d81580a8a0086ac3326feeba4f6db20b70a5/utils/utils.py#L366-L373
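
For reference, a minimal sketch of what that change amounts to (the variable names and hyperparameter values below are illustrative, not a verbatim copy of utils.py): the BCE criteria are built with a shared reduction setting, and switching it from 'sum' to 'mean' keeps the obj/cls losses from scaling with the number of grid cells produced by very large images.

import torch
import torch.nn as nn

# Illustrative sketch only, not verbatim utils.py: build the BCE criteria with
# reduction='mean' instead of 'sum', as suggested above.
h = {'cls_pw': 1.0, 'obj_pw': 1.0}  # assumed positive-weight hyperparameters
red = 'mean'  # loss reduction: 'sum' or 'mean'

BCEcls = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([h['cls_pw']]), reduction=red)
BCEobj = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([h['obj_pw']]), reduction=red)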

glenn-jocher commented 4 years ago

@alontrais also, irrespective of the loss reduction, your obj loss is simply far too large. The 3 loss components should be more or less evenly balanced, so you can manually adjust (lower, in your case) your objectness loss hyperparameter on L28 to compensate: https://github.com/ultralytics/yolov3/blob/0958d81580a8a0086ac3326feeba4f6db20b70a5/train.py#L23-L43
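
As a rough illustration of that manual rebalancing (the keys mirror train.py's hyp dict, but the numbers below are assumptions, not the repo defaults):

# Hedged sketch: if the obj column dwarfs the GIoU and cls columns in the
# printed loss table, scale its gain down until the three are roughly comparable.
hyp = {'giou': 3.5,   # GIoU loss gain (illustrative value)
       'cls': 37.0,   # cls loss gain (illustrative value)
       'obj': 64.0}   # obj loss gain (illustrative value)

hyp['obj'] *= 0.1  # e.g. lower the objectness gain by 10x and re-check the balance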

qwe3208620 commented 4 years ago

I encountered the same problem. It happened after I changed the learning rate decay function; the hyperparameters suddenly increased during training.

glenn-jocher commented 4 years ago

@qwe3208620 the hyperparameters are static during training except for the LR (and momentum if you use SGD). The LR scheduler modifies the LR, and a prebiasing step for the first 3 epochs increases bias-only LR and momentum.
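
For readers unfamiliar with the setup, a small hedged sketch of the kind of schedule involved (a generic cosine LambdaLR, not the repo's exact code): only the LR changes between epochs; the loss gains and other hyperparameters stay fixed.

import math
import torch

# Generic example, not the repo's exact scheduler: a cosine LR decay attached to SGD.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
epochs = 300
lf = lambda x: (1 + math.cos(x * math.pi / epochs)) / 2  # decays 1 -> 0
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lf)

for epoch in range(epochs):
    optimizer.step()   # (stands in for a full epoch of training steps)
    scheduler.step()   # decays the LR; all other hyperparameters stay fixed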

guancheng817 commented 4 years ago

Hi, I encountered the same problem when training my custom dataset, shown in the following picture. I lowered the GIoU weight and the LR, but it didn't work.

image

glenn-jocher commented 4 years ago

@guancheng817 you might want to disable apex mixed precision training if you are using it.
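
For anyone unsure how to do that, a rough sketch of the toggle (the flag name follows the train.py convention, but treat the details as an assumption):

# Sketch: force mixed precision off so apex/amp can be ruled out as the NaN source.
mixed_precision = False
if mixed_precision:
    try:
        from apex import amp  # optional NVIDIA apex dependency
    except ImportError:
        mixed_precision = False  # fall back to full FP32 training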

yangxu351 commented 4 years ago

I encountered this problem too. I'm wondering if it's related to the empty label files or zero-filled label files? @glenn-jocher

29/179     9.72G      5.06     0.906         0      5.97         5       608
29/179     9.72G      5.05      0.91         0      5.96        11       608
WARNING: non-finite loss, ending training tensor([nan, nan, 0., nan], device='cuda:0')

glenn-jocher commented 4 years ago

@yangxu351 no, definitely not. If there are no labels, then there is no loss other than obj loss.

If it occurs repeatably, perhaps you could print the loss components and trace the origin? Your message says it occurs in the first two only, with the last loss being zero? The 3 are box, obj and cls.

So that means it's probably originating in the GIoU calculation, which feeds the obj loss as well.
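
A minimal debugging sketch along those lines (names are illustrative, not compute_loss itself): stack the three components and check them with torch.isfinite before they are summed, so the first term to blow up is visible.

import torch

def check_loss_items(lbox, lobj, lcls):
    # Stack box/obj/cls so a single isfinite() call flags the offending term.
    loss_items = torch.stack((lbox, lobj, lcls))
    if not torch.isfinite(loss_items).all():
        print('non-finite loss components (box, obj, cls):', loss_items)
    return loss_items

# Example with a deliberately broken box term:
check_loss_items(torch.tensor(float('nan')), torch.tensor(1.2), torch.tensor(0.3))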

glenn-jocher commented 4 years ago

The GIoU computation is in utils.py. It has protected divides in place though, so I don't understand how it could generate a NaN:

import torch  # import added so the excerpt runs standalone


def bbox_iou(box1, box2, x1y1x2y2=True, GIoU=False, DIoU=False, CIoU=False):
    # Returns the IoU of box1 to box2. box1 is 4, box2 is nx4
    box2 = box2.t()

    # Get the coordinates of bounding boxes
    if x1y1x2y2:  # x1, y1, x2, y2 = box1
        b1_x1, b1_y1, b1_x2, b1_y2 = box1[0], box1[1], box1[2], box1[3]
        b2_x1, b2_y1, b2_x2, b2_y2 = box2[0], box2[1], box2[2], box2[3]
    else:  # transform from xywh to xyxy
        b1_x1, b1_x2 = box1[0] - box1[2] / 2, box1[0] + box1[2] / 2
        b1_y1, b1_y2 = box1[1] - box1[3] / 2, box1[1] + box1[3] / 2
        b2_x1, b2_x2 = box2[0] - box2[2] / 2, box2[0] + box2[2] / 2
        b2_y1, b2_y2 = box2[1] - box2[3] / 2, box2[1] + box2[3] / 2

    # Intersection area
    inter = (torch.min(b1_x2, b2_x2) - torch.max(b1_x1, b2_x1)).clamp(0) * \
            (torch.min(b1_y2, b2_y2) - torch.max(b1_y1, b2_y1)).clamp(0)

    # Union Area
    w1, h1 = b1_x2 - b1_x1, b1_y2 - b1_y1
    w2, h2 = b2_x2 - b2_x1, b2_y2 - b2_y1
    union = (w1 * h1 + 1e-16) + w2 * h2 - inter

    iou = inter / union  # iou
    if GIoU or DIoU or CIoU:
        cw = torch.max(b1_x2, b2_x2) - torch.min(b1_x1, b2_x1)  # convex (smallest enclosing box) width
        ch = torch.max(b1_y2, b2_y2) - torch.min(b1_y1, b2_y1)  # convex height
        if GIoU:  # Generalized IoU https://arxiv.org/pdf/1902.09630.pdf
            c_area = cw * ch + 1e-16  # convex area
            return iou - (c_area - union) / c_area  # GIoU
        # (DIoU / CIoU branches of the original function omitted from this excerpt)

    return iou
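
One hedged observation on the excerpt above: even with the protected divides, the function will return NaN if the incoming predictions are already non-finite (for example after the optimizer diverges), since the NaN simply propagates through the arithmetic. A quick check, assuming the bbox_iou definition above is in scope:

# Assumes the bbox_iou excerpt above has been defined/imported.
pbox = torch.tensor([[float('nan'), 0.5, 0.2, 0.2]])  # diverged prediction (xywh)
tbox = torch.tensor([[0.5, 0.5, 0.2, 0.2]])           # valid target (xywh)
print(bbox_iou(pbox.t(), tbox, x1y1x2y2=False, GIoU=True))  # -> tensor([nan])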

yangxu351 commented 4 years ago

@glenn-jocher I noticed that all empty label files are filled with zeros in load_mosaic via labels = np.zeros((0, 5), dtype=np.float32). I still get WARNING: non-finite loss, ending training tensor([nan, nan, 0., nan], device='cuda:0'). For a single-class task lcls is 0, but I cannot figure out why the other losses become NaN.

glenn-jocher commented 4 years ago

@yangxu351 no, those are not zeros, those are placeholders for the concatenation operation. A (0, 5) array has no values.
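
A quick way to see what is meant (a standalone numpy check, not repo code):

import numpy as np

# np.zeros((0, 5)) is an empty placeholder, not a row of zero-valued labels:
empty = np.zeros((0, 5), dtype=np.float32)
print(empty.size)  # 0 -> the array holds no label values at all
real = np.array([[0, 0.5, 0.5, 0.2, 0.2]], dtype=np.float32)
print(np.concatenate((empty, real), axis=0).shape)  # (1, 5) -> concatenation is a no-op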

yangxu351 commented 4 years ago

@glenn-jocher Thanks for your reply. I'm still struggling with the NaN losses. If the label file is empty, tbox[i] is also empty when computing giou = bbox_iou(pbox.t(), tbox[i], x1y1x2y2=False, GIoU=True). Is it OK if I set giou = 0 when tbox[i] is empty?

glenn-jocher commented 4 years ago

@yangxu351 giou loss is only generated for labels. If there is no label there is no GIoU loss, only obj loss.

rinabuoy commented 4 years ago

I encountered the same issue.

glenn-jocher commented 4 years ago

@rinabuoy can you reliably reproduce the issue, and if so send us code? Thanks!

rinabuoy commented 4 years ago

I just pulled the latest version and trained a single-class detector. Whether I set mixed_precision = True or not, this error appears.

image

glenn-jocher commented 4 years ago

@rinabuoy thanks! Do you have a colab notebook that you could share so we can run to reproduce the error?

The error is caused by GIoU returning NaN, though we can't reproduce it on COCO, so it's very hard, if not impossible, to debug.

rinabuoy commented 4 years ago

Ok here you go

https://drive.google.com/drive/folders/12aNIVPz3RFUpTbmniwdLkb7gcMh22DLt?usp=sharing

Thank you

yangxu351 commented 4 years ago

When I changed "giou" to 1.0 in hyp, the NaN loss problem no longer occurred.

rinabuoy commented 4 years ago

With "giou" = 1.0, it ran to epoch 40 and is still running. I'll see if that error pops up in a later epoch.

Thanks @yangxu351

yangxu351 commented 4 years ago

No, it seems fixed.

lampsonSong commented 4 years ago

I met the same problem, but changing hyp['giou'] to 1.0 did not work for me. I finally found it is related to momentum when using SGD. With plain SGD (momentum=0.0), my code runs correctly. I cannot figure out why; there is no division in the momentum computation at all.
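
For completeness, a tiny sketch of the workaround lampsonSong describes (the model and learning rate below are placeholders): plain SGD with momentum disabled.

import torch

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.0)  # pure SGD, no momentum buffer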

glenn-jocher commented 4 years ago

@lampsonSong NaN originates in the GIoU computation somehow, but I have not been able to reproduce it, and again there are no divides by zero in GIoU. It should be stable.

github-actions[bot] commented 4 years ago

This issue is stale because it has been open 30 days with no activity. Remove Stale label or comment or this will be closed in 5 days.

kthomas441 commented 4 years ago

Hey, I'm getting this error while training on a custom dataset. I've trained similar datasets before and haven't received this error. I have not changed any hyperparameters since then.

Command I ran:

python train.py --data data/just_edges.data --cfg cfg/just_edges.cfg --weights weights/yolov3.pt --adam --epochs 500 --batch-size 32 --notest --cache-images --img-size 640

image

image

I ran into this error before due to using the --evolve flag, which I've now removed. I've reduced the LR to 0.001 and the GIoU hyp to 1. This is running on a 4-GPU virtual machine with K80s.

Thanks for your help and kudos to the amazing work done on this repo!

glenn-jocher commented 4 years ago

@kthomas441 I recommend you try https://github.com/ultralytics/yolov5