Closed alontrais closed 4 years ago
Hello, thank you for your interest in our work! This issue seems to lack the minimum requirements for a proper response, or is insufficiently detailed for us to help you. In general you can use https://github.com/ultralytics/yolov3/compare to display differences between repository versions.
Please note that most technical problems are due to your custom code or data rather than the repository itself. If your issue is not reproducible in a fresh git clone of this repository, we cannot debug it. Before going further, run this code and ensure your issue persists:
sudo rm -rf yolov3 # remove existing
git clone https://github.com/ultralytics/yolov3 && cd yolov3 # clone latest
python3 detect.py # verify detection
python3 train.py # verify training (a few batches only)
# CODE TO REPRODUCE YOUR ISSUE HERE
Also review train_batch0.jpg and test_batch0.jpg for a sanity check of your training and testing data. If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!
@alontrais ah, one thing pops out from your terminal output: your images are very large, which is not a use case we have tested. The current objectness loss construction is probably causing this. You can try `red = 'mean'` on L372 when building your loss, which may help.
https://github.com/ultralytics/yolov3/blob/0958d81580a8a0086ac3326feeba4f6db20b70a5/utils/utils.py#L366-L373
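The effect of that loss-reduction setting can be illustrated with a small standalone sketch (hypothetical tensor sizes, not the repo's actual loss code): with `reduction='sum'` the objectness BCE loss grows with the number of grid cells, so larger images produce proportionally larger loss values, while `reduction='mean'` stays roughly scale-independent.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def obj_loss(n_cells, reduction):
    # Random objectness logits and all-background targets for n_cells anchors
    logits = torch.randn(n_cells)
    targets = torch.zeros(n_cells)
    return nn.BCEWithLogitsLoss(reduction=reduction)(logits, targets).item()

# 'sum' scales with the number of grid cells (i.e. with image size)...
small_sum, large_sum = obj_loss(1_000, 'sum'), obj_loss(16_000, 'sum')
# ...while 'mean' stays roughly constant regardless of image size
small_mean, large_mean = obj_loss(1_000, 'mean'), obj_loss(16_000, 'mean')

print(f"sum:  {small_sum:8.1f} -> {large_sum:8.1f}")
print(f"mean: {small_mean:8.3f} -> {large_mean:8.3f}")
```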
@alontrais also, irrespective of the loss reduction, your obj loss is simply far too large. The 3 loss components should be more or less evenly balanced, so you can manually adjust (lower, in your case) your objectness loss hyperparameter on L28 to compensate: https://github.com/ultralytics/yolov3/blob/0958d81580a8a0086ac3326feeba4f6db20b70a5/train.py#L23-L43
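To make the balancing advice concrete, here is an illustrative sketch (the key names mirror train.py's hyp dict, but every number below is made up): each raw loss component is scaled by its hyp weight, so lowering `hyp['obj']` directly shrinks the objectness term's share of the total.

```python
# Hypothetical raw loss components before weighting (made-up values)
raw = {'giou': 0.05, 'obj': 0.30, 'cls': 0.04}

def weighted_components(hyp):
    # Scale each raw component by its hyperparameter gain
    return {k: raw[k] * hyp[k] for k in raw}

before = weighted_components({'giou': 3.54, 'obj': 64.3, 'cls': 37.4})
after = weighted_components({'giou': 3.54, 'obj': 10.0, 'cls': 37.4})  # lowered obj gain

print(before)  # the obj term dominates the total
print(after)   # the three components are now closer to balanced
```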
I encountered the same problem. It happened after changing the learning rate decay function; the hyperparameters suddenly increased during training.
@qwe3208620 the hyperparameters are static during training except for the LR (and momentum if you use SGD). The LR scheduler modifies the LR, and a prebiasing step for the first 3 epochs increases bias-only LR and momentum.
Hi, I encountered the same problem when I trained my custom dataset, as shown in the following picture. I lowered the GIoU weight and the LR, but it doesn't help.
@guancheng817 you might want to disable apex mixed precision training if you are using it.
I encountered this problem too. I'm wondering if it's related to the empty label files or zero-filled label files? @glenn-jocher

29/179  9.72G  5.06  0.906  0  5.97   5  608
29/179  9.72G  5.05  0.91   0  5.96  11  608
WARNING: non-finite loss, ending training tensor([nan, nan, 0., nan], device='cuda:0')
@yangxu351 no definitely not. If there are no labels then there is no loss other than obj loss.
If it occurs repeatably perhaps you could print the loss components and trace the origin? Your message says it occurs in the first two only, with the last loss being zero? The 3 are box, obj and cls.
So that means it's probably originating in the GIoU calculation, which feeds the obj loss as well.
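Tracing the origin as suggested could look something like the sketch below (the component names and the stacked-tensor check mimic the repo's non-finite warning, but the values here are fabricated to show the pattern in the user's log: box and obj are nan, cls is zero).

```python
import torch

# Hypothetical loss components, mimicking the (box, obj, cls, total) check
lbox = torch.tensor(float('nan'))  # e.g. a GIoU value that went non-finite
lobj = torch.tensor(float('nan'))  # obj loss consumes GIoU, so it follows
lcls = torch.tensor(0.)            # single-class task: no cls loss
loss = lbox + lobj + lcls

components = torch.stack([lbox, lobj, lcls, loss])
if not torch.isfinite(components).all():
    print('WARNING: non-finite loss components:', components)
    # From here one could dump the offending GIoU inputs, e.g. (pseudo):
    # bad = ~torch.isfinite(giou); print(pbox[bad], tbox[i][bad])
```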
The GIoU computation is in utils.py. It has protected divides in place though, so I don't understand how it could generate a nan:
def bbox_iou(box1, box2, x1y1x2y2=True, GIoU=False, DIoU=False, CIoU=False):
    # Returns the IoU of box1 to box2. box1 is 4, box2 is nx4
    box2 = box2.t()

    # Get the coordinates of bounding boxes
    if x1y1x2y2:  # x1, y1, x2, y2 = box1
        b1_x1, b1_y1, b1_x2, b1_y2 = box1[0], box1[1], box1[2], box1[3]
        b2_x1, b2_y1, b2_x2, b2_y2 = box2[0], box2[1], box2[2], box2[3]
    else:  # transform from xywh to xyxy
        b1_x1, b1_x2 = box1[0] - box1[2] / 2, box1[0] + box1[2] / 2
        b1_y1, b1_y2 = box1[1] - box1[3] / 2, box1[1] + box1[3] / 2
        b2_x1, b2_x2 = box2[0] - box2[2] / 2, box2[0] + box2[2] / 2
        b2_y1, b2_y2 = box2[1] - box2[3] / 2, box2[1] + box2[3] / 2

    # Intersection area
    inter = (torch.min(b1_x2, b2_x2) - torch.max(b1_x1, b2_x1)).clamp(0) * \
            (torch.min(b1_y2, b2_y2) - torch.max(b1_y1, b2_y1)).clamp(0)

    # Union area
    w1, h1 = b1_x2 - b1_x1, b1_y2 - b1_y1
    w2, h2 = b2_x2 - b2_x1, b2_y2 - b2_y1
    union = (w1 * h1 + 1e-16) + w2 * h2 - inter

    iou = inter / union  # iou
    if GIoU or DIoU or CIoU:
        cw = torch.max(b1_x2, b2_x2) - torch.min(b1_x1, b2_x1)  # convex (smallest enclosing box) width
        ch = torch.max(b1_y2, b2_y2) - torch.min(b1_y1, b2_y1)  # convex height
        if GIoU:  # Generalized IoU https://arxiv.org/pdf/1902.09630.pdf
            c_area = cw * ch + 1e-16  # convex area
            return iou - (c_area - union) / c_area  # GIoU
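One way to sanity-check those guards is a tiny standalone reproduction of the GIoU math (a sketch mirroring the function above, not the repo's code): the epsilon terms do keep degenerate zero-area boxes finite, but a non-finite value entering the function still propagates to nan, consistent with the nan originating upstream of the division guards.

```python
import torch

def giou(box1, box2):
    # Minimal GIoU for single xyxy boxes, mirroring the guarded math above
    inter = ((torch.min(box1[2], box2[2]) - torch.max(box1[0], box2[0])).clamp(0) *
             (torch.min(box1[3], box2[3]) - torch.max(box1[1], box2[1])).clamp(0))
    w1, h1 = box1[2] - box1[0], box1[3] - box1[1]
    w2, h2 = box2[2] - box2[0], box2[3] - box2[1]
    union = (w1 * h1 + 1e-16) + w2 * h2 - inter
    iou = inter / union
    cw = torch.max(box1[2], box2[2]) - torch.min(box1[0], box2[0])
    ch = torch.max(box1[3], box2[3]) - torch.min(box1[1], box2[1])
    c_area = cw * ch + 1e-16
    return iou - (c_area - union) / c_area

# Degenerate zero-area boxes stay finite thanks to the epsilon guards...
zero = torch.zeros(4)
print(giou(zero, zero))  # tensor(0.)

# ...but a non-finite *input* (e.g. an exploded prediction) still yields nan
bad = torch.tensor([0., 0., float('inf'), 1.])
print(torch.isfinite(giou(bad, zero)))  # tensor(False)
```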
@glenn-jocher I noticed that all empty label files are filled with zeros in load_mosaic via labels = np.zeros((0, 5), dtype=np.float32). WARNING: non-finite loss, ending training tensor([nan, nan, 0., nan], device='cuda:0'). For my single-class task the lcls is 0, but I cannot figure out why the other losses become nan.
@yangxu351 no, those are not zeros; they are placeholders for the concatenation operation. A (0, 5) array has no values.
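The placeholder point can be demonstrated directly (a small sketch; the label layout of one class id plus four box values is assumed from the (0, 5) shape): a zero-length array carries no values and contributes nothing when concatenated.

```python
import numpy as np

# A (0, 5) array holds no label values; it is a shape-compatible
# placeholder so concatenation works for images with no objects.
empty = np.zeros((0, 5), dtype=np.float32)
labels = np.array([[0, 0.5, 0.5, 0.2, 0.3]], dtype=np.float32)  # one real label

print(empty.size)  # 0 -> contributes nothing
combined = np.concatenate([empty, labels], 0)
print(combined.shape)  # (1, 5): only the real label survives
```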
@glenn-jocher Thanks for your reply. I'm still struggling with the nan losses. If the label file is empty, the tbox[i] is also empty when computing giou = bbox_iou(pbox.t(), tbox[i], x1y1x2y2=False, GIoU=True), is it ok if I set the giou=0 when the tbox[i] is empty?
@yangxu351 giou loss is only generated for labels. If there is no label there is no GIoU loss, only obj loss.
I encounter the same issue.
@rinabuoy can you reliably reproduce the issue, and if so send us code? Thanks!
I just pulled the latest version and trained a single-class detector. Whether I set mixed_precision = True or not, this error will appear.
@rinabuoy thanks! Do you have a colab notebook that you could share so we can run to reproduce the error?
The error is caused because GIoU returns nan, though we can't reproduce it on COCO, so it's very hard, if not impossible, to debug.
Ok here you go
https://drive.google.com/drive/folders/12aNIVPz3RFUpTbmniwdLkb7gcMh22DLt?usp=sharing
Thank you
When I changed hyp['giou'] to 1.0, the nan loss problem no longer occurred.
With “giou” = 1.0, it ran epoch 40 and is still running. Will see if that error pops up in later epoch.
Thanks @yangxu351
no, it seems fixed
I met the same problem, but changing hyp['giou'] to 1.0 did not work for me. I finally found it is related to momentum when using SGD: with pure SGD (momentum=0.), my code runs correctly. I cannot figure out why, since there is no division in the momentum computation at all.
@lampsonSong the nan originates in the GIoU computation somehow. But I have not been able to reproduce it, and again there are no divide-by-zeros in GIoU; it should be stable.
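One hedged explanation for the momentum observation, sketched below with the classic SGD-with-momentum update rule (all numbers are made up): momentum involves no division, but the velocity buffer keeps feeding a single transiently huge gradient back into the weights for many steps, which can push predictions into overflow territory where a downstream computation such as GIoU then goes non-finite.

```python
# Classic SGD-with-momentum update: v = mu * v + g;  p = p - lr * v
mu, lr = 0.9, 0.01
v, p = 0.0, 1.0  # velocity buffer and a single scalar "weight"
grads = [1e6] + [0.0] * 4  # one exploding gradient, then silence

for g in grads:
    v = mu * v + g  # velocity decays slowly, replaying the bad gradient
    p = p - lr * v  # the weight keeps drifting long after the spike
    print(f"v={v:12.1f}  p={p:14.1f}")
```

With momentum=0, the same spike would move the weight only once instead of compounding over subsequent steps.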
Hey, I'm getting this error while training on a custom dataset. I've trained similar datasets before and haven't received this error. I have not changed any hyperparameters since then.
Command I ran python train.py --data data/just_edges.data --cfg cfg/just_edges.cfg --weights weights/yolov3.pt --adam --epochs 500 --batch-size 32 --notest --cache-images --img-size 640
I ran into this error before due to using the --evolve flag, which I've now removed. I've reduced the LR to 0.001 and the GIoU hyp to 1. This is running on a 4-GPU virtual machine with K80s.
Thanks for your help and kudos to the amazing work done on this repo!
@kthomas441 recommend you try https://github.com/ultralytics/yolov5
I cloned the newest version, when I run the train script I get this warning: WARNING: non-finite loss, ending training tensor([nan, nan, nan, nan], device='cuda:0')
Caching labels (20 found, 0 missing, 0 empty, 0 duplicate, for 20 images): 100%|█████████████████████| 20/20 [00:00<00:00, 465.89it/s]
Caching labels (10 found, 0 missing, 0 empty, 0 duplicate, for 10 images): 100%|█████████████████████| 10/10 [00:00<00:00, 468.26it/s]
Model Summary: 225 layers, 6.25787e+07 parameters, 6.25787e+07 gradients
Using 0 dataloader workers
Starting training for 2730 epochs...
Model Bias Summary:
layer  regression     objectness       classification
89      0.00+/-0.02    -5.43+/-0.01    -5.00+/-0.01
101    -0.01+/-0.02    -5.83+/-0.05    -4.97+/-0.06
113    -0.02+/-0.10   -17.67+/-22.01   -2.75+/-2.99
When I ran the previous version a month ago, I didn't get this warning. What could be the problem?