longcw / yolo2-pytorch

YOLOv2 in PyTorch
1.54k stars · 421 forks

Nan losses during training #14

Open 9thDimension opened 7 years ago

9thDimension commented 7 years ago

I've been interested in using the YOLO architecture for an object detection task. As a first step I decided to clone this repo and run the examples. Upon running train.py I get these results...

Note that I had to compile the 'roi_pooling' and 'reorg' modules (not 100% sure what these are for) with slightly different flags (-arch=sm_30) to match my laptop's GPU. In the future I intend to use AWS for full-scale training. Also, I'm new to the PyTorch framework.
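For anyone else needing to match the nvcc `-arch` flag to their GPU: PyTorch can report the device's compute capability, which maps directly onto the flag. A minimal sketch (the `cc_to_arch` helper is mine, not part of this repo):

```python
import torch

def cc_to_arch(capability):
    """Map a (major, minor) CUDA compute capability tuple to an nvcc -arch flag."""
    major, minor = capability
    return "sm_%d%d" % (major, minor)

if torch.cuda.is_available():
    # e.g. an older laptop GPU reporting (3, 0) means the build scripts
    # need -arch=sm_30 instead of the flag they ship with
    print(cc_to_arch(torch.cuda.get_device_capability(0)))
```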

/usr/bin/python2.7 /home/hal9000/Sources/yolo2-pytorch/train.py
voc_2007_trainval gt roidb loaded from /home/hal9000/Sources/yolo2-pytorch/data/cache/voc_2007_trainval_gt_roidb.pkl
load data succ...
('0-convolutional/kernel:0', (32L, 3L, 3L, 3L), (3, 3, 3, 32))
('0-convolutional/gamma:0', (32L,), (32,))
('0-convolutional/biases:0', (32L,), (32,))
('0-convolutional/moving_mean:0', (32L,), (32,))
('0-convolutional/moving_variance:0', (32L,), (32,))
('1-convolutional/kernel:0', (64L, 32L, 3L, 3L), (3, 3, 32, 64))
('1-convolutional/gamma:0', (64L,), (64,))
('1-convolutional/biases:0', (64L,), (64,))
('1-convolutional/moving_mean:0', (64L,), (64,))
('1-convolutional/moving_variance:0', (64L,), (64,))
('2-convolutional/kernel:0', (128L, 64L, 3L, 3L), (3, 3, 64, 128))
('2-convolutional/gamma:0', (128L,), (128,))
('2-convolutional/biases:0', (128L,), (128,))
('2-convolutional/moving_mean:0', (128L,), (128,))
('2-convolutional/moving_variance:0', (128L,), (128,))
('3-convolutional/kernel:0', (64L, 128L, 1L, 1L), (1, 1, 128, 64))
('3-convolutional/gamma:0', (64L,), (64,))
('3-convolutional/biases:0', (64L,), (64,))
('3-convolutional/moving_mean:0', (64L,), (64,))
('3-convolutional/moving_variance:0', (64L,), (64,))
('4-convolutional/kernel:0', (128L, 64L, 3L, 3L), (3, 3, 64, 128))
('4-convolutional/gamma:0', (128L,), (128,))
('4-convolutional/biases:0', (128L,), (128,))
('4-convolutional/moving_mean:0', (128L,), (128,))
('4-convolutional/moving_variance:0', (128L,), (128,))
('5-convolutional/kernel:0', (256L, 128L, 3L, 3L), (3, 3, 128, 256))
('5-convolutional/gamma:0', (256L,), (256,))
('5-convolutional/biases:0', (256L,), (256,))
('5-convolutional/moving_mean:0', (256L,), (256,))
('5-convolutional/moving_variance:0', (256L,), (256,))
('6-convolutional/kernel:0', (128L, 256L, 1L, 1L), (1, 1, 256, 128))
('6-convolutional/gamma:0', (128L,), (128,))
('6-convolutional/biases:0', (128L,), (128,))
('6-convolutional/moving_mean:0', (128L,), (128,))
('6-convolutional/moving_variance:0', (128L,), (128,))
('7-convolutional/kernel:0', (256L, 128L, 3L, 3L), (3, 3, 128, 256))
('7-convolutional/gamma:0', (256L,), (256,))
('7-convolutional/biases:0', (256L,), (256,))
('7-convolutional/moving_mean:0', (256L,), (256,))
('7-convolutional/moving_variance:0', (256L,), (256,))
('8-convolutional/kernel:0', (512L, 256L, 3L, 3L), (3, 3, 256, 512))
('8-convolutional/gamma:0', (512L,), (512,))
('8-convolutional/biases:0', (512L,), (512,))
('8-convolutional/moving_mean:0', (512L,), (512,))
('8-convolutional/moving_variance:0', (512L,), (512,))
('9-convolutional/kernel:0', (256L, 512L, 1L, 1L), (1, 1, 512, 256))
('9-convolutional/gamma:0', (256L,), (256,))
('9-convolutional/biases:0', (256L,), (256,))
('9-convolutional/moving_mean:0', (256L,), (256,))
('9-convolutional/moving_variance:0', (256L,), (256,))
('10-convolutional/kernel:0', (512L, 256L, 3L, 3L), (3, 3, 256, 512))
('10-convolutional/gamma:0', (512L,), (512,))
('10-convolutional/biases:0', (512L,), (512,))
('10-convolutional/moving_mean:0', (512L,), (512,))
('10-convolutional/moving_variance:0', (512L,), (512,))
('11-convolutional/kernel:0', (256L, 512L, 1L, 1L), (1, 1, 512, 256))
('11-convolutional/gamma:0', (256L,), (256,))
('11-convolutional/biases:0', (256L,), (256,))
('11-convolutional/moving_mean:0', (256L,), (256,))
('11-convolutional/moving_variance:0', (256L,), (256,))
('12-convolutional/kernel:0', (512L, 256L, 3L, 3L), (3, 3, 256, 512))
('12-convolutional/gamma:0', (512L,), (512,))
('12-convolutional/biases:0', (512L,), (512,))
('12-convolutional/moving_mean:0', (512L,), (512,))
('12-convolutional/moving_variance:0', (512L,), (512,))
('13-convolutional/kernel:0', (1024L, 512L, 3L, 3L), (3, 3, 512, 1024))
('13-convolutional/gamma:0', (1024L,), (1024,))
('13-convolutional/biases:0', (1024L,), (1024,))
('13-convolutional/moving_mean:0', (1024L,), (1024,))
('13-convolutional/moving_variance:0', (1024L,), (1024,))
('14-convolutional/kernel:0', (512L, 1024L, 1L, 1L), (1, 1, 1024, 512))
('14-convolutional/gamma:0', (512L,), (512,))
('14-convolutional/biases:0', (512L,), (512,))
('14-convolutional/moving_mean:0', (512L,), (512,))
('14-convolutional/moving_variance:0', (512L,), (512,))
('15-convolutional/kernel:0', (1024L, 512L, 3L, 3L), (3, 3, 512, 1024))
('15-convolutional/gamma:0', (1024L,), (1024,))
('15-convolutional/biases:0', (1024L,), (1024,))
('15-convolutional/moving_mean:0', (1024L,), (1024,))
('15-convolutional/moving_variance:0', (1024L,), (1024,))
('16-convolutional/kernel:0', (512L, 1024L, 1L, 1L), (1, 1, 1024, 512))
('16-convolutional/gamma:0', (512L,), (512,))
('16-convolutional/biases:0', (512L,), (512,))
('16-convolutional/moving_mean:0', (512L,), (512,))
('16-convolutional/moving_variance:0', (512L,), (512,))
('17-convolutional/kernel:0', (1024L, 512L, 3L, 3L), (3, 3, 512, 1024))
('17-convolutional/gamma:0', (1024L,), (1024,))
('17-convolutional/biases:0', (1024L,), (1024,))
('17-convolutional/moving_mean:0', (1024L,), (1024,))
('17-convolutional/moving_variance:0', (1024L,), (1024,))
load net succ...
epoch 0 start...
epoch: 0, step: 0, loss: 202.515, bbox_loss: 0.393, iou_loss: 201.162, cls_loss: 0.960 (2.10 s/batch)
epoch: 0, step: 10, loss: 11.230, bbox_loss: 0.704, iou_loss: 9.571, cls_loss: 0.955 (1.68 s/batch)
epoch: 0, step: 20, loss: 10.714, bbox_loss: 0.628, iou_loss: 9.156, cls_loss: 0.930 (1.69 s/batch)
epoch: 0, step: 30, loss: 11.044, bbox_loss: 1.107, iou_loss: 9.015, cls_loss: 0.922 (1.69 s/batch)
epoch: 0, step: 40, loss: 11.974, bbox_loss: 1.063, iou_loss: 9.978, cls_loss: 0.932 (1.69 s/batch)
epoch: 0, step: 50, loss: 12.925, bbox_loss: 2.811, iou_loss: 9.193, cls_loss: 0.921 (1.75 s/batch)
epoch: 0, step: 60, loss: 15.750, bbox_loss: 4.775, iou_loss: 10.045, cls_loss: 0.930 (1.60 s/batch)
epoch: 0, step: 70, loss: 16.090, bbox_loss: 7.294, iou_loss: 7.848, cls_loss: 0.948 (2.36 s/batch)
epoch: 0, step: 80, loss: 11.889, bbox_loss: 2.038, iou_loss: 8.919, cls_loss: 0.932 (2.43 s/batch)
epoch: 0, step: 90, loss: 14.579, bbox_loss: 4.674, iou_loss: 8.968, cls_loss: 0.936 (5.90 s/batch)
epoch: 0, step: 100, loss: 15.649, bbox_loss: 4.315, iou_loss: 10.388, cls_loss: 0.947 (9.22 s/batch)
epoch: 0, step: 110, loss: 37.384, bbox_loss: 26.982, iou_loss: 9.473, cls_loss: 0.928 (5.92 s/batch)
epoch: 0, step: 120, loss: 70.079, bbox_loss: 58.585, iou_loss: 10.556, cls_loss: 0.938 (2.78 s/batch)
epoch: 0, step: 130, loss: 15.072, bbox_loss: 5.575, iou_loss: 8.587, cls_loss: 0.910 (1.69 s/batch)
epoch: 0, step: 140, loss: 759.924, bbox_loss: 748.659, iou_loss: 10.314, cls_loss: 0.950 (1.70 s/batch)
epoch: 0, step: 150, loss: 822.332, bbox_loss: 810.596, iou_loss: 10.792, cls_loss: 0.945 (1.72 s/batch)
epoch: 0, step: 160, loss: nan, bbox_loss: nan, iou_loss: nan, cls_loss: nan (1.68 s/batch)
epoch: 0, step: 170, loss: nan, bbox_loss: nan, iou_loss: nan, cls_loss: nan (1.80 s/batch)
epoch: 0, step: 180, loss: nan, bbox_loss: nan, iou_loss: nan, cls_loss: nan (1.81 s/batch)
[... identical nan loss lines for steps 190 through 1080 omitted ...]
epoch: 0, step: 1090, loss: nan, bbox_loss: nan, iou_loss: nan, cls_loss: nan (1.69 s/batch)
THCudaCheck FAIL file=/b/wheel/pytorch-src/torch/lib/THC/generic/THCTensorCopy.c line=18 error=4 : unspecified launch failure
Traceback (most recent call last):
  File "/home/hal9000/Sources/yolo2-pytorch/train.py", line 74, in <module>
    im_data = net_utils.np_to_variable(im, is_cuda=True, volatile=False).permute(0, 3, 1, 2)
  File "/home/hal9000/Sources/yolo2-pytorch/utils/network.py", line 102, in np_to_variable
    v = v.cuda()
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/variable.py", line 240, in cuda
    return CudaTransfer(device_id, async)(self)
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/_functions/tensor.py", line 160, in forward
    return i.cuda(async=self.async)
  File "/usr/local/lib/python2.7/dist-packages/torch/_utils.py", line 65, in _cuda
    return new_type(self.size()).copy_(self, async)
RuntimeError: cuda runtime error (4) : unspecified launch failure at /b/wheel/pytorch-src/torch/lib/THC/generic/THCTensorCopy.c:18

Process finished with exit code 1
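One way to catch this kind of blow-up earlier is to check the loss for non-finite values and skip the update before it poisons the weights. A generic PyTorch sketch, not code from this repo:

```python
import torch

def safe_step(loss, optimizer):
    """Skip the parameter update if the loss went NaN/inf; return True if stepped."""
    if not torch.isfinite(loss).all():
        optimizer.zero_grad()
        return False
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return True

# toy demo: a finite loss steps, a NaN loss is skipped
w = torch.nn.Parameter(torch.ones(1))
opt = torch.optim.SGD([w], lr=0.1)
print(safe_step((w * 2).sum(), opt))           # True
print(safe_step(w.sum() * float("nan"), opt))  # False
```

Logging which batch first triggers the skip also narrows down whether a specific image or annotation is at fault.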
AceCoooool commented 7 years ago

I'm running into the same situation. Have you solved it?

AceCoooool commented 7 years ago

Although I rewrote the VOCDataset (the main code is the same as longcw's), the problem you're hitting may be the same as mine.

The problem is mainly caused by cfg.train_batch_size. When I set it to 1, the NaNs appeared; with the default cfg.train_batch_size (16), the problem went away. (I haven't checked the underlying reason, but I did see self.bbox_loss become very large after several iterations — you can debug it yourself.)
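If small batches make the bbox loss explode before going NaN, gradient clipping is another common guard worth trying before the optimizer step. A generic sketch using `torch.nn.utils.clip_grad_norm_` on a toy model, not something this repo does by default:

```python
import torch
import torch.nn as nn

# stand-in model; in the repo this would be the Darknet19 network
model = nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 4)
target = torch.randn(8, 2)

loss = nn.functional.mse_loss(model(x), target)
opt.zero_grad()
loss.backward()
# cap the global gradient norm so one bad batch can't blow up the weights
nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
opt.step()
```

Lowering the learning rate when shrinking the batch size is the other usual lever, since per-step gradient noise grows as the batch gets smaller.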