@CodeJjang Hi, I think this is not a problem with NMS; it is caused by your training data. The loss is exploding right at the beginning, so I suspect something is wrong with your data loader or the data itself.
@jwyang Do you have any idea how I can bring it down? I barely changed the data loader code to fit my data.
My dataset loader is a copy of 'pascal_voc.py'. I only changed the number of classes (4 classes + background, even though I don't have a background class in my train set), added support for several image formats besides JPEG, and, since my bounding boxes are already zero-indexed, omitted the -1 from the coordinate calculation.
I also set 'MAX_NUM_GT_BOXES' to 93, as my images can contain up to 93 (very small) objects inside them.
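Roughly, the annotation-loading change looks like this (just a sketch of the idea, assuming VOC-style XML, not my exact loader):

```python
import numpy as np
import xml.etree.ElementTree as ET

def load_boxes(xml_path):
    """Parse a VOC-style annotation whose coordinates are already
    zero-indexed, so the usual '- 1' is omitted."""
    objs = ET.parse(xml_path).findall('object')
    boxes = np.zeros((len(objs), 4), dtype=np.float32)
    for ix, obj in enumerate(objs):
        bbox = obj.find('bndbox')
        boxes[ix, :] = [float(bbox.find('xmin').text),
                        float(bbox.find('ymin').text),
                        float(bbox.find('xmax').text),
                        float(bbox.find('ymax').text)]
    return boxes

# In the config, MAX_NUM_GT_BOXES is raised to 93 to match my densest image.
```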
What can I do in order to train the model without exploding the loss?
Edit:
The first thing I did was decrease the batch size to 4 (which didn't help), and then to 1.
Now I get the following error:
[session 1][epoch 1][iter 0] loss: 196415.5469, lr: 4.00e-03
fg/bg=(0/128), time cost: 1.562245
rpn_cls: 196415.5469, rpn_box: 0.0000, rcnn_cls: 0.0000, rcnn_box 0.0000
Traceback (most recent call last):
File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/trainval_net.py", line 326, in <module>
rois_label = fasterRCNN(im_data, im_info, gt_boxes, num_boxes)
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 68, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 78, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
raise output
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 42, in _worker
output = module(*input, **kwargs)
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/faster_rcnn/faster_rcnn.py", line 54, in forward
roi_data = self.RCNN_proposal_target(rois, gt_boxes, num_boxes)
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/rpn/proposal_target_layer_cascade.py", line 52, in forward
rois_per_image, self._num_classes)
File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/rpn/proposal_target_layer_cascade.py", line 190, in _sample_rois_pytorch
raise ValueError("bg_num_rois = 0 and fg_num_rois = 0, this should not happen!")
ValueError: bg_num_rois = 0 and fg_num_rois = 0, this should not happen!
Process finished with exit code 1
I guess the RPN doesn't find anything because the objects are quite small, am I right? Which parameters should I play with now? When I use a pretrained ResNet, I get:
[session 1][epoch 1][iter 0] loss: 3.0736, lr: 4.00e-03
fg/bg=(9/119), time cost: 1.587983
rpn_cls: 0.7013, rpn_box: 0.7161, rcnn_cls: 1.6172, rcnn_box 0.0390
[session 1][epoch 1][iter 100] loss: nan, lr: 4.00e-03
fg/bg=(0/128), time cost: 87.171304
rpn_cls: 0.5350, rpn_box: 0.0000, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 200] loss: nan, lr: 4.00e-03
fg/bg=(0/128), time cost: 87.340177
rpn_cls: 0.4193, rpn_box: 0.0000, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 300] loss: nan, lr: 4.00e-03
fg/bg=(0/128), time cost: 86.971604
rpn_cls: 0.3733, rpn_box: 0.0000, rcnn_cls: nan, rcnn_box nan
@CodeJjang Hi, it still seems weird that the number of fg samples drops to zero as training proceeds. I will check that on my side.
@jwyang Thanks. Do you have any tips on training when the dataset consists of quite small objects? Perhaps that is what causes the problem?
@CodeJjang, yes, that might be the cause: our batch data loader crops images, so small objects can be cropped out of the training data, leaving no fg in the image at all. To address this, one option is to set the batch size to 1 and then disable cropping by setting this line to False:
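(Purely for intuition, and not the line referenced above: a small illustrative sketch of why a random crop can end up with zero foreground boxes when the objects are tiny and sparse, assuming boxes are [x1, y1, x2, y2] arrays.)

```python
import numpy as np

def gt_boxes_in_crop(gt_boxes, x0, y0, crop_w, crop_h):
    """Keep only the ground-truth boxes whose centers fall inside a crop
    window. With many tiny, sparse objects this can easily come back
    empty, which is what produces fg=0 batches during training."""
    cx = (gt_boxes[:, 0] + gt_boxes[:, 2]) / 2.0
    cy = (gt_boxes[:, 1] + gt_boxes[:, 3]) / 2.0
    inside = (cx >= x0) & (cx < x0 + crop_w) & (cy >= y0) & (cy < y0 + crop_h)
    return gt_boxes[inside]
```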
@jwyang Thanks for the quick response.
I will definitely try that.
Do you think I should play with the anchor scale sizes as well?
Another thing I'd like to hear from you about:
I have several objects that are annotated with a single point (since they are so small), so their bounding box is just the same point repeated four times.
Maybe this somehow breaks the RPN? How do you think I should deal with it?
By the way, my average object area is around 1400 pixels, with the minimum being 1 due to the single-point annotations above.
@CodeJjang If the bbox is just a single point, the bounding box should be [x1, y1, x1+1, y1+1]. That said, I think it is extremely hard (or impossible) for Faster R-CNN to detect such a small box: after downsampling, the box would be much smaller than one pixel on the feature map, so you should remove these tiny boxes during training. Since your boxes are generally around 30 pixels in size, it would also be good to change the anchor scales to smaller values.
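A minimal sketch of that filtering step (illustrative only; the helper name and threshold are placeholders, and boxes are assumed to be [x1, y1, x2, y2] arrays):

```python
import numpy as np

def drop_tiny_boxes(boxes, min_side=2):
    """Remove degenerate / near-point boxes before training
    (min_side is a placeholder threshold)."""
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    keep = (w >= min_side) & (h >= min_side)
    return boxes[keep]
```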
@jwyang OK, I removed them and then trained for 1 epoch with anchor scales of [1, 2, 4, 8, 16] (16 captures the largest object in my dataset, 1 the smallest):
before filtering, there are 3174 images...
after filtering, there are 3174 images...
3174 roidb entries
Loading pretrained weights from data/pretrained_model/resnet101_caffe.pth
/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/rpn/rpn.py:68: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
rpn_cls_prob_reshape = F.softmax(rpn_cls_score_reshape)
/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/faster_rcnn/faster_rcnn.py:99: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
cls_prob = F.softmax(cls_score)
[session 1][epoch 1][iter 0] loss: 2.2807, lr: 4.00e-03
fg/bg=(1/127), time cost: 1.571556
rpn_cls: 0.7308, rpn_box: 0.0036, rcnn_cls: 1.5462, rcnn_box 0.0001
[session 1][epoch 1][iter 100] loss: 0.7577, lr: 4.00e-03
fg/bg=(14/114), time cost: 88.627255
rpn_cls: 0.0332, rpn_box: 0.0132, rcnn_cls: 0.3017, rcnn_box 0.2466
[session 1][epoch 1][iter 200] loss: 0.6620, lr: 4.00e-03
fg/bg=(32/96), time cost: 88.554790
rpn_cls: 0.0154, rpn_box: 0.0119, rcnn_cls: 0.2113, rcnn_box 0.4514
[session 1][epoch 1][iter 300] loss: 0.5930, lr: 4.00e-03
fg/bg=(11/117), time cost: 88.482760
rpn_cls: 0.0008, rpn_box: 0.0017, rcnn_cls: 0.0594, rcnn_box 0.1276
[session 1][epoch 1][iter 400] loss: 0.5873, lr: 4.00e-03
fg/bg=(23/105), time cost: 88.516982
rpn_cls: 0.0618, rpn_box: 0.0167, rcnn_cls: 0.1997, rcnn_box 0.2790
[session 1][epoch 1][iter 500] loss: 0.5350, lr: 4.00e-03
fg/bg=(25/103), time cost: 88.559573
rpn_cls: 0.0119, rpn_box: 0.0145, rcnn_cls: 0.2690, rcnn_box 0.3017
[session 1][epoch 1][iter 600] loss: 0.4720, lr: 4.00e-03
fg/bg=(32/96), time cost: 88.640138
rpn_cls: 0.0131, rpn_box: 0.0192, rcnn_cls: 0.2619, rcnn_box 0.3176
[session 1][epoch 1][iter 700] loss: 0.5162, lr: 4.00e-03
fg/bg=(32/96), time cost: 88.620682
rpn_cls: 0.0145, rpn_box: 0.0095, rcnn_cls: 0.1665, rcnn_box 0.4751
[session 1][epoch 1][iter 800] loss: 0.4596, lr: 4.00e-03
fg/bg=(27/101), time cost: 88.645700
rpn_cls: 0.0592, rpn_box: 0.0363, rcnn_cls: 0.5551, rcnn_box 0.3302
[session 1][epoch 1][iter 900] loss: 0.4404, lr: 4.00e-03
fg/bg=(27/101), time cost: 88.658445
rpn_cls: 0.0134, rpn_box: 0.0012, rcnn_cls: 0.1136, rcnn_box 0.2929
[session 1][epoch 1][iter 1000] loss: 0.3656, lr: 4.00e-03
fg/bg=(32/96), time cost: 88.901456
rpn_cls: 0.0009, rpn_box: 0.0051, rcnn_cls: 0.0381, rcnn_box 0.2727
[session 1][epoch 1][iter 1100] loss: 0.4342, lr: 4.00e-03
fg/bg=(12/116), time cost: 89.013582
rpn_cls: 0.0298, rpn_box: 0.0120, rcnn_cls: 0.1846, rcnn_box 0.1274
[session 1][epoch 1][iter 1200] loss: 0.4642, lr: 4.00e-03
fg/bg=(32/96), time cost: 88.988293
rpn_cls: 0.0008, rpn_box: 0.0032, rcnn_cls: 0.0911, rcnn_box 0.1734
[session 1][epoch 1][iter 1300] loss: 0.4205, lr: 4.00e-03
fg/bg=(32/96), time cost: 88.669224
rpn_cls: 0.0071, rpn_box: 0.0047, rcnn_cls: 0.2761, rcnn_box 0.2634
[session 1][epoch 1][iter 1400] loss: 0.3865, lr: 4.00e-03
fg/bg=(32/96), time cost: 88.622101
rpn_cls: 0.0110, rpn_box: 0.0042, rcnn_cls: 0.1309, rcnn_box 0.1807
[session 1][epoch 1][iter 1500] loss: 0.3914, lr: 4.00e-03
fg/bg=(32/96), time cost: 88.633439
rpn_cls: 0.0093, rpn_box: 0.0044, rcnn_cls: 0.1291, rcnn_box 0.3489
[session 1][epoch 1][iter 1600] loss: 0.3732, lr: 4.00e-03
fg/bg=(32/96), time cost: 88.571652
rpn_cls: 0.0082, rpn_box: 0.0247, rcnn_cls: 0.1348, rcnn_box 0.3052
[session 1][epoch 1][iter 1700] loss: 0.4248, lr: 4.00e-03
fg/bg=(32/96), time cost: 88.575121
rpn_cls: 0.0231, rpn_box: 0.0059, rcnn_cls: 0.1251, rcnn_box 0.2280
[session 1][epoch 1][iter 1800] loss: 0.3906, lr: 4.00e-03
fg/bg=(32/96), time cost: 88.579328
rpn_cls: 0.0034, rpn_box: 0.0023, rcnn_cls: 0.0637, rcnn_box 0.1694
[session 1][epoch 1][iter 1900] loss: 0.3232, lr: 4.00e-03
fg/bg=(24/104), time cost: 88.739019
rpn_cls: 0.0042, rpn_box: 0.0025, rcnn_cls: 0.1196, rcnn_box 0.2444
[session 1][epoch 1][iter 2000] loss: 0.3236, lr: 4.00e-03
fg/bg=(21/107), time cost: 88.675636
rpn_cls: 0.0026, rpn_box: 0.0054, rcnn_cls: 0.0224, rcnn_box 0.1865
[session 1][epoch 1][iter 2100] loss: 0.3131, lr: 4.00e-03
fg/bg=(16/112), time cost: 88.627246
rpn_cls: 0.0003, rpn_box: 0.0006, rcnn_cls: 0.0499, rcnn_box 0.0309
[session 1][epoch 1][iter 2200] loss: 0.3636, lr: 4.00e-03
fg/bg=(29/99), time cost: 88.636636
rpn_cls: 0.0085, rpn_box: 0.0049, rcnn_cls: 0.0782, rcnn_box 0.2448
[session 1][epoch 1][iter 2300] loss: 0.3120, lr: 4.00e-03
fg/bg=(31/97), time cost: 88.677933
rpn_cls: 0.0047, rpn_box: 0.0056, rcnn_cls: 0.1251, rcnn_box 0.2862
[session 1][epoch 1][iter 2400] loss: 0.2969, lr: 4.00e-03
fg/bg=(32/96), time cost: 88.668061
rpn_cls: 0.0013, rpn_box: 0.0030, rcnn_cls: 0.1097, rcnn_box 0.1873
[session 1][epoch 1][iter 2500] loss: 0.3172, lr: 4.00e-03
fg/bg=(32/96), time cost: 88.613365
rpn_cls: 0.0034, rpn_box: 0.0073, rcnn_cls: 0.0920, rcnn_box 0.2048
[session 1][epoch 1][iter 2600] loss: 0.3404, lr: 4.00e-03
fg/bg=(6/122), time cost: 88.667422
rpn_cls: 0.0064, rpn_box: 0.0014, rcnn_cls: 0.0722, rcnn_box 0.0892
[session 1][epoch 1][iter 2700] loss: 0.3020, lr: 4.00e-03
fg/bg=(32/96), time cost: 88.598949
rpn_cls: 0.0015, rpn_box: 0.0035, rcnn_cls: 0.0652, rcnn_box 0.2704
[session 1][epoch 1][iter 2800] loss: 0.2890, lr: 4.00e-03
fg/bg=(32/96), time cost: 88.620969
rpn_cls: 0.0067, rpn_box: 0.0047, rcnn_cls: 0.1282, rcnn_box 0.2242
[session 1][epoch 1][iter 2900] loss: 0.2900, lr: 4.00e-03
fg/bg=(9/119), time cost: 88.608526
rpn_cls: 0.0008, rpn_box: 0.0011, rcnn_cls: 0.1151, rcnn_box 0.0593
[session 1][epoch 1][iter 3000] loss: 0.2832, lr: 4.00e-03
fg/bg=(13/115), time cost: 88.678512
rpn_cls: 0.0026, rpn_box: 0.0002, rcnn_cls: 0.0713, rcnn_box 0.0535
[session 1][epoch 1][iter 3100] loss: 0.2791, lr: 4.00e-03
fg/bg=(32/96), time cost: 89.003265
rpn_cls: 0.0024, rpn_box: 0.0040, rcnn_cls: 0.1118, rcnn_box 0.4206
save model: saved_models/res101/my_dataset/faster_rcnn_1_1_3173.pth
65.92297983169556
Process finished with exit code 0
This definitely looks better; however, the fg counts are still quite low in some iterations (1, 9, etc.).
This is epoch 9:
[session 1][epoch 9][iter 0] loss: 0.1625, lr: 4.00e-04
fg/bg=(32/96), time cost: 1.539441
rpn_cls: 0.0002, rpn_box: 0.0016, rcnn_cls: 0.0499, rcnn_box 0.1108
Any ideas how to improve from here? :)
@jwyang
[session 1][epoch 14][iter 500] loss: 0.0917, lr: 4.00e-04
fg/bg=(96/416), time cost: 168.633036
rpn_cls: 0.0002, rpn_box: 0.0006, rcnn_cls: 0.0357, rcnn_box 0.0256
[session 1][epoch 14][iter 600] loss: 0.0865, lr: 4.00e-04
fg/bg=(90/422), time cost: 168.349854
rpn_cls: 0.0015, rpn_box: 0.0009, rcnn_cls: 0.0078, rcnn_box 0.0114
[session 1][epoch 14][iter 700] loss: 0.0794, lr: 4.00e-04
fg/bg=(118/394), time cost: 168.221013
rpn_cls: 0.0028, rpn_box: 0.0021, rcnn_cls: 0.0474, rcnn_box 0.0575
However, the network seems to learn the small vehicle class better than the large vehicle class, and for some reason it cannot learn the solar panel class (which is quite small) at all:
AP for large vehicle = 0.7531
AP for small vehicle = 0.8962
/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/datasets/voc_eval.py:204: RuntimeWarning: invalid value encountered in true_divide
rec = tp / float(npos)
/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/datasets/voc_eval.py:45: RuntimeWarning: invalid value encountered in greater_equal
if np.sum(rec >= t) == 0:
AP for solar panel = 0.0000
Mean AP = 0.5498
~~~~~~~~
Results:
0.753
0.896
0.000
0.550
~~~~~~~~
Any idea how to improve, or why it fails on the solar panel class? It also throws some warnings in the AP calculation.
Edit: I just found out why it learns small vehicles better: they appear far more often than large vehicles. I also discovered that I mistakenly filtered the solar panels out of my train set, which is why their AP is 0.
@CodeJjang, great! Then it seems that your training is fine now.
@jwyang Yep, indeed. Thanks!
However, I have run into another small challenge.
Since I'm satisfied with the results, I also added higher-resolution images to the train set (so it now contains not only 600x900 images but also 2500x4000 images, which contain the solar panel class).
[session 1][epoch 1][iter 0] loss: 2.2565, lr: 4.00e-03
fg/bg=(21/107), time cost: 1.585239
rpn_cls: 0.6889, rpn_box: 0.1035, rcnn_cls: 1.3022, rcnn_box 0.1619
[session 1][epoch 1][iter 100] loss: 0.8573, lr: 4.00e-03
fg/bg=(13/115), time cost: 88.400754
rpn_cls: 0.0554, rpn_box: 0.0037, rcnn_cls: 0.2631, rcnn_box 0.2668
[session 1][epoch 1][iter 200] loss: 0.7695, lr: 4.00e-03
fg/bg=(25/103), time cost: 88.156886
rpn_cls: 0.0085, rpn_box: 0.0058, rcnn_cls: 0.3238, rcnn_box 0.3907
[session 1][epoch 1][iter 300] loss: 0.7010, lr: 4.00e-03
fg/bg=(32/96), time cost: 88.531185
rpn_cls: 0.0446, rpn_box: 0.0237, rcnn_cls: 0.3977, rcnn_box 0.4254
Traceback (most recent call last):
File "trainval_net.py", line 316, in <module>
data = next(data_iter)
File "/home/cyb/user/installations/anaconda3/envs/my_project/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 210, in __next__
return self._process_next_batch(batch)
File "/home/cyb/user/installations/anaconda3/envs/my_project/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 230, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
File "/home/cyb/user/installations/anaconda3/envs/my_project/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 42, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/cyb/user/installations/anaconda3/envs/my_project/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 42, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/roi_data_layer/roibatchLoader.py", line 67, in __getitem__
blobs = get_minibatch(minibatch_db, self._num_classes)
File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/roi_data_layer/minibatch.py", line 30, in get_minibatch
im_blob, im_scales = _get_image_blob(roidb, random_scale_inds)
File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/roi_data_layer/minibatch.py", line 79, in _get_image_blob
cfg.TRAIN.MAX_SIZE)
File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/utils/blob.py", line 39, in prep_im_for_blob
im -= pixel_means
ValueError: operands could not be broadcast together with shapes (2666,4010,4) (1,1,3) (2666,4010,4)
Remember that, per your recommendation, I already removed the class that had single-point annotations instead of bounding boxes, set batch_size to 1 (can you explain why that matters?), and set cropping to False.
What is the problem now? Why does it fail on the dimensions? It also happens if I undo the crop=False change you recommended (I thought it might be related).
@jwyang I updated the comment above: I was getting NaN before, but after applying your fix from the last few days, I am left with only this dimension problem.
@jwyang I fixed the above error by loading my high-resolution images (which turned out to be TIFFs) as 'RGB', but now I'm back to the NaN error, despite applying everything above as well as your latest commits:
[session 1][epoch 1][iter 0] loss: 2.2763, lr: 4.00e-03
fg/bg=(9/119), time cost: 1.596377
rpn_cls: 0.6654, rpn_box: 0.0273, rcnn_cls: 1.4431, rcnn_box 0.1405
[session 1][epoch 1][iter 100] loss: 1.0325, lr: 4.00e-03
fg/bg=(5/123), time cost: 88.442405
rpn_cls: 0.1533, rpn_box: 0.0105, rcnn_cls: 0.1821, rcnn_box 0.0033
[session 1][epoch 1][iter 200] loss: 0.6312, lr: 4.00e-03
fg/bg=(5/123), time cost: 88.410855
rpn_cls: 0.1094, rpn_box: 0.0155, rcnn_cls: 0.1852, rcnn_box 0.0175
[session 1][epoch 1][iter 300] loss: 0.6391, lr: 4.00e-03
fg/bg=(4/124), time cost: 88.356845
rpn_cls: 0.0765, rpn_box: 0.0027, rcnn_cls: 0.1644, rcnn_box 0.0131
[session 1][epoch 1][iter 400] loss: 0.7291, lr: 4.00e-03
fg/bg=(26/102), time cost: 88.262887
rpn_cls: 0.0375, rpn_box: 0.0052, rcnn_cls: 0.5507, rcnn_box 0.5645
[session 1][epoch 1][iter 500] loss: 0.8458, lr: 4.00e-03
fg/bg=(10/118), time cost: 88.494323
rpn_cls: 0.0563, rpn_box: 0.0030, rcnn_cls: 0.2979, rcnn_box 0.1833
[session 1][epoch 1][iter 600] loss: 0.8930, lr: 4.00e-03
fg/bg=(25/103), time cost: 88.526348
rpn_cls: 0.0234, rpn_box: 0.0164, rcnn_cls: 0.3348, rcnn_box 0.3810
[session 1][epoch 1][iter 700] loss: 0.7675, lr: 4.00e-03
fg/bg=(32/96), time cost: 88.312344
rpn_cls: 0.0083, rpn_box: 0.0079, rcnn_cls: 0.3212, rcnn_box 0.5632
[session 1][epoch 1][iter 800] loss: 0.8177, lr: 4.00e-03
fg/bg=(9/119), time cost: 88.433375
rpn_cls: 0.0040, rpn_box: 0.0008, rcnn_cls: 0.0611, rcnn_box 0.1617
[session 1][epoch 1][iter 900] loss: 0.7154, lr: 4.00e-03
fg/bg=(8/120), time cost: 88.386658
rpn_cls: 0.0037, rpn_box: 0.0019, rcnn_cls: 0.0853, rcnn_box 0.1004
[session 1][epoch 1][iter 1000] loss: nan, lr: 4.00e-03
fg/bg=(128/0), time cost: 88.080895
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 1100] loss: nan, lr: 4.00e-03
fg/bg=(128/0), time cost: 87.066358
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
The NaN was fixed once I set MAX_NUM_GT_BOXES to the correct value.
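(To illustrate why that value matters: the data layer pads the per-image ground-truth boxes into a fixed-size tensor, so a too-small cap drops real objects. This is just a sketch of the idea, not the repo's exact code:)

```python
import torch

def pad_gt_boxes(gt_boxes, max_num_gt_boxes):
    """Pad (or truncate) a per-image gt-box tensor to a fixed size.

    If max_num_gt_boxes is smaller than the real number of objects,
    the surplus boxes are dropped here, which starves the sampling
    and can contribute to the instability seen above."""
    padded = torch.zeros(max_num_gt_boxes, 5)  # x1, y1, x2, y2, class
    num = min(gt_boxes.size(0), max_num_gt_boxes)
    padded[:num, :] = gt_boxes[:num, :]
    return padded, num
```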
@CodeJjang Hi, may I ask what your solution was to "RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch_1513368888240/work/torch/lib/THC/generic/THCStorage.c:36"?
Hello, how did you solve the problem "ValueError: operands could not be broadcast together with shapes"?
@CodeJjang I am also detecting very small targets; how did you choose all the size-related parameters?
@jwyang I also have this problem, what should I do?
/home/xwj/anaconda3/envs/torch1.0/bin/python /home/xwj/pycharm-2018.3.6/helpers/pydev/pydevd.py --multiproc --qt-support=auto --client 127.0.0.1 --port 44855 --file /media/xwj/Programm/Python/faster-rcnn.pytorch/train_copy.py
pydev debugger: process 54969 is connecting
Connected to pydev debugger (build 183.6156.13)
Called with args:
Namespace(batch_size=16, checkepoch=1, checkpoint=0, checkpoint_interval=10000, checksession=1, class_agnostic=False, cuda=True, dataset='pascal_voc', disp_interval=100, large_scale=False, lr=0.001, lr_decay_gamma=0.1, lr_decay_step=5, mGPUs=False, max_epochs=20, net='vgg16', num_workers=0, optimizer='sgd', resume=False, save_dir='models', session=1, start_epoch=1, use_tfboard=True)
/media/xwj/Programm/Python/faster-rcnn.pytorch/lib/model/utils/config.py:374: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
yaml_cfg = edict(yaml.load(f))
Using config:
{'ANCHOR_RATIOS': [0.5, 1, 2],
'ANCHOR_SCALES': [8, 16, 32],
'CROP_RESIZE_WITH_MAX_POOL': False,
'CUDA': False,
'DATA_DIR': '/media/xwj/Programm/Python/faster-rcnn.pytorch/data',
'DEDUP_BOXES': 0.0625,
'EPS': 1e-14,
'EXP_DIR': 'vgg16',
'FEAT_STRIDE': [16],
'GPU_ID': 0,
'MATLAB': 'matlab',
'MAX_NUM_GT_BOXES': 20,
'MOBILENET': {'DEPTH_MULTIPLIER': 1.0,
'FIXED_LAYERS': 5,
'REGU_DEPTH': False,
'WEIGHT_DECAY': 4e-05},
'PIXEL_MEANS': array([[[102.9801, 115.9465, 122.7717]]]),
'POOLING_MODE': 'align',
'POOLING_SIZE': 7,
'RESNET': {'FIXED_BLOCKS': 1, 'MAX_POOL': False},
'RNG_SEED': 3,
'ROOT_DIR': '/media/xwj/Programm/Python/faster-rcnn.pytorch',
'TEST': {'BBOX_REG': True,
'HAS_RPN': True,
'MAX_SIZE': 1000,
'MODE': 'nms',
'NMS': 0.3,
'PROPOSAL_METHOD': 'gt',
'RPN_MIN_SIZE': 16,
'RPN_NMS_THRESH': 0.7,
'RPN_POST_NMS_TOP_N': 300,
'RPN_PRE_NMS_TOP_N': 6000,
'RPN_TOP_N': 5000,
'SCALES': [600],
'SVM': False},
'TRAIN': {'ASPECT_GROUPING': False,
'BATCH_SIZE': 256,
'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
'BBOX_NORMALIZE_TARGETS': True,
'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
'BBOX_REG': True,
'BBOX_THRESH': 0.5,
'BG_THRESH_HI': 0.5,
'BG_THRESH_LO': 0.0,
'BIAS_DECAY': False,
'BN_TRAIN': False,
'DISPLAY': 10,
'DOUBLE_BIAS': True,
'FG_FRACTION': 0.25,
'FG_THRESH': 0.5,
'GAMMA': 0.1,
'HAS_RPN': True,
'IMS_PER_BATCH': 1,
'LEARNING_RATE': 0.01,
'MAX_SIZE': 1000,
'MOMENTUM': 0.9,
'PROPOSAL_METHOD': 'gt',
'RPN_BATCHSIZE': 256,
'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'RPN_CLOBBER_POSITIVES': False,
'RPN_FG_FRACTION': 0.5,
'RPN_MIN_SIZE': 8,
'RPN_NEGATIVE_OVERLAP': 0.3,
'RPN_NMS_THRESH': 0.7,
'RPN_POSITIVE_OVERLAP': 0.7,
'RPN_POSITIVE_WEIGHT': -1.0,
'RPN_POST_NMS_TOP_N': 2000,
'RPN_PRE_NMS_TOP_N': 12000,
'SCALES': [600],
'SNAPSHOT_ITERS': 5000,
'SNAPSHOT_KEPT': 3,
'SNAPSHOT_PREFIX': 'res101_faster_rcnn',
'STEPSIZE': [30000],
'SUMMARY_INTERVAL': 180,
'TRIM_HEIGHT': 600,
'TRIM_WIDTH': 600,
'TRUNCATED': False,
'USE_ALL_GT': True,
'USE_FLIPPED': True,
'USE_GT': False,
'WEIGHT_DECAY': 0.0005},
'USE_GPU_NMS': True}
Loaded dataset voc_2007_trainval
Set proposal method: gt
Appending horizontally-flipped training examples...
voc_2007_trainval gt roidb loaded from /media/xwj/Programm/Python/faster-rcnn.pytorch/data/cache/voc_2007_trainval_gt_roidb.pkl
done
Preparing training data...
Image sizes loaded from /media/xwj/Programm/Python/faster-rcnn.pytorch/data/cache/voc_2007_trainval_sizes.pkl
done
before filtering, there are 18406 images...
after filtering, there are 18406 images...
18406 roidb entries
vgg16(
(RCNN_rpn): _RPN(
(RPN_Conv): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(RPN_cls_score): Conv2d(512, 18, kernel_size=(1, 1), stride=(1, 1))
(RPN_bbox_pred): Conv2d(512, 36, kernel_size=(1, 1), stride=(1, 1))
(RPN_proposal): _ProposalLayer()
(RPN_anchor_target): _AnchorTargetLayer()
)
(RCNN_proposal_target): _ProposalTargetLayer()
(RCNN_roi_pool): PrRoIPool2D()
(RCNN_roi_align): RoIAlignAvg()
(RCNN_roi_crop): _RoICrop()
(RCNN_base): Sequential(
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU(inplace)
(2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): ReLU(inplace)
(4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(6): ReLU(inplace)
(7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(8): ReLU(inplace)
(9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(11): ReLU(inplace)
(12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(13): ReLU(inplace)
(14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(15): ReLU(inplace)
(16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(18): ReLU(inplace)
(19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(20): ReLU(inplace)
(21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(22): ReLU(inplace)
(23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(24): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(25): ReLU(inplace)
(26): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(27): ReLU(inplace)
(28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(29): ReLU(inplace)
)
(RCNN_top): Sequential(
(0): Linear(in_features=25088, out_features=4096, bias=True)
(1): ReLU(inplace)
(2): Dropout(p=0.5)
(3): Linear(in_features=4096, out_features=4096, bias=True)
(4): ReLU(inplace)
(5): Dropout(p=0.5)
)
(RCNN_cls_score): Linear(in_features=4096, out_features=2, bias=True)
(RCNN_bbox_pred): Linear(in_features=4096, out_features=8, bias=True)
)
[session 1][epoch 1][iter 0/1150] loss: 5.0934, lr: 1.00e-03
fg/bg=(98/3998), time cost: 2.378769
rpn_cls: 0.5810, rpn_box: 0.5940, rcnn_cls: 3.8850, rcnn_box 0.0333
[session 1][epoch 1][iter 100/1150] loss: 1.3188, lr: 1.00e-03
fg/bg=(719/3377), time cost: 237.837066
rpn_cls: 0.2457, rpn_box: 0.0229, rcnn_cls: 0.3879, rcnn_box 0.3908
[session 1][epoch 1][iter 200/1150] loss: 0.9812, lr: 1.00e-03
fg/bg=(673/3423), time cost: 235.772547
rpn_cls: 0.2274, rpn_box: 0.0174, rcnn_cls: 0.3308, rcnn_box 0.3237
[session 1][epoch 1][iter 300/1150] loss: 0.9300, lr: 1.00e-03
fg/bg=(917/3179), time cost: 237.889025
rpn_cls: 0.2124, rpn_box: 0.0241, rcnn_cls: 0.3444, rcnn_box 0.4794
[session 1][epoch 1][iter 400/1150] loss: 0.9346, lr: 1.00e-03
fg/bg=(156/3940), time cost: 240.610902
rpn_cls: 0.2184, rpn_box: 0.0153, rcnn_cls: 0.2388, rcnn_box 0.0898
[session 1][epoch 1][iter 500/1150] loss: 0.9442, lr: 1.00e-03
fg/bg=(724/3372), time cost: 243.984946
rpn_cls: 0.2009, rpn_box: 0.0155, rcnn_cls: 0.3383, rcnn_box 0.2944
[session 1][epoch 1][iter 600/1150] loss: 0.9229, lr: 1.00e-03
fg/bg=(920/3176), time cost: 247.035084
rpn_cls: 0.3912, rpn_box: 0.0443, rcnn_cls: 0.3589, rcnn_box 0.4273
[session 1][epoch 1][iter 700/1150] loss: 0.8946, lr: 1.00e-03
fg/bg=(967/3129), time cost: 247.947732
rpn_cls: 0.3382, rpn_box: 0.0598, rcnn_cls: 0.3729, rcnn_box 0.5075
[session 1][epoch 1][iter 800/1150] loss: 0.9000, lr: 1.00e-03
fg/bg=(1021/3075), time cost: 244.124488
rpn_cls: 0.3048, rpn_box: 0.0643, rcnn_cls: 0.3559, rcnn_box 0.5121
[session 1][epoch 1][iter 900/1150] loss: 0.8982, lr: 1.00e-03
fg/bg=(279/3817), time cost: 242.603654
rpn_cls: 0.2306, rpn_box: 0.0129, rcnn_cls: 0.1911, rcnn_box 0.1284
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1532502421238/work/aten/src/THC/generated/../THCReduceAll.cuh line=317 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
File "/home/xwj/pycharm-2018.3.6/helpers/pydev/pydevd.py", line 1741, in
@xwjBupt I have the same broadcast problem: ValueError: operands could not be broadcast together with shapes (2666,4010,4) (1,1,3) (2666,4010,4)
That is because your image is not RGB; it is probably CMYK, which has 4 channels (hence the 4 in the shape) instead of 3, so the subtraction of the 3-channel pixel means cannot be broadcast. All you need to do is filter out the non-RGB images from your dataset or convert them to RGB. If you are using PIL, you can check the mode with im.mode.
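For example, with PIL (just a sketch of the approach; the helper name is mine):

```python
from PIL import Image

def ensure_rgb(path):
    """Open an image and convert it to RGB if it is not already
    (e.g. 4-channel CMYK TIFFs)."""
    im = Image.open(path)
    if im.mode != 'RGB':
        im = im.convert('RGB')
    return im
```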
Trying to train this model on my own dataset.
I converted it to Pascal VOC format, made sure the maximum resolution is 1000 (most images are 600x900), and adjusted some fine details, but I get the following error while training:
I have two Titan K40 cards, but since it's an illegal-access error rather than an out-of-memory error, I wonder where it comes from.