jwyang / faster-rcnn.pytorch

A faster pytorch implementation of faster r-cnn
MIT License

nms_gpu throws runtime error: illegal memory access #99

Closed CodeJjang closed 6 years ago

CodeJjang commented 6 years ago

Trying to train this model on my own dataset.
I converted it to Pascal VOC format, made sure the maximum resolution is 1000 (most images are 600x900), and adjusted some fine details, but I get the following error while training:

Called with args:
Namespace(batch_size=8, checkepoch=1, checkpoint=0, checkpoint_interval=10000, checksession=1, class_agnostic=False, cuda=True, dataset='my_custom_ds', disp_interval=100, large_scale=False, lr=0.004, lr_decay_gamma=0.1, lr_decay_step=8, mGPUs=True, max_epochs=2, net='res101', num_workers=2, optimizer='sgd', resume=False, save_dir='saved_models', session=1, start_epoch=1, use_tfboard=False)
Using config:
{'ANCHOR_RATIOS': [0.5, 1, 2],
 'ANCHOR_SCALES': [8, 16, 32],
 'CROP_RESIZE_WITH_MAX_POOL': False,
 'CUDA': False,
 'DATA_DIR': '/home/cyb/user/pycharm/src/faster-rcnn.pytorch/data',
 'DEDUP_BOXES': 0.0625,
 'EPS': 1e-14,
 'EXP_DIR': 'res101',
 'FEAT_STRIDE': [16],
 'GPU_ID': 0,
 'MATLAB': 'matlab',
 'MAX_NUM_GT_BOXES': 93,
 'MOBILENET': {'DEPTH_MULTIPLIER': 1.0,
               'FIXED_LAYERS': 5,
               'REGU_DEPTH': False,
               'WEIGHT_DECAY': 4e-05},
 'PIXEL_MEANS': array([[[ 102.9801,  115.9465,  122.7717]]]),
 'POOLING_MODE': 'align',
 'POOLING_SIZE': 7,
 'RESNET': {'FIXED_BLOCKS': 1, 'MAX_POOL': False},
 'RNG_SEED': 3,
 'ROOT_DIR': '/home/cyb/user/pycharm/src/faster-rcnn.pytorch',
 'TEST': {'BBOX_REG': True,
          'HAS_RPN': True,
          'MAX_SIZE': 1000,
          'MODE': 'nms',
          'NMS': 0.3,
          'PROPOSAL_METHOD': 'gt',
          'RPN_MIN_SIZE': 16,
          'RPN_NMS_THRESH': 0.7,
          'RPN_POST_NMS_TOP_N': 300,
          'RPN_PRE_NMS_TOP_N': 6000,
          'RPN_TOP_N': 5000,
          'SCALES': [600],
          'SVM': False},
 'TRAIN': {'ASPECT_GROUPING': False,
           'BATCH_SIZE': 128,
           'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
           'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
           'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
           'BBOX_NORMALIZE_TARGETS': True,
           'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
           'BBOX_REG': True,
           'BBOX_THRESH': 0.5,
           'BG_THRESH_HI': 0.5,
           'BG_THRESH_LO': 0.0,
           'BIAS_DECAY': False,
           'BN_TRAIN': False,
           'DISPLAY': 20,
           'DOUBLE_BIAS': False,
           'FG_FRACTION': 0.25,
           'FG_THRESH': 0.5,
           'GAMMA': 0.1,
           'HAS_RPN': True,
           'IMS_PER_BATCH': 1,
           'LEARNING_RATE': 0.001,
           'MAX_SIZE': 1000,
           'MOMENTUM': 0.9,
           'PROPOSAL_METHOD': 'gt',
           'RPN_BATCHSIZE': 256,
           'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
           'RPN_CLOBBER_POSITIVES': False,
           'RPN_FG_FRACTION': 0.5,
           'RPN_MIN_SIZE': 8,
           'RPN_NEGATIVE_OVERLAP': 0.3,
           'RPN_NMS_THRESH': 0.7,
           'RPN_POSITIVE_OVERLAP': 0.7,
           'RPN_POSITIVE_WEIGHT': -1.0,
           'RPN_POST_NMS_TOP_N': 2000,
           'RPN_PRE_NMS_TOP_N': 12000,
           'SCALES': [600],
           'SNAPSHOT_ITERS': 5000,
           'SNAPSHOT_KEPT': 3,
           'SNAPSHOT_PREFIX': 'res101_faster_rcnn',
           'STEPSIZE': [30000],
           'SUMMARY_INTERVAL': 180,
           'TRIM_HEIGHT': 600,
           'TRIM_WIDTH': 600,
           'TRUNCATED': False,
           'USE_ALL_GT': True,
           'USE_FLIPPED': True,
           'USE_GT': False,
           'WEIGHT_DECAY': 0.0001},
 'USE_GPU_NMS': True}
Loaded dataset `voc_2007_trainval` for training
Set proposal method: gt
Appending horizontally-flipped training examples...
wrote gt roidb to /home/cyb/user/pycharm/src/faster-rcnn.pytorch/data/cache/voc_2007_trainval_gt_roidb.pkl
done
Preparing training data...
done
before filtering, there are 4224 images...
after filtering, there are 4224 images...
4224 roidb entries
/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/rpn/rpn.py:68: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  rpn_cls_prob_reshape = F.softmax(rpn_cls_score_reshape)
/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/faster_rcnn/faster_rcnn.py:98: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  cls_prob = F.softmax(cls_score)
[session 1][epoch  1][iter    0] loss: 233749.3594, lr: 4.00e-03
            fg/bg=(24/1000), time cost: 6.419112
            rpn_cls: 179158.4219, rpn_box: 41295.5859, rcnn_cls: 9535.8477, rcnn_box 3759.5171
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1513368888240/work/torch/lib/THC/generic/THCTensorMath.cu line=267 error=77 : an illegal memory access was encountered
an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered, at line 147
CUDA Error: an illegal memory access was encountered, at line 154
an illegal memory access was encountered
an illegal memory access was encountered
Traceback (most recent call last):
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/trainval_net.py", line 326, in <module>
    rois_label = fasterRCNN(im_data, im_info, gt_boxes, num_boxes)
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 68, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 78, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
    raise output
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 42, in _worker
    output = module(*input, **kwargs)
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/faster_rcnn/faster_rcnn.py", line 50, in forward
    rois, rpn_loss_cls, rpn_loss_bbox = self.RCNN_rpn(base_feat, im_info, gt_boxes, num_boxes)
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/rpn/rpn.py", line 78, in forward
    im_info, cfg_key))
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/rpn/proposal_layer.py", line 148, in forward
    keep_idx_i = nms(torch.cat((proposals_single, scores_single), 1), nms_thresh)
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/nms/nms_wrapper.py", line 18, in nms
    return nms_gpu(dets, thresh)
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/nms/nms_gpu.py", line 11, in nms_gpu
    keep = keep[:num_out[0]]
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch_1513368888240/work/torch/lib/THC/generic/THCStorage.c:36

Process finished with exit code 1

I have two Titan K40 cards; however, it's an illegal memory access and not an out-of-memory error, so I wonder where it comes from.

jwyang commented 6 years ago

@CodeJjang Hi, I think this is not a problem with NMS; it is because of your training data. It seems that the loss is exploding right at the beginning, so I guess there is something wrong with your data loader or your data.
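If you want to rule the data out quickly, a minimal sanity check over the annotations could look like the sketch below (this is not the repo's code; it assumes standard Pascal VOC XML files under the usual data/VOCdevkit2007 layout, so adjust paths and the coordinate convention to your dataset):

# Sketch: scan Pascal VOC-style XML annotations for boxes that could blow up training
# (degenerate or out-of-bounds coordinates). Adjust the glob path for your dataset.
import glob
import xml.etree.ElementTree as ET

for xml_path in glob.glob('data/VOCdevkit2007/VOC2007/Annotations/*.xml'):
    root = ET.parse(xml_path).getroot()
    size = root.find('size')
    w, h = int(size.find('width').text), int(size.find('height').text)
    for obj in root.findall('object'):
        bb = obj.find('bndbox')
        x1, y1 = float(bb.find('xmin').text), float(bb.find('ymin').text)
        x2, y2 = float(bb.find('xmax').text), float(bb.find('ymax').text)
        if x2 <= x1 or y2 <= y1 or x1 < 0 or y1 < 0 or x2 > w or y2 > h:
            print(xml_path, obj.find('name').text, (x1, y1, x2, y2), 'image size:', (w, h))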

CodeJjang commented 6 years ago

@jwyang Do you have any idea how I can decrease it? I barely changed the data loader code to fit my data.
My dataset class is a copy of 'pascal_voc.py'; I just changed the number of classes (4 classes + background, even though I don't have background in my train set), added several image formats besides JPEG, and, since my bounding boxes are already zero-indexed, omitted the -1 from the coordinate calculation.
I also set 'MAX_NUM_GT_BOXES' to 93, as my images can contain up to 93 (very small) objects.
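For reference, the coordinate parsing in my pascal_voc.py copy now looks roughly like this (a sketch, not the exact repo code; the stock loader subtracts 1 from each coordinate to make the 1-based VOC values zero-based, which I dropped):

# Sketch of how one <object> element is parsed in my pascal_voc.py copy.
# Boxes are already zero-indexed, so the usual "- 1" per coordinate is omitted.
import xml.etree.ElementTree as ET

def parse_box(obj):
    bb = obj.find('bndbox')
    x1 = float(bb.find('xmin').text)   # stock code: float(...) - 1
    y1 = float(bb.find('ymin').text)
    x2 = float(bb.find('xmax').text)
    y2 = float(bb.find('ymax').text)
    return [x1, y1, x2, y2]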

What can I do in order to train the model without exploding the loss?

Edit: The first thing I did was decrease the batch size to 4 (didn't work), and then to 1.
Now I get the following error:

[session 1][epoch  1][iter    0] loss: 196415.5469, lr: 4.00e-03
            fg/bg=(0/128), time cost: 1.562245
            rpn_cls: 196415.5469, rpn_box: 0.0000, rcnn_cls: 0.0000, rcnn_box 0.0000
Traceback (most recent call last):
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/trainval_net.py", line 326, in <module>
    rois_label = fasterRCNN(im_data, im_info, gt_boxes, num_boxes)
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 68, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 78, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
    raise output
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 42, in _worker
    output = module(*input, **kwargs)
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/faster_rcnn/faster_rcnn.py", line 54, in forward
    roi_data = self.RCNN_proposal_target(rois, gt_boxes, num_boxes)
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/rpn/proposal_target_layer_cascade.py", line 52, in forward
    rois_per_image, self._num_classes)
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/rpn/proposal_target_layer_cascade.py", line 190, in _sample_rois_pytorch
    raise ValueError("bg_num_rois = 0 and fg_num_rois = 0, this should not happen!")
ValueError: bg_num_rois = 0 and fg_num_rois = 0, this should not happen!

Process finished with exit code 1

I guess the RPN doesn't find anything because the objects are quite small, am I right? What parameters should I play with now? When I use a pretrained ResNet I get:

[session 1][epoch  1][iter    0] loss: 3.0736, lr: 4.00e-03
            fg/bg=(9/119), time cost: 1.587983
            rpn_cls: 0.7013, rpn_box: 0.7161, rcnn_cls: 1.6172, rcnn_box 0.0390
[session 1][epoch  1][iter  100] loss: nan, lr: 4.00e-03
            fg/bg=(0/128), time cost: 87.171304
            rpn_cls: 0.5350, rpn_box: 0.0000, rcnn_cls: nan, rcnn_box nan
[session 1][epoch  1][iter  200] loss: nan, lr: 4.00e-03
            fg/bg=(0/128), time cost: 87.340177
            rpn_cls: 0.4193, rpn_box: 0.0000, rcnn_cls: nan, rcnn_box nan
[session 1][epoch  1][iter  300] loss: nan, lr: 4.00e-03
            fg/bg=(0/128), time cost: 86.971604
            rpn_cls: 0.3733, rpn_box: 0.0000, rcnn_cls: nan, rcnn_box nan
jwyang commented 6 years ago

@CodeJjang Hi, it still seems weird that the number of fg becomes zero as training proceeds. I will check that on my side.

CodeJjang commented 6 years ago

@jwyang Thanks. Do you have any tips on how to train when my dataset consists of quite small objects? Perhaps that is what causes the problem?

jwyang commented 6 years ago

@CodeJjang, yes, that might cause the problem: our batch data loader crops images, so small objects might be removed from the training data, and then there is no fg left in the image. To address this, one way is to set the batch size to 1 and then avoid cropping the image by setting this line to False:

https://github.com/jwyang/faster-rcnn.pytorch/blob/7c0c5fbba8f159820c3d28302a9681c0ce0fc84e/lib/roi_data_layer/roibatchLoader.py#L88
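For reference, a batch-size-1 run is typically launched like this (flag names as in the repo's README; treat the exact invocation as an example, and use whatever dataset key you registered in trainval_net.py):

CUDA_VISIBLE_DEVICES=0 python trainval_net.py --dataset my_custom_ds --net res101 --bs 1 --nw 2 --lr 0.004 --cuda

With a batch size of 1 the loader only rescales each image; with larger batches it also crops the images in a batch to a common size/aspect ratio, and that cropping is what can drop small objects.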

CodeJjang commented 6 years ago

@jwyang Thanks for the quick response. I will definitely try that.
Do you think I should play with the anchor scales as well?

Another thing I'd like to hear from you about:
I have several objects which are annotated with a single point (since they are small), so a bounding box for them would be a box containing the same point repeated four times.
Maybe this approach somehow fails the RPN? How do you think I should deal with it?

My average object area, by the way, is around 1400 px, with the minimum being 1 due to the above.

jwyang commented 6 years ago

@CodeJjang If the bbox is just a single point, the bounding box should be [x1, y1, x1+1, y1+1]. Also, I think it is extremely hard (or impossible) for Faster R-CNN to detect such a small box: after downsampling, the bounding box size will be much less than one pixel. You should remove these kinds of small boxes during training. Since your boxes generally have a size of around 30, it would also be good to change the anchor sizes to smaller ones.
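A rough sketch of the filtering idea (illustrative only; the 8-pixel threshold is an arbitrary example, and the anchor-scale comment describes the usual config mechanism, so check your own trainval_net.py / cfg file):

# Sketch: drop ground-truth boxes whose width or height is below a threshold
# before building the roidb. The threshold value is an arbitrary example.
import numpy as np

def filter_tiny_boxes(boxes, min_size=8):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; keep only boxes >= min_size on both sides."""
    ws = boxes[:, 2] - boxes[:, 0] + 1
    hs = boxes[:, 3] - boxes[:, 1] + 1
    keep = np.where((ws >= min_size) & (hs >= min_size))[0]
    return boxes[keep]

# Smaller anchors are set through the ANCHOR_SCALES config entry (e.g. in the
# dataset's set_cfgs list in trainval_net.py), for instance '[2, 4, 8, 16]'.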

CodeJjang commented 6 years ago

@jwyang OK, I removed them and then trained for 1 epoch with anchor scales of [1, 2, 4, 8, 16] (16 captures the largest object in the dataset, 1 the smallest):

before filtering, there are 3174 images...
after filtering, there are 3174 images...
3174 roidb entries
Loading pretrained weights from data/pretrained_model/resnet101_caffe.pth
/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/rpn/rpn.py:68: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  rpn_cls_prob_reshape = F.softmax(rpn_cls_score_reshape)
/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/faster_rcnn/faster_rcnn.py:99: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  cls_prob = F.softmax(cls_score)
[session 1][epoch  1][iter    0] loss: 2.2807, lr: 4.00e-03
            fg/bg=(1/127), time cost: 1.571556
            rpn_cls: 0.7308, rpn_box: 0.0036, rcnn_cls: 1.5462, rcnn_box 0.0001
[session 1][epoch  1][iter  100] loss: 0.7577, lr: 4.00e-03
            fg/bg=(14/114), time cost: 88.627255
            rpn_cls: 0.0332, rpn_box: 0.0132, rcnn_cls: 0.3017, rcnn_box 0.2466
[session 1][epoch  1][iter  200] loss: 0.6620, lr: 4.00e-03
            fg/bg=(32/96), time cost: 88.554790
            rpn_cls: 0.0154, rpn_box: 0.0119, rcnn_cls: 0.2113, rcnn_box 0.4514
[session 1][epoch  1][iter  300] loss: 0.5930, lr: 4.00e-03
            fg/bg=(11/117), time cost: 88.482760
            rpn_cls: 0.0008, rpn_box: 0.0017, rcnn_cls: 0.0594, rcnn_box 0.1276
[session 1][epoch  1][iter  400] loss: 0.5873, lr: 4.00e-03
            fg/bg=(23/105), time cost: 88.516982
            rpn_cls: 0.0618, rpn_box: 0.0167, rcnn_cls: 0.1997, rcnn_box 0.2790
[session 1][epoch  1][iter  500] loss: 0.5350, lr: 4.00e-03
            fg/bg=(25/103), time cost: 88.559573
            rpn_cls: 0.0119, rpn_box: 0.0145, rcnn_cls: 0.2690, rcnn_box 0.3017
[session 1][epoch  1][iter  600] loss: 0.4720, lr: 4.00e-03
            fg/bg=(32/96), time cost: 88.640138
            rpn_cls: 0.0131, rpn_box: 0.0192, rcnn_cls: 0.2619, rcnn_box 0.3176
[session 1][epoch  1][iter  700] loss: 0.5162, lr: 4.00e-03
            fg/bg=(32/96), time cost: 88.620682
            rpn_cls: 0.0145, rpn_box: 0.0095, rcnn_cls: 0.1665, rcnn_box 0.4751
[session 1][epoch  1][iter  800] loss: 0.4596, lr: 4.00e-03
            fg/bg=(27/101), time cost: 88.645700
            rpn_cls: 0.0592, rpn_box: 0.0363, rcnn_cls: 0.5551, rcnn_box 0.3302
[session 1][epoch  1][iter  900] loss: 0.4404, lr: 4.00e-03
            fg/bg=(27/101), time cost: 88.658445
            rpn_cls: 0.0134, rpn_box: 0.0012, rcnn_cls: 0.1136, rcnn_box 0.2929
[session 1][epoch  1][iter 1000] loss: 0.3656, lr: 4.00e-03
            fg/bg=(32/96), time cost: 88.901456
            rpn_cls: 0.0009, rpn_box: 0.0051, rcnn_cls: 0.0381, rcnn_box 0.2727
[session 1][epoch  1][iter 1100] loss: 0.4342, lr: 4.00e-03
            fg/bg=(12/116), time cost: 89.013582
            rpn_cls: 0.0298, rpn_box: 0.0120, rcnn_cls: 0.1846, rcnn_box 0.1274
[session 1][epoch  1][iter 1200] loss: 0.4642, lr: 4.00e-03
            fg/bg=(32/96), time cost: 88.988293
            rpn_cls: 0.0008, rpn_box: 0.0032, rcnn_cls: 0.0911, rcnn_box 0.1734
[session 1][epoch  1][iter 1300] loss: 0.4205, lr: 4.00e-03
            fg/bg=(32/96), time cost: 88.669224
            rpn_cls: 0.0071, rpn_box: 0.0047, rcnn_cls: 0.2761, rcnn_box 0.2634
[session 1][epoch  1][iter 1400] loss: 0.3865, lr: 4.00e-03
            fg/bg=(32/96), time cost: 88.622101
            rpn_cls: 0.0110, rpn_box: 0.0042, rcnn_cls: 0.1309, rcnn_box 0.1807
[session 1][epoch  1][iter 1500] loss: 0.3914, lr: 4.00e-03
            fg/bg=(32/96), time cost: 88.633439
            rpn_cls: 0.0093, rpn_box: 0.0044, rcnn_cls: 0.1291, rcnn_box 0.3489
[session 1][epoch  1][iter 1600] loss: 0.3732, lr: 4.00e-03
            fg/bg=(32/96), time cost: 88.571652
            rpn_cls: 0.0082, rpn_box: 0.0247, rcnn_cls: 0.1348, rcnn_box 0.3052
[session 1][epoch  1][iter 1700] loss: 0.4248, lr: 4.00e-03
            fg/bg=(32/96), time cost: 88.575121
            rpn_cls: 0.0231, rpn_box: 0.0059, rcnn_cls: 0.1251, rcnn_box 0.2280
[session 1][epoch  1][iter 1800] loss: 0.3906, lr: 4.00e-03
            fg/bg=(32/96), time cost: 88.579328
            rpn_cls: 0.0034, rpn_box: 0.0023, rcnn_cls: 0.0637, rcnn_box 0.1694
[session 1][epoch  1][iter 1900] loss: 0.3232, lr: 4.00e-03
            fg/bg=(24/104), time cost: 88.739019
            rpn_cls: 0.0042, rpn_box: 0.0025, rcnn_cls: 0.1196, rcnn_box 0.2444
[session 1][epoch  1][iter 2000] loss: 0.3236, lr: 4.00e-03
            fg/bg=(21/107), time cost: 88.675636
            rpn_cls: 0.0026, rpn_box: 0.0054, rcnn_cls: 0.0224, rcnn_box 0.1865
[session 1][epoch  1][iter 2100] loss: 0.3131, lr: 4.00e-03
            fg/bg=(16/112), time cost: 88.627246
            rpn_cls: 0.0003, rpn_box: 0.0006, rcnn_cls: 0.0499, rcnn_box 0.0309
[session 1][epoch  1][iter 2200] loss: 0.3636, lr: 4.00e-03
            fg/bg=(29/99), time cost: 88.636636
            rpn_cls: 0.0085, rpn_box: 0.0049, rcnn_cls: 0.0782, rcnn_box 0.2448
[session 1][epoch  1][iter 2300] loss: 0.3120, lr: 4.00e-03
            fg/bg=(31/97), time cost: 88.677933
            rpn_cls: 0.0047, rpn_box: 0.0056, rcnn_cls: 0.1251, rcnn_box 0.2862
[session 1][epoch  1][iter 2400] loss: 0.2969, lr: 4.00e-03
            fg/bg=(32/96), time cost: 88.668061
            rpn_cls: 0.0013, rpn_box: 0.0030, rcnn_cls: 0.1097, rcnn_box 0.1873
[session 1][epoch  1][iter 2500] loss: 0.3172, lr: 4.00e-03
            fg/bg=(32/96), time cost: 88.613365
            rpn_cls: 0.0034, rpn_box: 0.0073, rcnn_cls: 0.0920, rcnn_box 0.2048
[session 1][epoch  1][iter 2600] loss: 0.3404, lr: 4.00e-03
            fg/bg=(6/122), time cost: 88.667422
            rpn_cls: 0.0064, rpn_box: 0.0014, rcnn_cls: 0.0722, rcnn_box 0.0892
[session 1][epoch  1][iter 2700] loss: 0.3020, lr: 4.00e-03
            fg/bg=(32/96), time cost: 88.598949
            rpn_cls: 0.0015, rpn_box: 0.0035, rcnn_cls: 0.0652, rcnn_box 0.2704
[session 1][epoch  1][iter 2800] loss: 0.2890, lr: 4.00e-03
            fg/bg=(32/96), time cost: 88.620969
            rpn_cls: 0.0067, rpn_box: 0.0047, rcnn_cls: 0.1282, rcnn_box 0.2242
[session 1][epoch  1][iter 2900] loss: 0.2900, lr: 4.00e-03
            fg/bg=(9/119), time cost: 88.608526
            rpn_cls: 0.0008, rpn_box: 0.0011, rcnn_cls: 0.1151, rcnn_box 0.0593
[session 1][epoch  1][iter 3000] loss: 0.2832, lr: 4.00e-03
            fg/bg=(13/115), time cost: 88.678512
            rpn_cls: 0.0026, rpn_box: 0.0002, rcnn_cls: 0.0713, rcnn_box 0.0535
[session 1][epoch  1][iter 3100] loss: 0.2791, lr: 4.00e-03
            fg/bg=(32/96), time cost: 89.003265
            rpn_cls: 0.0024, rpn_box: 0.0040, rcnn_cls: 0.1118, rcnn_box 0.4206
save model: saved_models/res101/my_dataset/faster_rcnn_1_1_3173.pth
65.92297983169556

Process finished with exit code 0

Definitely looks better; however, the fg counts are still relatively small in some iterations (1, 9, etc.).
This is epoch 9:

[session 1][epoch  9][iter    0] loss: 0.1625, lr: 4.00e-04
            fg/bg=(32/96), time cost: 1.539441
            rpn_cls: 0.0002, rpn_box: 0.0016, rcnn_cls: 0.0499, rcnn_box 0.1108

Any ideas how to improve from here? :)

CodeJjang commented 6 years ago

@jwyang

[session 1][epoch 14][iter  500] loss: 0.0917, lr: 4.00e-04
            fg/bg=(96/416), time cost: 168.633036
            rpn_cls: 0.0002, rpn_box: 0.0006, rcnn_cls: 0.0357, rcnn_box 0.0256
[session 1][epoch 14][iter  600] loss: 0.0865, lr: 4.00e-04
            fg/bg=(90/422), time cost: 168.349854
            rpn_cls: 0.0015, rpn_box: 0.0009, rcnn_cls: 0.0078, rcnn_box 0.0114
[session 1][epoch 14][iter  700] loss: 0.0794, lr: 4.00e-04
            fg/bg=(118/394), time cost: 168.221013
            rpn_cls: 0.0028, rpn_box: 0.0021, rcnn_cls: 0.0474, rcnn_box 0.0575

However, the network seems to learn the small vehicle class better than the large vehicle class, and it also cannot learn the solar panel class (which is quite small) for some reason:

AP for large vehicle = 0.7531
AP for small vehicle = 0.8962
/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/datasets/voc_eval.py:204: RuntimeWarning: invalid value encountered in true_divide
  rec = tp / float(npos)
/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/datasets/voc_eval.py:45: RuntimeWarning: invalid value encountered in greater_equal
  if np.sum(rec >= t) == 0:
AP for solar panel = 0.0000
Mean AP = 0.5498
~~~~~~~~
Results:
0.753
0.896
0.000
0.550
~~~~~~~~

Any idea how to improve? Or why it fails on the solar panel class? It throws some warnings during the AP calculation, though.

Edit: Just found out why it learns small vehicles better: they appear far more often than large vehicles. Also, apparently I mistakenly filtered solar panels out of my train set, which is why its AP is 0.
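(A per-class instance count over the train split would have caught that immediately; something like the sketch below, assuming the usual VOC layout, with paths adjusted to your dataset:)

# Sketch: count ground-truth instances per class in the trainval split.
from collections import Counter
import xml.etree.ElementTree as ET

voc_root = 'data/VOCdevkit2007/VOC2007'
counts = Counter()
with open(voc_root + '/ImageSets/Main/trainval.txt') as f:
    for image_id in f.read().split():
        root = ET.parse('%s/Annotations/%s.xml' % (voc_root, image_id)).getroot()
        for obj in root.findall('object'):
            counts[obj.find('name').text] += 1
print(counts)  # any class with a count of 0 will necessarily get AP = 0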

jwyang commented 6 years ago

@CodeJjang, great! Then it seems that your training is fine now.

CodeJjang commented 6 years ago

@jwyang Yep, indeed. Thanks!
However, I have tackled another small challenge. I'm satisfied with the results, so I also added higher-resolution images to the train set (so now it consists not only of 600x900 images but also of 2500x4000 images, which contain the solar panel class).

[session 1][epoch  1][iter    0] loss: 2.2565, lr: 4.00e-03
            fg/bg=(21/107), time cost: 1.585239
            rpn_cls: 0.6889, rpn_box: 0.1035, rcnn_cls: 1.3022, rcnn_box 0.1619
[session 1][epoch  1][iter  100] loss: 0.8573, lr: 4.00e-03
            fg/bg=(13/115), time cost: 88.400754
            rpn_cls: 0.0554, rpn_box: 0.0037, rcnn_cls: 0.2631, rcnn_box 0.2668
[session 1][epoch  1][iter  200] loss: 0.7695, lr: 4.00e-03
            fg/bg=(25/103), time cost: 88.156886
            rpn_cls: 0.0085, rpn_box: 0.0058, rcnn_cls: 0.3238, rcnn_box 0.3907
[session 1][epoch  1][iter  300] loss: 0.7010, lr: 4.00e-03
            fg/bg=(32/96), time cost: 88.531185
            rpn_cls: 0.0446, rpn_box: 0.0237, rcnn_cls: 0.3977, rcnn_box 0.4254
Traceback (most recent call last):
  File "trainval_net.py", line 316, in <module>
    data = next(data_iter)
  File "/home/cyb/user/installations/anaconda3/envs/my_project/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 210, in __next__
    return self._process_next_batch(batch)
  File "/home/cyb/user/installations/anaconda3/envs/my_project/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 230, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
  File "/home/cyb/user/installations/anaconda3/envs/my_project/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 42, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/cyb/user/installations/anaconda3/envs/my_project/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 42, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/roi_data_layer/roibatchLoader.py", line 67, in __getitem__
    blobs = get_minibatch(minibatch_db, self._num_classes)
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/roi_data_layer/minibatch.py", line 30, in get_minibatch
    im_blob, im_scales = _get_image_blob(roidb, random_scale_inds)
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/roi_data_layer/minibatch.py", line 79, in _get_image_blob
    cfg.TRAIN.MAX_SIZE)
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/utils/blob.py", line 39, in prep_im_for_blob
    im -= pixel_means
ValueError: operands could not be broadcast together with shapes (2666,4010,4) (1,1,3) (2666,4010,4) 

Remember, following your recommendation, I already omitted the class which had a single point instead of a bounding box, set batch_size to 1 (can you explain why that matters?), and also set cropping to False.
What is the problem now? Why does it fail on the dimensions? It also happens if I undo the crop=False change you recommended (I thought it was related).

CodeJjang commented 6 years ago

@jwyang Updated the above comment; I had NaN before, but after applying your fix from the last few days I now only have the dimension problem.

CodeJjang commented 6 years ago

@jwyang Fixed the above error by loading my high-resolution images (which were apparently TIFF) as 'RGB', but I'm back to the NaN error despite applying all of the above and also your latest commits:

[session 1][epoch  1][iter    0] loss: 2.2763, lr: 4.00e-03
            fg/bg=(9/119), time cost: 1.596377
            rpn_cls: 0.6654, rpn_box: 0.0273, rcnn_cls: 1.4431, rcnn_box 0.1405
[session 1][epoch  1][iter  100] loss: 1.0325, lr: 4.00e-03
            fg/bg=(5/123), time cost: 88.442405
            rpn_cls: 0.1533, rpn_box: 0.0105, rcnn_cls: 0.1821, rcnn_box 0.0033
[session 1][epoch  1][iter  200] loss: 0.6312, lr: 4.00e-03
            fg/bg=(5/123), time cost: 88.410855
            rpn_cls: 0.1094, rpn_box: 0.0155, rcnn_cls: 0.1852, rcnn_box 0.0175
[session 1][epoch  1][iter  300] loss: 0.6391, lr: 4.00e-03
            fg/bg=(4/124), time cost: 88.356845
            rpn_cls: 0.0765, rpn_box: 0.0027, rcnn_cls: 0.1644, rcnn_box 0.0131
[session 1][epoch  1][iter  400] loss: 0.7291, lr: 4.00e-03
            fg/bg=(26/102), time cost: 88.262887
            rpn_cls: 0.0375, rpn_box: 0.0052, rcnn_cls: 0.5507, rcnn_box 0.5645
[session 1][epoch  1][iter  500] loss: 0.8458, lr: 4.00e-03
            fg/bg=(10/118), time cost: 88.494323
            rpn_cls: 0.0563, rpn_box: 0.0030, rcnn_cls: 0.2979, rcnn_box 0.1833
[session 1][epoch  1][iter  600] loss: 0.8930, lr: 4.00e-03
            fg/bg=(25/103), time cost: 88.526348
            rpn_cls: 0.0234, rpn_box: 0.0164, rcnn_cls: 0.3348, rcnn_box 0.3810
[session 1][epoch  1][iter  700] loss: 0.7675, lr: 4.00e-03
            fg/bg=(32/96), time cost: 88.312344
            rpn_cls: 0.0083, rpn_box: 0.0079, rcnn_cls: 0.3212, rcnn_box 0.5632
[session 1][epoch  1][iter  800] loss: 0.8177, lr: 4.00e-03
            fg/bg=(9/119), time cost: 88.433375
            rpn_cls: 0.0040, rpn_box: 0.0008, rcnn_cls: 0.0611, rcnn_box 0.1617
[session 1][epoch  1][iter  900] loss: 0.7154, lr: 4.00e-03
            fg/bg=(8/120), time cost: 88.386658
            rpn_cls: 0.0037, rpn_box: 0.0019, rcnn_cls: 0.0853, rcnn_box 0.1004
[session 1][epoch  1][iter 1000] loss: nan, lr: 4.00e-03
            fg/bg=(128/0), time cost: 88.080895
            rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch  1][iter 1100] loss: nan, lr: 4.00e-03
            fg/bg=(128/0), time cost: 87.066358
            rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
CodeJjang commented 6 years ago

The NaN was fixed when I set the MAX_NUM_GT_BOXES to the correct value.
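For anyone else hitting this: the value can be read straight off the annotations, e.g. with the sketch below (assuming VOC-style XML; MAX_NUM_GT_BOXES should be at least the printed maximum):

# Sketch: find the maximum number of ground-truth boxes in any single image.
import glob
import xml.etree.ElementTree as ET

max_boxes = 0
for xml_path in glob.glob('data/VOCdevkit2007/VOC2007/Annotations/*.xml'):
    n = len(ET.parse(xml_path).getroot().findall('object'))
    max_boxes = max(max_boxes, n)
print('MAX_NUM_GT_BOXES should be at least', max_boxes)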

zorrocai commented 6 years ago

@CodeJjang Hi, may I ask about your solution to "RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch_1513368888240/work/torch/lib/THC/generic/THCStorage.c:36"?

zcunyi commented 5 years ago

@CodeJjang Hello, how did you solve the "ValueError: operands could not be broadcast together with shapes (2666,4010,4) (1,1,3) (2666,4010,4)" problem?

E-Dreamer-LQ commented 5 years ago

@CodeJjang I also detect very small targets; how did you choose all the size-related parameters?

xwjBupt commented 5 years ago

@jwyang I also have this problem, what should I do?

/home/xwj/anaconda3/envs/torch1.0/bin/python /home/xwj/pycharm-2018.3.6/helpers/pydev/pydevd.py --multiproc --qt-support=auto --client 127.0.0.1 --port 44855 --file /media/xwj/Programm/Python/faster-rcnn.pytorch/train_copy.py
pydev debugger: process 54969 is connecting

Connected to pydev debugger (build 183.6156.13)
Called with args:
Namespace(batch_size=16, checkepoch=1, checkpoint=0, checkpoint_interval=10000, checksession=1, class_agnostic=False, cuda=True, dataset='pascal_voc', disp_interval=100, large_scale=False, lr=0.001, lr_decay_gamma=0.1, lr_decay_step=5, mGPUs=False, max_epochs=20, net='vgg16', num_workers=0, optimizer='sgd', resume=False, save_dir='models', session=1, start_epoch=1, use_tfboard=True)
/media/xwj/Programm/Python/faster-rcnn.pytorch/lib/model/utils/config.py:374: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  yaml_cfg = edict(yaml.load(f))
Using config:
{'ANCHOR_RATIOS': [0.5, 1, 2],
 'ANCHOR_SCALES': [8, 16, 32],
 'CROP_RESIZE_WITH_MAX_POOL': False,
 'CUDA': False,
 'DATA_DIR': '/media/xwj/Programm/Python/faster-rcnn.pytorch/data',
 'DEDUP_BOXES': 0.0625,
 'EPS': 1e-14,
 'EXP_DIR': 'vgg16',
 'FEAT_STRIDE': [16],
 'GPU_ID': 0,
 'MATLAB': 'matlab',
 'MAX_NUM_GT_BOXES': 20,
 'MOBILENET': {'DEPTH_MULTIPLIER': 1.0, 'FIXED_LAYERS': 5, 'REGU_DEPTH': False, 'WEIGHT_DECAY': 4e-05},
 'PIXEL_MEANS': array([[[102.9801, 115.9465, 122.7717]]]),
 'POOLING_MODE': 'align',
 'POOLING_SIZE': 7,
 'RESNET': {'FIXED_BLOCKS': 1, 'MAX_POOL': False},
 'RNG_SEED': 3,
 'ROOT_DIR': '/media/xwj/Programm/Python/faster-rcnn.pytorch',
 'TEST': {'BBOX_REG': True, 'HAS_RPN': True, 'MAX_SIZE': 1000, 'MODE': 'nms', 'NMS': 0.3, 'PROPOSAL_METHOD': 'gt', 'RPN_MIN_SIZE': 16, 'RPN_NMS_THRESH': 0.7, 'RPN_POST_NMS_TOP_N': 300, 'RPN_PRE_NMS_TOP_N': 6000, 'RPN_TOP_N': 5000, 'SCALES': [600], 'SVM': False},
 'TRAIN': {'ASPECT_GROUPING': False, 'BATCH_SIZE': 256, 'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0], 'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0], 'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2], 'BBOX_NORMALIZE_TARGETS': True, 'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True, 'BBOX_REG': True, 'BBOX_THRESH': 0.5, 'BG_THRESH_HI': 0.5, 'BG_THRESH_LO': 0.0, 'BIAS_DECAY': False, 'BN_TRAIN': False, 'DISPLAY': 10, 'DOUBLE_BIAS': True, 'FG_FRACTION': 0.25, 'FG_THRESH': 0.5, 'GAMMA': 0.1, 'HAS_RPN': True, 'IMS_PER_BATCH': 1, 'LEARNING_RATE': 0.01, 'MAX_SIZE': 1000, 'MOMENTUM': 0.9, 'PROPOSAL_METHOD': 'gt', 'RPN_BATCHSIZE': 256, 'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0], 'RPN_CLOBBER_POSITIVES': False, 'RPN_FG_FRACTION': 0.5, 'RPN_MIN_SIZE': 8, 'RPN_NEGATIVE_OVERLAP': 0.3, 'RPN_NMS_THRESH': 0.7, 'RPN_POSITIVE_OVERLAP': 0.7, 'RPN_POSITIVE_WEIGHT': -1.0, 'RPN_POST_NMS_TOP_N': 2000, 'RPN_PRE_NMS_TOP_N': 12000, 'SCALES': [600], 'SNAPSHOT_ITERS': 5000, 'SNAPSHOT_KEPT': 3, 'SNAPSHOT_PREFIX': 'res101_faster_rcnn', 'STEPSIZE': [30000], 'SUMMARY_INTERVAL': 180, 'TRIM_HEIGHT': 600, 'TRIM_WIDTH': 600, 'TRUNCATED': False, 'USE_ALL_GT': True, 'USE_FLIPPED': True, 'USE_GT': False, 'WEIGHT_DECAY': 0.0005},
 'USE_GPU_NMS': True}
Loaded dataset voc_2007_trainval
Set proposal method: gt
Appending horizontally-flipped training examples...
voc_2007_trainval gt roidb loaded from /media/xwj/Programm/Python/faster-rcnn.pytorch/data/cache/voc_2007_trainval_gt_roidb.pkl
done
Preparing training data...
Image sizes loaded from /media/xwj/Programm/Python/faster-rcnn.pytorch/data/cache/voc_2007_trainval_sizes.pkl
done
before filtering, there are 18406 images...
after filtering, there are 18406 images...
18406 roidb entries
vgg16(
  (RCNN_rpn): _RPN( (RPN_Conv): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (RPN_cls_score): Conv2d(512, 18, kernel_size=(1, 1), stride=(1, 1)) (RPN_bbox_pred): Conv2d(512, 36, kernel_size=(1, 1), stride=(1, 1)) (RPN_proposal): _ProposalLayer() (RPN_anchor_target): _AnchorTargetLayer() )
  (RCNN_proposal_target): _ProposalTargetLayer()
  (RCNN_roi_pool): PrRoIPool2D()
  (RCNN_roi_align): RoIAlignAvg()
  (RCNN_roi_crop): _RoICrop()
  (RCNN_base): Sequential( (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (1): ReLU(inplace) (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (3): ReLU(inplace) (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False) (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (6): ReLU(inplace) (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (8): ReLU(inplace) (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False) (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (11): ReLU(inplace) (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (13): ReLU(inplace) (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (15): ReLU(inplace) (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False) (17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (18): ReLU(inplace) (19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (20): ReLU(inplace) (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (22): ReLU(inplace) (23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False) (24): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (25): ReLU(inplace) (26): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (27): ReLU(inplace) (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (29): ReLU(inplace) )
  (RCNN_top): Sequential( (0): Linear(in_features=25088, out_features=4096, bias=True) (1): ReLU(inplace) (2): Dropout(p=0.5) (3): Linear(in_features=4096, out_features=4096, bias=True) (4): ReLU(inplace) (5): Dropout(p=0.5) )
  (RCNN_cls_score): Linear(in_features=4096, out_features=2, bias=True)
  (RCNN_bbox_pred): Linear(in_features=4096, out_features=8, bias=True)
)
[session 1][epoch 1][iter 0/1150] loss: 5.0934, lr: 1.00e-03
            fg/bg=(98/3998), time cost: 2.378769
            rpn_cls: 0.5810, rpn_box: 0.5940, rcnn_cls: 3.8850, rcnn_box 0.0333
[session 1][epoch 1][iter 100/1150] loss: 1.3188, lr: 1.00e-03
            fg/bg=(719/3377), time cost: 237.837066
            rpn_cls: 0.2457, rpn_box: 0.0229, rcnn_cls: 0.3879, rcnn_box 0.3908
[session 1][epoch 1][iter 200/1150] loss: 0.9812, lr: 1.00e-03
            fg/bg=(673/3423), time cost: 235.772547
            rpn_cls: 0.2274, rpn_box: 0.0174, rcnn_cls: 0.3308, rcnn_box 0.3237
[session 1][epoch 1][iter 300/1150] loss: 0.9300, lr: 1.00e-03
            fg/bg=(917/3179), time cost: 237.889025
            rpn_cls: 0.2124, rpn_box: 0.0241, rcnn_cls: 0.3444, rcnn_box 0.4794
[session 1][epoch 1][iter 400/1150] loss: 0.9346, lr: 1.00e-03
            fg/bg=(156/3940), time cost: 240.610902
            rpn_cls: 0.2184, rpn_box: 0.0153, rcnn_cls: 0.2388, rcnn_box 0.0898
[session 1][epoch 1][iter 500/1150] loss: 0.9442, lr: 1.00e-03
            fg/bg=(724/3372), time cost: 243.984946
            rpn_cls: 0.2009, rpn_box: 0.0155, rcnn_cls: 0.3383, rcnn_box 0.2944
[session 1][epoch 1][iter 600/1150] loss: 0.9229, lr: 1.00e-03
            fg/bg=(920/3176), time cost: 247.035084
            rpn_cls: 0.3912, rpn_box: 0.0443, rcnn_cls: 0.3589, rcnn_box 0.4273
[session 1][epoch 1][iter 700/1150] loss: 0.8946, lr: 1.00e-03
            fg/bg=(967/3129), time cost: 247.947732
            rpn_cls: 0.3382, rpn_box: 0.0598, rcnn_cls: 0.3729, rcnn_box 0.5075
[session 1][epoch 1][iter 800/1150] loss: 0.9000, lr: 1.00e-03
            fg/bg=(1021/3075), time cost: 244.124488
            rpn_cls: 0.3048, rpn_box: 0.0643, rcnn_cls: 0.3559, rcnn_box 0.5121
[session 1][epoch 1][iter 900/1150] loss: 0.8982, lr: 1.00e-03
            fg/bg=(279/3817), time cost: 242.603654
            rpn_cls: 0.2306, rpn_box: 0.0129, rcnn_cls: 0.1911, rcnn_box 0.1284
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1532502421238/work/aten/src/THC/generated/../THCReduceAll.cuh line=317 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "/home/xwj/pycharm-2018.3.6/helpers/pydev/pydevd.py", line 1741, in <module>
    main()
  File "/home/xwj/pycharm-2018.3.6/helpers/pydev/pydevd.py", line 1735, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/home/xwj/pycharm-2018.3.6/helpers/pydev/pydevd.py", line 1135, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/xwj/pycharm-2018.3.6/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/media/xwj/Programm/Python/faster-rcnn.pytorch/train_copy.py", line 338, in <module>
    clip_gradient(fasterRCNN, 10.)
  File "/media/xwj/Programm/Python/faster-rcnn.pytorch/lib/model/utils/net_utils.py", line 43, in clip_gradient
    modulenorm = p.grad.data.norm()
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch_1532502421238/work/aten/src/THC/generated/../THCReduceAll.cuh:317

marcunzueta commented 5 years ago

@xwjBupt I have the same problem regarding the broadcast ValueError: operands could not be broadcast together with shapes (2666,4010,4) (1,1,3) (2666,4010,4)

That is because your image is not RGB; it is probably CMYK, which has 4 channels (hence the 4 in the shape) instead of 3, so the subtraction cannot be broadcast across the mismatched array shapes. All you need to do is filter out all the images that are not RGB from your dataset, or convert them to RGB. If you are using PIL, you can check the mode with im.mode.
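For example, a minimal sketch with PIL (the path and extension handling are placeholders; point this at your own image directory and keep backups before overwriting):

# Sketch: find non-RGB images (e.g. CMYK TIFFs) and convert them to RGB in place.
import glob
from PIL import Image

for path in glob.glob('data/VOCdevkit2007/VOC2007/JPEGImages/*'):
    im = Image.open(path)
    if im.mode != 'RGB':
        print('converting', path, 'mode:', im.mode)
        im.convert('RGB').save(path)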