Finetuning res101 model pretrained models error

jsuit commented 5 years ago

So if you try and retrain the resnet101 (faster_rcnn_1_10_9771.pth) you get the following error:

   File "trainval_net.py", line 336, in <module>
      optimizer.step()
   File "... pytorch_0.4/lib/python3.6/site-packages/torch/optim/sgd.py", line 101, in step
      buf.mul_(momentum).add_(1 - dampening, d_p)
   RuntimeError: The expanded size of the tensor (512) must match the existing size (1024) at 
  non-singleton dimension 1

It's the optimizer version knowledge of the parameters is different than what it actually is. Also, you get the same error if you use, /home/jonathan/faster-rcnn.pytorch/faster_rcnn_1_6_9771.pth

EMCP commented 5 years ago

this was fixed a while ago I believe.. I've successfully retrained res101 now

https://github.com/jwyang/faster-rcnn.pytorch/pull/515

AlexanderHustinx commented 5 years ago

Can you include your command-line argument and the full stack-trace? Might be able to help you out if I can recreate the issue.

Also, what version of Python, PyTorch, CUDA, are you using?

jsuit commented 5 years ago

pytorch 0.4 cuda = 9.0 python 3.6.6 faster_rcnn_1_10_9771.pth (gotten from link provided on README page)

Stack trace:

Preparing training data... done loading annotations into memory... Done (t=2.96s) creating index... index created! before filtering, there are 236574 images... after filtering, there are 234532 images... 234532 roidb entries Loading pretrained data/pretrained_model/resnet101_caffe.pth loading checkpoint faster-rcnn.pytorch/faster_rcnn_1_10_9771.pth loaded checkpoint faster-rcnn.pytorch/faster_rcnn_1_10_9771.pth Traceback (most recent call last): File "trainvalnet.py", line 336, in optimizer.step() File python3.6/site-packages/torch/optim/sgd.py", line 101, in step buf.mul(momentum).add_(1 - dampening, d_p) RuntimeError: The expanded size of the tensor (512) must match the existing size (1024) at non-singleton dimension 1

AlexanderHustinx commented 5 years ago

Could you also include the command-line argument? i.e. what you enter in your terminal to run the code

For example: python demo.py --dataset pascal --net res101 --checksession 1 --checkepoch 7 --checkpoint 10021 --cuda

jsuit commented 5 years ago

   python trainval_net.py --r True --cuda --bs 1

I'm basically saying resume training for the load_name:

(basically, I just changed lines 275-276 in trainval_net.py to load_name = my path to faster_rcnn_1_10_9771.pth)

Here's the only part I changed. Everything else is the same as the main branch. if args.resume: load_name = "path to faster_rcnn_1_10_9771.pth" print("loading checkpoint %s" % (load_name)) checkpoint = torch.load(load_name) args.session = checkpoint['session'] args.start_epoch = checkpoint['epoch'] fasterRCNN.load_state_dict(checkpoint['model']) optimizer.load_state_dict(checkpoint['optimizer'])

my default network is res101 and I'm training on coco2014.

AlexanderHustinx commented 5 years ago

Could you try to add the following tag to your command: --dataset coco I think you might be loading the pascal_voc dataset by default.

Additionally, you could try using all the command-line parameters: python trainval_net.py --dataset coco --net res101 --r True --checksession 1 --checkepoch 10 --checkpoint 9771 --bs 1 --cuda It is possible you might be overlooking something in your adjusted code.

zychen2016 commented 4 years ago

have you solved this problem ? @jsuit

zychen2016 commented 4 years ago

Hello, @AlexanderHustinx , I have the same problem

Using res101 pretrained model with trainval_net.py.

But using the same model with test_net.py and demo.py , there is no problem

Using pytorch-1.0 branch

AlexanderHustinx commented 4 years ago

Could you show the stack trace and error? Maybe I can help spot the issue

zychen2016 commented 4 years ago

Could you show the stack trace and error? Maybe I can help spot the issue

Thanks. @AlexanderHustinx

1.

Using CUDA_VISBLE_DEVICES=0 python3 trainval_net.py --r True --dataset pascal_voc --net vgg16 --bs 1 --nw 0 --lr 1e-3 --lr_decay_step 5 --cuda --checksession 1 --checkepoch 6 --checkpoint 10021 --use_tfb

Called with args: Namespace(batch_size=1, checkepoch=6, checkpoint=10021, checkpoint_interval=10000, checksession=1, class_agnostic=False, cuda=True, dataset='pascal_voc', disp_interval=100, large_scale=False, lr=0.001, lr_decay_gamma=0.1, lr_decay_step=5, mGPUs=False, max_epochs=20, net='vgg16', num_workers=0, optimizer='sgd', resume=True, save_dir='models', session=1, start_epoch=1, use_tfboard=True) /data2/CZY/data/faster-rcnn.pytorch/lib/model/utils/config.py:374: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details. yaml_cfg = edict(yaml.load(f)) Using config: {'ANCHOR_RATIOS': [0.5, 1, 2], 'ANCHOR_SCALES': [8, 16, 32], 'CROP_RESIZE_WITH_MAX_POOL': False, 'CUDA': False, 'DATA_DIR': '/data2/CZY/data/faster-rcnn.pytorch/data', 'DEDUP_BOXES': 0.0625, 'EPS': 1e-14, 'EXP_DIR': 'vgg16', 'FEAT_STRIDE': [16], 'GPU_ID': 0, 'MATLAB': 'matlab', 'MAX_NUM_GT_BOXES': 20, 'MOBILENET': {'DEPTH_MULTIPLIER': 1.0, 'FIXED_LAYERS': 5, 'REGU_DEPTH': False, 'WEIGHT_DECAY': 4e-05}, 'PIXEL_MEANS': array([[[102.9801, 115.9465, 122.7717]]]), 'POOLING_MODE': 'align', 'POOLING_SIZE': 7, 'RESNET': {'FIXED_BLOCKS': 1, 'MAX_POOL': False}, 'RNG_SEED': 3, 'ROOT_DIR': '/data2/CZY/data/faster-rcnn.pytorch', 'TEST': {'BBOX_REG': True, 'HAS_RPN': True, 'MAX_SIZE': 1000, 'MODE': 'nms', 'NMS': 0.3, 'PROPOSAL_METHOD': 'gt', 'RPN_MIN_SIZE': 16, 'RPN_NMS_THRESH': 0.7, 'RPN_POST_NMS_TOP_N': 300, 'RPN_PRE_NMS_TOP_N': 6000, 'RPN_TOP_N': 5000, 'SCALES': [600], 'SVM': False}, 'TRAIN': {'ASPECT_GROUPING': False, 'BATCH_SIZE': 256, 'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0], 'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0], 'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2], 'BBOX_NORMALIZE_TARGETS': True, 'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True, 'BBOX_REG': True, 'BBOX_THRESH': 0.5, 'BG_THRESH_HI': 0.5, 'BG_THRESH_LO': 0.0, 'BIAS_DECAY': False, 'BN_TRAIN': False, 'DISPLAY': 10, 'DOUBLE_BIAS': True, 'FG_FRACTION': 0.25, 'FG_THRESH': 0.5, 'GAMMA': 0.1, 'HAS_RPN': True, 'IMS_PER_BATCH': 1, 'LEARNING_RATE': 0.01, 'MAX_SIZE': 1000, 'MOMENTUM': 0.9, 'PROPOSAL_METHOD': 'gt', 'RPN_BATCHSIZE': 256, 'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0], 'RPN_CLOBBER_POSITIVES': False, 'RPN_FG_FRACTION': 0.5, 'RPN_MIN_SIZE': 8, 'RPN_NEGATIVE_OVERLAP': 0.3, 'RPN_NMS_THRESH': 0.7, 'RPN_POSITIVE_OVERLAP': 0.7, 'RPN_POSITIVE_WEIGHT': -1.0, 'RPN_POST_NMS_TOP_N': 2000, 'RPN_PRE_NMS_TOP_N': 12000, 'SCALES': [600], 'SNAPSHOT_ITERS': 5000, 'SNAPSHOT_KEPT': 3, 'SNAPSHOT_PREFIX': 'res101_faster_rcnn', 'STEPSIZE': [30000], 'SUMMARY_INTERVAL': 180, 'TRIM_HEIGHT': 600, 'TRIM_WIDTH': 600, 'TRUNCATED': False, 'USE_ALL_GT': True, 'USE_FLIPPED': True, 'USE_GT': False, 'WEIGHT_DECAY': 0.0005}, 'USE_GPU_NMS': True} Loaded dataset voc_2007_trainval for training Set proposal method: gt Appending horizontally-flipped training examples... voc_2007_trainval gt roidb loaded from /data2/CZY/data/faster-rcnn.pytorch/data/cache/voc_2007_trainval_gt_roidb.pkl done Preparing training data... done before filtering, there are 10022 images... after filtering, there are 10022 images... 10022 roidb entries Loading pretrained weights from data/pretrained_model/vgg16_caffe.pth loading checkpoint models/vgg16/pascal_voc/faster_rcnn_1_6_10021.pth loaded checkpoint models/vgg16/pascal_voc/faster_rcnn_1_6_10021.pth Traceback (most recent call last): File "trainvalnet.py", line 332, in optimizer.step() File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/torch/optim/sgd.py", line 101, in step buf.mul(momentum).add_(1 - dampening, d_p) RuntimeError: The size of tensor a (4096) must match the size of tensor b (3) at non-singleton dimension 3

Using CUDA_VISBLE_DEVICES=0 python3 trainval_net.py --r True --dataset pascal_voc --net res101 --bs 1 --nw 0 --lr 1e-3 --lr_decay_step 5 --cuda --checksession 1 --checkepoch 7 --checkpoint 10021 --use_tfb

Called with args: Namespace(batch_size=1, checkepoch=7, checkpoint=10021, checkpoint_interval=10000, checksession=1, class_agnostic=False, cuda=True, dataset='pascal_voc', disp_interval=100, large_scale=False, lr=0.001, lr_decay_gamma=0.1, lr_decay_step=5, mGPUs=False, max_epochs=20, net='res101', num_workers=0, optimizer='sgd', resume=True, save_dir='models', session=1, start_epoch=1, use_tfboard=True) /data2/CZY/data/faster-rcnn.pytorch/lib/model/utils/config.py:374: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details. yaml_cfg = edict(yaml.load(f)) Using config: {'ANCHOR_RATIOS': [0.5, 1, 2], 'ANCHOR_SCALES': [8, 16, 32], 'CROP_RESIZE_WITH_MAX_POOL': False, 'CUDA': False, 'DATA_DIR': '/data2/CZY/data/faster-rcnn.pytorch/data', 'DEDUP_BOXES': 0.0625, 'EPS': 1e-14, 'EXP_DIR': 'res101', 'FEAT_STRIDE': [16], 'GPU_ID': 0, 'MATLAB': 'matlab', 'MAX_NUM_GT_BOXES': 20, 'MOBILENET': {'DEPTH_MULTIPLIER': 1.0, 'FIXED_LAYERS': 5, 'REGU_DEPTH': False, 'WEIGHT_DECAY': 4e-05}, 'PIXEL_MEANS': array([[[102.9801, 115.9465, 122.7717]]]), 'POOLING_MODE': 'align', 'POOLING_SIZE': 7, 'RESNET': {'FIXED_BLOCKS': 1, 'MAX_POOL': False}, 'RNG_SEED': 3, 'ROOT_DIR': '/data2/CZY/data/faster-rcnn.pytorch', 'TEST': {'BBOX_REG': True, 'HAS_RPN': True, 'MAX_SIZE': 1000, 'MODE': 'nms', 'NMS': 0.3, 'PROPOSAL_METHOD': 'gt', 'RPN_MIN_SIZE': 16, 'RPN_NMS_THRESH': 0.7, 'RPN_POST_NMS_TOP_N': 300, 'RPN_PRE_NMS_TOP_N': 6000, 'RPN_TOP_N': 5000, 'SCALES': [600], 'SVM': False}, 'TRAIN': {'ASPECT_GROUPING': False, 'BATCH_SIZE': 128, 'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0], 'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0], 'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2], 'BBOX_NORMALIZE_TARGETS': True, 'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True, 'BBOX_REG': True, 'BBOX_THRESH': 0.5, 'BG_THRESH_HI': 0.5, 'BG_THRESH_LO': 0.0, 'BIAS_DECAY': False, 'BN_TRAIN': False, 'DISPLAY': 20, 'DOUBLE_BIAS': False, 'FG_FRACTION': 0.25, 'FG_THRESH': 0.5, 'GAMMA': 0.1, 'HAS_RPN': True, 'IMS_PER_BATCH': 1, 'LEARNING_RATE': 0.001, 'MAX_SIZE': 1000, 'MOMENTUM': 0.9, 'PROPOSAL_METHOD': 'gt', 'RPN_BATCHSIZE': 256, 'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0], 'RPN_CLOBBER_POSITIVES': False, 'RPN_FG_FRACTION': 0.5, 'RPN_MIN_SIZE': 8, 'RPN_NEGATIVE_OVERLAP': 0.3, 'RPN_NMS_THRESH': 0.7, 'RPN_POSITIVE_OVERLAP': 0.7, 'RPN_POSITIVE_WEIGHT': -1.0, 'RPN_POST_NMS_TOP_N': 2000, 'RPN_PRE_NMS_TOP_N': 12000, 'SCALES': [600], 'SNAPSHOT_ITERS': 5000, 'SNAPSHOT_KEPT': 3, 'SNAPSHOT_PREFIX': 'res101_faster_rcnn', 'STEPSIZE': [30000], 'SUMMARY_INTERVAL': 180, 'TRIM_HEIGHT': 600, 'TRIM_WIDTH': 600, 'TRUNCATED': False, 'USE_ALL_GT': True, 'USE_FLIPPED': True, 'USE_GT': False, 'WEIGHT_DECAY': 0.0001}, 'USE_GPU_NMS': True} Loaded dataset voc_2007_trainval for training Set proposal method: gt Appending horizontally-flipped training examples... voc_2007_trainval gt roidb loaded from /data2/CZY/data/faster-rcnn.pytorch/data/cache/voc_2007_trainval_gt_roidb.pkl done Preparing training data... done before filtering, there are 10022 images... after filtering, there are 10022 images... 10022 roidb entries Loading pretrained weights from data/pretrained_model/resnet101_caffe.pth loading checkpoint models/res101/pascal_voc/faster_rcnn_1_7_10021.pth loaded checkpoint models/res101/pascal_voc/faster_rcnn_1_7_10021.pth Traceback (most recent call last): File "trainvalnet.py", line 332, in optimizer.step() File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/torch/optim/sgd.py", line 101, in step buf.mul(momentum).add_(1 - dampening, d_p) RuntimeError: The size of tensor a (512) must match the size of tensor b (1024) at non-singleton dimension 1

However , In test_net.py and demo.py , it's Ok.

AlexanderHustinx commented 4 years ago

Oef, okay this has been a while for me 😅

So you can successfully run the test_net.py and demo.py. Can you run a training session from scratch? (Not resuming)

zychen2016 commented 4 years ago

Oef, okay this has been a while for me

So you can successfully run the test_net.py and demo.py. Can you run a training session from scratch? (Not resuming)

Thank you! Yes, I already train from scratch.

AlexanderHustinx commented 4 years ago

Have you made any changes to the original resnet.py, faster_rcnn.py, or trainval_net.py?

zychen2016 commented 4 years ago

Have you made any changes to the original resnet.py, faster_rcnn.py, or trainval_net.py?

No, Just download pascal_voc and res101_caffe and faster_1_6_10021.pth.

Using pytorch-1.0 branch

AlexanderHustinx commented 4 years ago

Can you try and train any model for 1 epoch and then continue training it afterwards? Also, what OS are you running? And which version of PyTorch? 1.0.0, 1.0.1, or higher?

zychen2016 commented 4 years ago

Can you try and train any model for 1 epoch and then continue training it afterwards? Also, what OS are you running? And which version of PyTorch? 1.0.0, 1.0.1, or higher?

Using my own dataset train from scratch and resume from any epoch ,It's ok. Maybe the config set is different from authors sets. we just dowaload pascal_voc data like https://github.com/rbgirshick/py-faster-rcnn#beyond-the-demo-installation-for-training-and-testing-models

user@user:/data2/CZY/data/faster-rcnn.pytorch/data/VOCdevkit2007$ tree -d
.
├── annotations_cache
├── local
│   ├── VOC2006
│   └── VOC2007
├── results
│   ├── VOC2006
│   │   └── Main
│   └── VOC2007
│       ├── Layout
│       ├── Main
│       └── Segmentation
├── VOC2007
│   ├── Annotations
│   ├── ImageSets
│   │   ├── Layout
│   │   ├── Main
│   │   └── Segmentation
│   ├── JPEGImages
│   ├── SegmentationClass
│   └── SegmentationObject
└── VOCcode

Ubuntu 16.04 PyTorch 1.0.0

Here, I have another question: After rotating the image which has bounding box, the rotated bounding box has more "empty" area that doesn't has object, It will harm for the model performance? for example : https://imgaug.readthedocs.io/en/latest/_images/rotation.jpg

AlexanderHustinx commented 4 years ago

I assume it is indeed something with the configs, maybe it's related to the number of anchor scales and ratios you're using?

After rotating the image which has bounding box, the rotated bounding box has more "empty" area that doesn't has object, It will harm for the model performance?

Regarding this question, you won't know for sure until you try. It is possible that your model will not mind if trained enough. But, I would say that it can indeed slow down your training and even be harmful for performance. This is because you'll be training the R-CNN to learn parts of a feature map (after RoI Pooling) that contain no distinct features of the object you're trying to find.

zychen2016 commented 4 years ago

I assume it is indeed something with the configs, maybe it's related to the number of anchor scales and ratios you're using?

After rotating the image which has bounding box, the rotated bounding box has more "empty" area that doesn't has object, It will harm for the model performance?

Regarding this question, you won't know for sure until you try. It is possible that your model will not mind if trained enough. But, I would say that it can indeed slow down your training and even be harmful for performance. This is because you'll be training the R-CNN to learn parts of a feature map (after RoI Pooling) that contain no distinct features of the object you're trying to find.

Thanks. Using default configs for pascal_voc. In my opinion, YOLOs need set different anchor scales and ratio for different datasets, faster rcnn also needed?

Could you have another way to do data augment?

AlexanderHustinx commented 4 years ago

In my opinion, YOLOs need set different anchor scales and ratio for different datasets, faster rcnn also needed?

You are correct. It depends on your dataset, e.g. if you have a dataset of only small objects you'll want smaller anchor scales; if there are only big objects using larger scales will suffice; when the objects range from small to large, using more scales likely benefits your performance. Same goes for ratios, if you assume only long/narrow objects, i.e. objects where width > height, you might want to consider ratios that give e.g. w\h = [1, 1.5, 2], instead of the default [0.5, 1, 2]

Note that when changing the scales (for PASCAL VOC by default: [8, 16, 32]), these values aren't the final sizes in pixes. Instead they will still be multiplied by 16 as part of another hyperparameter in the code. So the final scales would be [128, 256, 512].

Could you have another way to do data augment?

This really depends on your dataset, the expected orientation of the objects etc. Faster RCNN is not rotation invariant, so if you want to find e.g. a quokka (as in your example) in cases where it is standing and it is laying, you need to train on both cases. e.g. by rotating the image (though just more data is obviously better)

An easy and almost always valid data augmentation for object detection is (horizontal) flipping. You can also consider random cropping: leaving out parts of the image, sometimes also leaving out parts of the objects (in the PASCAL VOC dataset these objects would be labeled as 'truncated')

zychen2016 commented 4 years ago

I happend another question，could you help me？ For the test result， I want to get coco metrics. So I want to use --dataset coco parameters to train my own data, It must be that my image name must be the same format as COCO? Or how to train my own datasets with coco format?

@AlexanderHustinx

AlexanderHustinx commented 4 years ago

You'll need to have a look at the lib/datasets/imdb.py and lib/datasets/coco.py to mimic the structure of your dataset (images and annotations), and you'll need to add your own dataset to lib/datasets/factory.py.

An example of adding a dataset that uses the PASCAL VOC eval metrics can be found here: https://github.com/haleuh/face-faster-rcnn.pytorch/tree/master/lib/datasets You can look at what they changed between PASCAL VOC and WIDER FACE to get it working. This should probably be enough to get it working for your own dataset too.

Alternatively, you could have a look at: https://github.com/deboc/py-faster-rcnn/tree/master/help

zychen2016 commented 4 years ago

You'll need to have a look at the lib/datasets/imdb.py and lib/datasets/coco.py to mimic the structure of your dataset (images and annotations), and you'll need to add your own dataset to lib/datasets/factory.py.

An example of adding a dataset that uses the PASCAL VOC eval metrics can be found here: https://github.com/haleuh/face-faster-rcnn.pytorch/tree/master/lib/datasets You can look at what they changed between PASCAL VOC and WIDER FACE to get it working. This should probably be enough to get it working for your own dataset too.

Alternatively, you could have a look at: https://github.com/deboc/py-faster-rcnn/tree/master/help

Thank you very much for your help! When run trainval_net.py with --use_tfb,There are only has train loss curve ,no val loss curve. It is right?In my opinions,There should be val loss curve.

update: I know , may be I should write some code in test_net.py to save losses and use tensorboard to visualizing. Right?

AlexanderHustinx commented 4 years ago

You're right, there is no actual validation set used. You can write a small if-statement and for-loop in the trainval_net.py to e.g. once every 500 epochs run a validation set. To do so you must make sure that the validation set is not in the training set, and during the validation set you use with torch.no_grad(): in order to make sure you don't train on the validation set.

The test_net.py doesn't use tensorboard because during testing you don't calculate losses, instead you only 'infer' the boxes and classes. This is because technically you don't have the targets during test time, only during training and validation.

zychen2016 commented 4 years ago

You're right, there is no actual validation set used. You can write a small if-statement and for-loop in the trainval_net.py to e.g. once every 500 epochs run a validation set. To do so you must make sure that the validation set is not in the training set, and during the validation set you use with torch.no_grad(): in order to make sure you don't train on the validation set.

The test_net.py doesn't use tensorboard because during testing you don't calculate losses, instead you only 'infer' the boxes and classes. This is because technically you don't have the targets during test time, only during training and validation.

Thank you! I will have a try.

zychen2016 commented 4 years ago

I have a try with this code in trainval_net.py @AlexanderHustinx

1.I want to run a validation every 500 steps. But all val data are needed to be run at a validation? Or just one batch size one validation?

Could you review my code? Thanks

   if args.dataset == "pascal_voc":
       args.imdb_name = "voc_2007_trainval"
       args.imdbval_name = "voc_2007_test"
       ###############################
       # validation dataset
       ###############################
+      args.imdbval_name_val="voc_2007_val"
       args.set_cfgs = ['ANCHOR_SCALES', '[8, 16, 32]', 'ANCHOR_RATIOS', '[0.5,1,2]', 'MAX_NUM_GT_BOXES', '20']
   imdb, roidb, ratio_list, ratio_index = combined_roidb(args.imdb_name)
+  imdb_val, roidb_val, ratio_list_val, ratio_index_val = combined_roidb(args.imdbval_name_val)
   train_size = len(roidb)
+  val_size=len(roidb_val)
   sampler_batch = sampler(train_size, args.batch_size)
+  sampler_batch_val = sampler(val_size, args.batch_size)

   dataset = roibatchLoader(roidb, ratio_list, ratio_index, args.batch_size, \
                            imdb.num_classes, training=True)
+  dataset_val = roibatchLoader(roidb_val, ratio_list_val, ratio_index_val, args.batch_size, imdb_val.num_classes, training=False)
   dataloader = torch.utils.data.DataLoader(dataset,batch_size=args.batch_size,sampler=sampler_batch, num_workers=args.num_workers)
+  dataloader_val = torch.utils.data.DataLoader(dataset_val,batch_size=args.batch_size,sampler=sampler_batch_val, num_workers=args.num_workers)
     data_iter = iter(dataloader)
+    data_iter_val=iter(dataloader_val)

+      if step%500==0:
+            while True:
+                data_val=next(data_iter_val)
+                with torch.no_grad():
+                       im_val_data.resize_(data_val[0].size()).copy_(data_val[0])
+                       im_val_info.resize_(data_val[1].size()).copy_(data_val[1])
+                       gt_val_boxes.resize_(data_val[2].size()).copy_(data_val[2])
+                       num_val_boxes.resize_(data_val[3].size()).copy_(data_val[3])
+                       rois_val, cls_prob_val, bbox_pred_val, \
+                       rpn_loss_cls_val, rpn_loss_box_val, \
+                       RCNN_loss_cls_val, RCNN_loss_bbox_val, \
+                       rois_label_val = fasterRCNN(im_val_data, im_val_info, gt_val_boxes, num_val_boxes)
+                loss_rpn_cls_val = rpn_loss_cls_val.item()
+                loss_rpn_box_val = rpn_loss_box_val.item()
+                loss_rcnn_cls_val = RCNN_loss_cls_val.item()
+                loss_rcnn_box_val = RCNN_loss_bbox_val.item()
+                loss_val = rpn_loss_cls_val.mean() + rpn_loss_box_val.mean() \
+                 + RCNN_loss_cls_val.mean() + RCNN_loss_bbox_val.mean()
+                loss_temp_val += loss_val.item()
+            val_info = {
+                'loss_val': loss_temp_val,
+                'loss_rpn_cls_val': loss_rpn_cls_val,
+                'loss_rpn_box_val': loss_rpn_box_val,
+                'loss_rcnn_cls_val': loss_rcnn_cls_val,
+                'loss_rcnn_box_val': loss_rcnn_box_val
+            }
+            logger.add_scalar("logs_s_{}/val_losses".format(args.session), val_info, (epoch - 1) * iters_per_epoch + step)

     save_name = os.path.join(output_dir, 'faster_rcnn_{}_{}_{}.pth'.format(args.session, epoch, step))

errors:

Loading pretrained weights from data/pretrained_model/resnet101_caffe.pth
[session 1][epoch  1][iter    0/1673] loss: 4.0778, lr: 1.00e-03
            fg/bg=(119/393), time cost: 2.804799
            rpn_cls: 0.6306, rpn_box: 0.0667, rcnn_cls: 2.7759, rcnn_box 0.6047
Traceback (most recent call last):
  File "trainval_net.py", line 377, in <module>
    data_val=next(data_iter_val)
  File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 615, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 209, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 800 and 1067 in dimension 2 at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/TH/generic/THTensorMoreMath.cpp:1333

AlexanderHustinx commented 4 years ago

You don't need to use the entire valuation set when validating. I would very much recommend it though! That way you can more accurately track your model's progression over time.

I would run the validation using a batch size of 1 instead of your normal batch size, that more closely resembles the results you would get during test time.

I can't really spot an error immediately, but as it's complaining about the size I assume that using a batch size of 1 during validation will likely solve the problem.

3?. You need to make sure that your training set doe not include your validation set, otherwise the added value of the validation set is negated. I mention this because your code states to use voc_2007_trainval for training and voc_2007_val for validation. voc_2007_trainval includes the validation set, so be careful there.

Saharkakavand commented 4 years ago

I have a try with this code in trainval_net.py @AlexanderHustinx

1.I want to run a validation every 500 steps. But all val data are needed to be run at a validation? Or just one batch size one validation?

Could you review my code? Thanks

   if args.dataset == "pascal_voc":
       args.imdb_name = "voc_2007_trainval"
       args.imdbval_name = "voc_2007_test"
       ###############################
       # validation dataset
       ###############################
+      args.imdbval_name_val="voc_2007_val"
       args.set_cfgs = ['ANCHOR_SCALES', '[8, 16, 32]', 'ANCHOR_RATIOS', '[0.5,1,2]', 'MAX_NUM_GT_BOXES', '20']
   imdb, roidb, ratio_list, ratio_index = combined_roidb(args.imdb_name)
+  imdb_val, roidb_val, ratio_list_val, ratio_index_val = combined_roidb(args.imdbval_name_val)
   train_size = len(roidb)
+  val_size=len(roidb_val)
   sampler_batch = sampler(train_size, args.batch_size)
+  sampler_batch_val = sampler(val_size, args.batch_size)

   dataset = roibatchLoader(roidb, ratio_list, ratio_index, args.batch_size, \
                            imdb.num_classes, training=True)
+  dataset_val = roibatchLoader(roidb_val, ratio_list_val, ratio_index_val, args.batch_size, imdb_val.num_classes, training=False)
   dataloader = torch.utils.data.DataLoader(dataset,batch_size=args.batch_size,sampler=sampler_batch, num_workers=args.num_workers)
+  dataloader_val = torch.utils.data.DataLoader(dataset_val,batch_size=args.batch_size,sampler=sampler_batch_val, num_workers=args.num_workers)
     data_iter = iter(dataloader)
+    data_iter_val=iter(dataloader_val)

+      if step%500==0:
+            while True:
+                data_val=next(data_iter_val)
+                with torch.no_grad():
+                       im_val_data.resize_(data_val[0].size()).copy_(data_val[0])
+                       im_val_info.resize_(data_val[1].size()).copy_(data_val[1])
+                       gt_val_boxes.resize_(data_val[2].size()).copy_(data_val[2])
+                       num_val_boxes.resize_(data_val[3].size()).copy_(data_val[3])
+                       rois_val, cls_prob_val, bbox_pred_val, \
+                       rpn_loss_cls_val, rpn_loss_box_val, \
+                       RCNN_loss_cls_val, RCNN_loss_bbox_val, \
+                       rois_label_val = fasterRCNN(im_val_data, im_val_info, gt_val_boxes, num_val_boxes)
+                loss_rpn_cls_val = rpn_loss_cls_val.item()
+                loss_rpn_box_val = rpn_loss_box_val.item()
+                loss_rcnn_cls_val = RCNN_loss_cls_val.item()
+                loss_rcnn_box_val = RCNN_loss_bbox_val.item()
+                loss_val = rpn_loss_cls_val.mean() + rpn_loss_box_val.mean() \
+                 + RCNN_loss_cls_val.mean() + RCNN_loss_bbox_val.mean()
+                loss_temp_val += loss_val.item()
+            val_info = {
+                'loss_val': loss_temp_val,
+                'loss_rpn_cls_val': loss_rpn_cls_val,
+                'loss_rpn_box_val': loss_rpn_box_val,
+                'loss_rcnn_cls_val': loss_rcnn_cls_val,
+                'loss_rcnn_box_val': loss_rcnn_box_val
+            }
+            logger.add_scalar("logs_s_{}/val_losses".format(args.session), val_info, (epoch - 1) * iters_per_epoch + step)

     save_name = os.path.join(output_dir, 'faster_rcnn_{}_{}_{}.pth'.format(args.session, epoch, step))

errors:

Loading pretrained weights from data/pretrained_model/resnet101_caffe.pth
[session 1][epoch  1][iter    0/1673] loss: 4.0778, lr: 1.00e-03
          fg/bg=(119/393), time cost: 2.804799
          rpn_cls: 0.6306, rpn_box: 0.0667, rcnn_cls: 2.7759, rcnn_box 0.6047
Traceback (most recent call last):
  File "trainval_net.py", line 377, in <module>
    data_val=next(data_iter_val)
  File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 615, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 209, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 800 and 1067 in dimension 2 at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/TH/generic/THTensorMoreMath.cpp:1333

Hi, did you solve the problem for validation?

tyshiwo1 commented 4 years ago

I think you may try these codes in trainval_net.py. It seems to work :

if args.resume:
    load_name = os.path.join(output_dir,
      'faster_rcnn_{}_{}_{}.pth'.format(args.checksession, args.checkepoch, args.checkpoint))
    print("loading checkpoint %s" % (load_name))
    checkpoint = torch.load(load_name)
    #args.session = checkpoint['session']
    #args.start_epoch = checkpoint['epoch']
    #fasterRCNN.load_state_dict(checkpoint['model'])
    #optimizer.load_state_dict(checkpoint['optimizer'])
    #lr = optimizer.param_groups[0]['lr'] 
    for k, v in fasterRCNN.state_dict().items():
      if k in checkpoint['model']:
        param = torch.from_numpy(np.asarray(checkpoint['model'][k].cpu()))
        v.copy_(param)
      else:
        print('lose [in faster_cnn]: {}'.format(k))

jwyang / faster-rcnn.pytorch

Finetuning res101 model pretrained models error #557