Open jsuit opened 5 years ago
this was fixed a while ago I believe.. I've successfully retrained res101 now
Can you include your command-line argument and the full stack-trace? Might be able to help you out if I can recreate the issue.
Also, what version of Python, PyTorch, CUDA, are you using?
pytorch 0.4 cuda = 9.0 python 3.6.6 faster_rcnn_1_10_9771.pth (gotten from link provided on README page)
Stack trace:
Preparing training data...
done
loading annotations into memory...
Done (t=2.96s)
creating index...
index created!
before filtering, there are 236574 images...
after filtering, there are 234532 images...
234532 roidb entries
Loading pretrained data/pretrained_model/resnet101_caffe.pth
loading checkpoint faster-rcnn.pytorch/faster_rcnn_1_10_9771.pth
loaded checkpoint faster-rcnn.pytorch/faster_rcnn_1_10_9771.pth
Traceback (most recent call last):
File "trainvalnet.py", line 336, in
Could you also include the command-line argument? i.e. what you enter in your terminal to run the code
For example:
python demo.py --dataset pascal --net res101 --checksession 1 --checkepoch 7 --checkpoint 10021 --cuda
python trainval_net.py --r True --cuda --bs 1
I'm basically saying resume training for the load_name:
(basically, I just changed lines 275-276 in trainval_net.py to load_name = my path to faster_rcnn_1_10_9771.pth)
Here's the only part I changed. Everything else is the same as the main branch. if args.resume: load_name = "path to faster_rcnn_1_10_9771.pth" print("loading checkpoint %s" % (load_name)) checkpoint = torch.load(load_name) args.session = checkpoint['session'] args.start_epoch = checkpoint['epoch'] fasterRCNN.load_state_dict(checkpoint['model']) optimizer.load_state_dict(checkpoint['optimizer'])
my default network is res101 and I'm training on coco2014.
Could you try to add the following tag to your command:
--dataset coco
I think you might be loading the pascal_voc
dataset by default.
Additionally, you could try using all the command-line parameters:
python trainval_net.py --dataset coco --net res101 --r True --checksession 1 --checkepoch 10 --checkpoint 9771 --bs 1 --cuda
It is possible you might be overlooking something in your adjusted code.
have you solved this problem ? @jsuit
Hello, @AlexanderHustinx , I have the same problem
Using res101 pretrained model with trainval_net.py.
But using the same model with test_net.py and demo.py , there is no problem
Using pytorch-1.0 branch
Could you show the stack trace and error? Maybe I can help spot the issue
Could you show the stack trace and error? Maybe I can help spot the issue
Thanks. @AlexanderHustinx
1.
Using CUDA_VISBLE_DEVICES=0 python3 trainval_net.py --r True --dataset pascal_voc --net vgg16 --bs 1 --nw 0 --lr 1e-3 --lr_decay_step 5 --cuda --checksession 1 --checkepoch 6 --checkpoint 10021 --use_tfb
Called with args:
Namespace(batch_size=1, checkepoch=6, checkpoint=10021, checkpoint_interval=10000, checksession=1, class_agnostic=False, cuda=True, dataset='pascal_voc', disp_interval=100, large_scale=False, lr=0.001, lr_decay_gamma=0.1, lr_decay_step=5, mGPUs=False, max_epochs=20, net='vgg16', num_workers=0, optimizer='sgd', resume=True, save_dir='models', session=1, start_epoch=1, use_tfboard=True)
/data2/CZY/data/faster-rcnn.pytorch/lib/model/utils/config.py:374: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
yaml_cfg = edict(yaml.load(f))
Using config:
{'ANCHOR_RATIOS': [0.5, 1, 2],
'ANCHOR_SCALES': [8, 16, 32],
'CROP_RESIZE_WITH_MAX_POOL': False,
'CUDA': False,
'DATA_DIR': '/data2/CZY/data/faster-rcnn.pytorch/data',
'DEDUP_BOXES': 0.0625,
'EPS': 1e-14,
'EXP_DIR': 'vgg16',
'FEAT_STRIDE': [16],
'GPU_ID': 0,
'MATLAB': 'matlab',
'MAX_NUM_GT_BOXES': 20,
'MOBILENET': {'DEPTH_MULTIPLIER': 1.0,
'FIXED_LAYERS': 5,
'REGU_DEPTH': False,
'WEIGHT_DECAY': 4e-05},
'PIXEL_MEANS': array([[[102.9801, 115.9465, 122.7717]]]),
'POOLING_MODE': 'align',
'POOLING_SIZE': 7,
'RESNET': {'FIXED_BLOCKS': 1, 'MAX_POOL': False},
'RNG_SEED': 3,
'ROOT_DIR': '/data2/CZY/data/faster-rcnn.pytorch',
'TEST': {'BBOX_REG': True,
'HAS_RPN': True,
'MAX_SIZE': 1000,
'MODE': 'nms',
'NMS': 0.3,
'PROPOSAL_METHOD': 'gt',
'RPN_MIN_SIZE': 16,
'RPN_NMS_THRESH': 0.7,
'RPN_POST_NMS_TOP_N': 300,
'RPN_PRE_NMS_TOP_N': 6000,
'RPN_TOP_N': 5000,
'SCALES': [600],
'SVM': False},
'TRAIN': {'ASPECT_GROUPING': False,
'BATCH_SIZE': 256,
'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
'BBOX_NORMALIZE_TARGETS': True,
'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
'BBOX_REG': True,
'BBOX_THRESH': 0.5,
'BG_THRESH_HI': 0.5,
'BG_THRESH_LO': 0.0,
'BIAS_DECAY': False,
'BN_TRAIN': False,
'DISPLAY': 10,
'DOUBLE_BIAS': True,
'FG_FRACTION': 0.25,
'FG_THRESH': 0.5,
'GAMMA': 0.1,
'HAS_RPN': True,
'IMS_PER_BATCH': 1,
'LEARNING_RATE': 0.01,
'MAX_SIZE': 1000,
'MOMENTUM': 0.9,
'PROPOSAL_METHOD': 'gt',
'RPN_BATCHSIZE': 256,
'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'RPN_CLOBBER_POSITIVES': False,
'RPN_FG_FRACTION': 0.5,
'RPN_MIN_SIZE': 8,
'RPN_NEGATIVE_OVERLAP': 0.3,
'RPN_NMS_THRESH': 0.7,
'RPN_POSITIVE_OVERLAP': 0.7,
'RPN_POSITIVE_WEIGHT': -1.0,
'RPN_POST_NMS_TOP_N': 2000,
'RPN_PRE_NMS_TOP_N': 12000,
'SCALES': [600],
'SNAPSHOT_ITERS': 5000,
'SNAPSHOT_KEPT': 3,
'SNAPSHOT_PREFIX': 'res101_faster_rcnn',
'STEPSIZE': [30000],
'SUMMARY_INTERVAL': 180,
'TRIM_HEIGHT': 600,
'TRIM_WIDTH': 600,
'TRUNCATED': False,
'USE_ALL_GT': True,
'USE_FLIPPED': True,
'USE_GT': False,
'WEIGHT_DECAY': 0.0005},
'USE_GPU_NMS': True}
Loaded dataset voc_2007_trainval
for training
Set proposal method: gt
Appending horizontally-flipped training examples...
voc_2007_trainval gt roidb loaded from /data2/CZY/data/faster-rcnn.pytorch/data/cache/voc_2007_trainval_gt_roidb.pkl
done
Preparing training data...
done
before filtering, there are 10022 images...
after filtering, there are 10022 images...
10022 roidb entries
Loading pretrained weights from data/pretrained_model/vgg16_caffe.pth
loading checkpoint models/vgg16/pascal_voc/faster_rcnn_1_6_10021.pth
loaded checkpoint models/vgg16/pascal_voc/faster_rcnn_1_6_10021.pth
Traceback (most recent call last):
File "trainvalnet.py", line 332, in
Using CUDA_VISBLE_DEVICES=0 python3 trainval_net.py --r True --dataset pascal_voc --net res101 --bs 1 --nw 0 --lr 1e-3 --lr_decay_step 5 --cuda --checksession 1 --checkepoch 7 --checkpoint 10021 --use_tfb
Called with args:
Namespace(batch_size=1, checkepoch=7, checkpoint=10021, checkpoint_interval=10000, checksession=1, class_agnostic=False, cuda=True, dataset='pascal_voc', disp_interval=100, large_scale=False, lr=0.001, lr_decay_gamma=0.1, lr_decay_step=5, mGPUs=False, max_epochs=20, net='res101', num_workers=0, optimizer='sgd', resume=True, save_dir='models', session=1, start_epoch=1, use_tfboard=True)
/data2/CZY/data/faster-rcnn.pytorch/lib/model/utils/config.py:374: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
yaml_cfg = edict(yaml.load(f))
Using config:
{'ANCHOR_RATIOS': [0.5, 1, 2],
'ANCHOR_SCALES': [8, 16, 32],
'CROP_RESIZE_WITH_MAX_POOL': False,
'CUDA': False,
'DATA_DIR': '/data2/CZY/data/faster-rcnn.pytorch/data',
'DEDUP_BOXES': 0.0625,
'EPS': 1e-14,
'EXP_DIR': 'res101',
'FEAT_STRIDE': [16],
'GPU_ID': 0,
'MATLAB': 'matlab',
'MAX_NUM_GT_BOXES': 20,
'MOBILENET': {'DEPTH_MULTIPLIER': 1.0,
'FIXED_LAYERS': 5,
'REGU_DEPTH': False,
'WEIGHT_DECAY': 4e-05},
'PIXEL_MEANS': array([[[102.9801, 115.9465, 122.7717]]]),
'POOLING_MODE': 'align',
'POOLING_SIZE': 7,
'RESNET': {'FIXED_BLOCKS': 1, 'MAX_POOL': False},
'RNG_SEED': 3,
'ROOT_DIR': '/data2/CZY/data/faster-rcnn.pytorch',
'TEST': {'BBOX_REG': True,
'HAS_RPN': True,
'MAX_SIZE': 1000,
'MODE': 'nms',
'NMS': 0.3,
'PROPOSAL_METHOD': 'gt',
'RPN_MIN_SIZE': 16,
'RPN_NMS_THRESH': 0.7,
'RPN_POST_NMS_TOP_N': 300,
'RPN_PRE_NMS_TOP_N': 6000,
'RPN_TOP_N': 5000,
'SCALES': [600],
'SVM': False},
'TRAIN': {'ASPECT_GROUPING': False,
'BATCH_SIZE': 128,
'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
'BBOX_NORMALIZE_TARGETS': True,
'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
'BBOX_REG': True,
'BBOX_THRESH': 0.5,
'BG_THRESH_HI': 0.5,
'BG_THRESH_LO': 0.0,
'BIAS_DECAY': False,
'BN_TRAIN': False,
'DISPLAY': 20,
'DOUBLE_BIAS': False,
'FG_FRACTION': 0.25,
'FG_THRESH': 0.5,
'GAMMA': 0.1,
'HAS_RPN': True,
'IMS_PER_BATCH': 1,
'LEARNING_RATE': 0.001,
'MAX_SIZE': 1000,
'MOMENTUM': 0.9,
'PROPOSAL_METHOD': 'gt',
'RPN_BATCHSIZE': 256,
'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'RPN_CLOBBER_POSITIVES': False,
'RPN_FG_FRACTION': 0.5,
'RPN_MIN_SIZE': 8,
'RPN_NEGATIVE_OVERLAP': 0.3,
'RPN_NMS_THRESH': 0.7,
'RPN_POSITIVE_OVERLAP': 0.7,
'RPN_POSITIVE_WEIGHT': -1.0,
'RPN_POST_NMS_TOP_N': 2000,
'RPN_PRE_NMS_TOP_N': 12000,
'SCALES': [600],
'SNAPSHOT_ITERS': 5000,
'SNAPSHOT_KEPT': 3,
'SNAPSHOT_PREFIX': 'res101_faster_rcnn',
'STEPSIZE': [30000],
'SUMMARY_INTERVAL': 180,
'TRIM_HEIGHT': 600,
'TRIM_WIDTH': 600,
'TRUNCATED': False,
'USE_ALL_GT': True,
'USE_FLIPPED': True,
'USE_GT': False,
'WEIGHT_DECAY': 0.0001},
'USE_GPU_NMS': True}
Loaded dataset voc_2007_trainval
for training
Set proposal method: gt
Appending horizontally-flipped training examples...
voc_2007_trainval gt roidb loaded from /data2/CZY/data/faster-rcnn.pytorch/data/cache/voc_2007_trainval_gt_roidb.pkl
done
Preparing training data...
done
before filtering, there are 10022 images...
after filtering, there are 10022 images...
10022 roidb entries
Loading pretrained weights from data/pretrained_model/resnet101_caffe.pth
loading checkpoint models/res101/pascal_voc/faster_rcnn_1_7_10021.pth
loaded checkpoint models/res101/pascal_voc/faster_rcnn_1_7_10021.pth
Traceback (most recent call last):
File "trainvalnet.py", line 332, in
Oef, okay this has been a while for me π
So you can successfully run the test_net.py and demo.py. Can you run a training session from scratch? (Not resuming)
Oef, okay this has been a while for me
So you can successfully run the test_net.py and demo.py. Can you run a training session from scratch? (Not resuming)
Thank you! Yes, I already train from scratch.
Have you made any changes to the original resnet.py, faster_rcnn.py, or trainval_net.py?
Have you made any changes to the original resnet.py, faster_rcnn.py, or trainval_net.py?
No, Just download pascal_voc and res101_caffe and faster_1_6_10021.pth.
Using pytorch-1.0 branch
Can you try and train any model for 1 epoch and then continue training it afterwards? Also, what OS are you running? And which version of PyTorch? 1.0.0, 1.0.1, or higher?
Can you try and train any model for 1 epoch and then continue training it afterwards? Also, what OS are you running? And which version of PyTorch? 1.0.0, 1.0.1, or higher?
Using my own dataset train from scratch and resume from any epoch ,It's ok. Maybe the config set is different from authors sets. we just dowaload pascal_voc data like https://github.com/rbgirshick/py-faster-rcnn#beyond-the-demo-installation-for-training-and-testing-models
user@user:/data2/CZY/data/faster-rcnn.pytorch/data/VOCdevkit2007$ tree -d
.
βββ annotations_cache
βββ local
βΒ Β βββ VOC2006
βΒ Β βββ VOC2007
βββ results
βΒ Β βββ VOC2006
βΒ Β βΒ Β βββ Main
βΒ Β βββ VOC2007
βΒ Β βββ Layout
βΒ Β βββ Main
βΒ Β βββ Segmentation
βββ VOC2007
βΒ Β βββ Annotations
βΒ Β βββ ImageSets
βΒ Β βΒ Β βββ Layout
βΒ Β βΒ Β βββ Main
βΒ Β βΒ Β βββ Segmentation
βΒ Β βββ JPEGImages
βΒ Β βββ SegmentationClass
βΒ Β βββ SegmentationObject
βββ VOCcode
Ubuntu 16.04 PyTorch 1.0.0
Here, I have another question: After rotating the image which has bounding box, the rotated bounding box has more "empty" area that doesn't has object, It will harm for the model performance? for example : https://imgaug.readthedocs.io/en/latest/_images/rotation.jpg
I assume it is indeed something with the configs, maybe it's related to the number of anchor scales and ratios you're using?
After rotating the image which has bounding box, the rotated bounding box has more "empty" area that doesn't has object, It will harm for the model performance?
Regarding this question, you won't know for sure until you try. It is possible that your model will not mind if trained enough. But, I would say that it can indeed slow down your training and even be harmful for performance. This is because you'll be training the R-CNN to learn parts of a feature map (after RoI Pooling) that contain no distinct features of the object you're trying to find.
I assume it is indeed something with the configs, maybe it's related to the number of anchor scales and ratios you're using?
After rotating the image which has bounding box, the rotated bounding box has more "empty" area that doesn't has object, It will harm for the model performance?
Regarding this question, you won't know for sure until you try. It is possible that your model will not mind if trained enough. But, I would say that it can indeed slow down your training and even be harmful for performance. This is because you'll be training the R-CNN to learn parts of a feature map (after RoI Pooling) that contain no distinct features of the object you're trying to find.
Thanks. Using default configs for pascal_voc. In my opinion, YOLOs need set different anchor scales and ratio for different datasets, faster rcnn also needed?
Could you have another way to do data augment?
In my opinion, YOLOs need set different anchor scales and ratio for different datasets, faster rcnn also needed?
You are correct. It depends on your dataset, e.g. if you have a dataset of only small objects you'll want smaller anchor scales; if there are only big objects using larger scales will suffice; when the objects range from small to large, using more scales likely benefits your performance.
Same goes for ratios, if you assume only long/narrow objects, i.e. objects where width > height
, you might want to consider ratios that give e.g. w\h = [1, 1.5, 2]
, instead of the default [0.5, 1, 2]
Note that when changing the scales (for PASCAL VOC by default: [8, 16, 32]
), these values aren't the final sizes in pixes. Instead they will still be multiplied by 16 as part of another hyperparameter in the code. So the final scales would be [128, 256, 512]
.
Could you have another way to do data augment?
This really depends on your dataset, the expected orientation of the objects etc. Faster RCNN is not rotation invariant, so if you want to find e.g. a quokka (as in your example) in cases where it is standing and it is laying, you need to train on both cases. e.g. by rotating the image (though just more data is obviously better)
An easy and almost always valid data augmentation for object detection is (horizontal) flipping. You can also consider random cropping: leaving out parts of the image, sometimes also leaving out parts of the objects (in the PASCAL VOC dataset these objects would be labeled as 'truncated')
I happend another questionοΌcould you help meοΌ
For the test resultοΌ I want to get coco metrics.
So I want to use --dataset coco
parameters to train my own data, It must be that my image name must be the same format as COCO?
Or how to train my own datasets with coco format?
@AlexanderHustinx
You'll need to have a look at the lib/datasets/imdb.py
and lib/datasets/coco.py
to mimic the structure of your dataset (images and annotations), and you'll need to add your own dataset to lib/datasets/factory.py
.
An example of adding a dataset that uses the PASCAL VOC eval metrics can be found here: https://github.com/haleuh/face-faster-rcnn.pytorch/tree/master/lib/datasets You can look at what they changed between PASCAL VOC and WIDER FACE to get it working. This should probably be enough to get it working for your own dataset too.
Alternatively, you could have a look at: https://github.com/deboc/py-faster-rcnn/tree/master/help
You'll need to have a look at the
lib/datasets/imdb.py
andlib/datasets/coco.py
to mimic the structure of your dataset (images and annotations), and you'll need to add your own dataset tolib/datasets/factory.py
.An example of adding a dataset that uses the PASCAL VOC eval metrics can be found here: https://github.com/haleuh/face-faster-rcnn.pytorch/tree/master/lib/datasets You can look at what they changed between PASCAL VOC and WIDER FACE to get it working. This should probably be enough to get it working for your own dataset too.
Alternatively, you could have a look at: https://github.com/deboc/py-faster-rcnn/tree/master/help
Thank you very much for your help!
When run trainval_net.py with --use_tfb
,There are only has train loss curve ,no val loss curve.
It is right?In my opinions,There should be val loss curve.
update: I know , may be I should write some code in test_net.py to save losses and use tensorboard to visualizing. Right?
You're right, there is no actual validation set used.
You can write a small if-statement and for-loop in the trainval_net.py
to e.g. once every 500 epochs run a validation set.
To do so you must make sure that the validation set is not in the training set, and during the validation set you use with torch.no_grad():
in order to make sure you don't train on the validation set.
The test_net.py
doesn't use tensorboard because during testing you don't calculate losses, instead you only 'infer' the boxes and classes. This is because technically you don't have the targets during test time, only during training and validation.
You're right, there is no actual validation set used. You can write a small if-statement and for-loop in the
trainval_net.py
to e.g. once every 500 epochs run a validation set. To do so you must make sure that the validation set is not in the training set, and during the validation set you usewith torch.no_grad():
in order to make sure you don't train on the validation set.The
test_net.py
doesn't use tensorboard because during testing you don't calculate losses, instead you only 'infer' the boxes and classes. This is because technically you don't have the targets during test time, only during training and validation.
Thank you! I will have a try.
I have a try with this code in trainval_net.py @AlexanderHustinx
1.I want to run a validation every 500 steps. But all val data are needed to be run at a validation? Or just one batch size one validation?
if args.dataset == "pascal_voc":
args.imdb_name = "voc_2007_trainval"
args.imdbval_name = "voc_2007_test"
###############################
# validation dataset
###############################
+ args.imdbval_name_val="voc_2007_val"
args.set_cfgs = ['ANCHOR_SCALES', '[8, 16, 32]', 'ANCHOR_RATIOS', '[0.5,1,2]', 'MAX_NUM_GT_BOXES', '20']
imdb, roidb, ratio_list, ratio_index = combined_roidb(args.imdb_name)
+ imdb_val, roidb_val, ratio_list_val, ratio_index_val = combined_roidb(args.imdbval_name_val)
train_size = len(roidb)
+ val_size=len(roidb_val)
sampler_batch = sampler(train_size, args.batch_size)
+ sampler_batch_val = sampler(val_size, args.batch_size)
dataset = roibatchLoader(roidb, ratio_list, ratio_index, args.batch_size, \
imdb.num_classes, training=True)
+ dataset_val = roibatchLoader(roidb_val, ratio_list_val, ratio_index_val, args.batch_size, imdb_val.num_classes, training=False)
dataloader = torch.utils.data.DataLoader(dataset,batch_size=args.batch_size,sampler=sampler_batch, num_workers=args.num_workers)
+ dataloader_val = torch.utils.data.DataLoader(dataset_val,batch_size=args.batch_size,sampler=sampler_batch_val, num_workers=args.num_workers)
data_iter = iter(dataloader)
+ data_iter_val=iter(dataloader_val)
+ if step%500==0:
+ while True:
+ data_val=next(data_iter_val)
+ with torch.no_grad():
+ im_val_data.resize_(data_val[0].size()).copy_(data_val[0])
+ im_val_info.resize_(data_val[1].size()).copy_(data_val[1])
+ gt_val_boxes.resize_(data_val[2].size()).copy_(data_val[2])
+ num_val_boxes.resize_(data_val[3].size()).copy_(data_val[3])
+ rois_val, cls_prob_val, bbox_pred_val, \
+ rpn_loss_cls_val, rpn_loss_box_val, \
+ RCNN_loss_cls_val, RCNN_loss_bbox_val, \
+ rois_label_val = fasterRCNN(im_val_data, im_val_info, gt_val_boxes, num_val_boxes)
+ loss_rpn_cls_val = rpn_loss_cls_val.item()
+ loss_rpn_box_val = rpn_loss_box_val.item()
+ loss_rcnn_cls_val = RCNN_loss_cls_val.item()
+ loss_rcnn_box_val = RCNN_loss_bbox_val.item()
+ loss_val = rpn_loss_cls_val.mean() + rpn_loss_box_val.mean() \
+ + RCNN_loss_cls_val.mean() + RCNN_loss_bbox_val.mean()
+ loss_temp_val += loss_val.item()
+ val_info = {
+ 'loss_val': loss_temp_val,
+ 'loss_rpn_cls_val': loss_rpn_cls_val,
+ 'loss_rpn_box_val': loss_rpn_box_val,
+ 'loss_rcnn_cls_val': loss_rcnn_cls_val,
+ 'loss_rcnn_box_val': loss_rcnn_box_val
+ }
+ logger.add_scalar("logs_s_{}/val_losses".format(args.session), val_info, (epoch - 1) * iters_per_epoch + step)
save_name = os.path.join(output_dir, 'faster_rcnn_{}_{}_{}.pth'.format(args.session, epoch, step))
errors:
Loading pretrained weights from data/pretrained_model/resnet101_caffe.pth
[session 1][epoch 1][iter 0/1673] loss: 4.0778, lr: 1.00e-03
fg/bg=(119/393), time cost: 2.804799
rpn_cls: 0.6306, rpn_box: 0.0667, rcnn_cls: 2.7759, rcnn_box 0.6047
Traceback (most recent call last):
File "trainval_net.py", line 377, in <module>
data_val=next(data_iter_val)
File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 615, in __next__
batch = self.collate_fn([self.dataset[i] for i in indices])
File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate
return [default_collate(samples) for samples in transposed]
File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in <listcomp>
return [default_collate(samples) for samples in transposed]
File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 209, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 800 and 1067 in dimension 2 at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/TH/generic/THTensorMoreMath.cpp:1333
I would run the validation using a batch size of 1 instead of your normal batch size, that more closely resembles the results you would get during test time.
3?. You need to make sure that your training set doe not include your validation set, otherwise the added value of the validation set is negated.
I mention this because your code states to use voc_2007_trainval
for training and voc_2007_val
for validation. voc_2007_trainval
includes the validation set, so be careful there.
I have a try with this code in trainval_net.py @AlexanderHustinx
1.I want to run a validation every 500 steps. But all val data are needed to be run at a validation? Or just one batch size one validation?
- Could you review my code? Thanks
if args.dataset == "pascal_voc": args.imdb_name = "voc_2007_trainval" args.imdbval_name = "voc_2007_test" ############################### # validation dataset ############################### + args.imdbval_name_val="voc_2007_val" args.set_cfgs = ['ANCHOR_SCALES', '[8, 16, 32]', 'ANCHOR_RATIOS', '[0.5,1,2]', 'MAX_NUM_GT_BOXES', '20'] imdb, roidb, ratio_list, ratio_index = combined_roidb(args.imdb_name) + imdb_val, roidb_val, ratio_list_val, ratio_index_val = combined_roidb(args.imdbval_name_val) train_size = len(roidb) + val_size=len(roidb_val) sampler_batch = sampler(train_size, args.batch_size) + sampler_batch_val = sampler(val_size, args.batch_size) dataset = roibatchLoader(roidb, ratio_list, ratio_index, args.batch_size, \ imdb.num_classes, training=True) + dataset_val = roibatchLoader(roidb_val, ratio_list_val, ratio_index_val, args.batch_size, imdb_val.num_classes, training=False) dataloader = torch.utils.data.DataLoader(dataset,batch_size=args.batch_size,sampler=sampler_batch, num_workers=args.num_workers) + dataloader_val = torch.utils.data.DataLoader(dataset_val,batch_size=args.batch_size,sampler=sampler_batch_val, num_workers=args.num_workers) data_iter = iter(dataloader) + data_iter_val=iter(dataloader_val) + if step%500==0: + while True: + data_val=next(data_iter_val) + with torch.no_grad(): + im_val_data.resize_(data_val[0].size()).copy_(data_val[0]) + im_val_info.resize_(data_val[1].size()).copy_(data_val[1]) + gt_val_boxes.resize_(data_val[2].size()).copy_(data_val[2]) + num_val_boxes.resize_(data_val[3].size()).copy_(data_val[3]) + rois_val, cls_prob_val, bbox_pred_val, \ + rpn_loss_cls_val, rpn_loss_box_val, \ + RCNN_loss_cls_val, RCNN_loss_bbox_val, \ + rois_label_val = fasterRCNN(im_val_data, im_val_info, gt_val_boxes, num_val_boxes) + loss_rpn_cls_val = rpn_loss_cls_val.item() + loss_rpn_box_val = rpn_loss_box_val.item() + loss_rcnn_cls_val = RCNN_loss_cls_val.item() + loss_rcnn_box_val = RCNN_loss_bbox_val.item() + loss_val = rpn_loss_cls_val.mean() + rpn_loss_box_val.mean() \ + + RCNN_loss_cls_val.mean() + RCNN_loss_bbox_val.mean() + loss_temp_val += loss_val.item() + val_info = { + 'loss_val': loss_temp_val, + 'loss_rpn_cls_val': loss_rpn_cls_val, + 'loss_rpn_box_val': loss_rpn_box_val, + 'loss_rcnn_cls_val': loss_rcnn_cls_val, + 'loss_rcnn_box_val': loss_rcnn_box_val + } + logger.add_scalar("logs_s_{}/val_losses".format(args.session), val_info, (epoch - 1) * iters_per_epoch + step) save_name = os.path.join(output_dir, 'faster_rcnn_{}_{}_{}.pth'.format(args.session, epoch, step))
errors:
Loading pretrained weights from data/pretrained_model/resnet101_caffe.pth [session 1][epoch 1][iter 0/1673] loss: 4.0778, lr: 1.00e-03 fg/bg=(119/393), time cost: 2.804799 rpn_cls: 0.6306, rpn_box: 0.0667, rcnn_cls: 2.7759, rcnn_box 0.6047 Traceback (most recent call last): File "trainval_net.py", line 377, in <module> data_val=next(data_iter_val) File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 615, in __next__ batch = self.collate_fn([self.dataset[i] for i in indices]) File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate return [default_collate(samples) for samples in transposed] File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in <listcomp> return [default_collate(samples) for samples in transposed] File "/home/melody/anaconda3/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 209, in default_collate return torch.stack(batch, 0, out=out) RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 800 and 1067 in dimension 2 at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/TH/generic/THTensorMoreMath.cpp:1333
Hi, did you solve the problem for validation?
I think you may try these codes in trainval_net.py. It seems to work :
if args.resume:
load_name = os.path.join(output_dir,
'faster_rcnn_{}_{}_{}.pth'.format(args.checksession, args.checkepoch, args.checkpoint))
print("loading checkpoint %s" % (load_name))
checkpoint = torch.load(load_name)
#args.session = checkpoint['session']
#args.start_epoch = checkpoint['epoch']
#fasterRCNN.load_state_dict(checkpoint['model'])
#optimizer.load_state_dict(checkpoint['optimizer'])
#lr = optimizer.param_groups[0]['lr']
for k, v in fasterRCNN.state_dict().items():
if k in checkpoint['model']:
param = torch.from_numpy(np.asarray(checkpoint['model'][k].cpu()))
v.copy_(param)
else:
print('lose [in faster_cnn]: {}'.format(k))
So if you try and retrain the resnet101 (faster_rcnn_1_10_9771.pth) you get the following error:
It's the optimizer version knowledge of the parameters is different than what it actually is. Also, you get the same error if you use, /home/jonathan/faster-rcnn.pytorch/faster_rcnn_1_6_9771.pth