jz462 / Large-Scale-VRD.pytorch

Implementation for the AAAI2019 paper "Large-scale Visual Relationship Understanding"
https://arxiv.org/abs/1804.10660
MIT License

Error training relation detection model #9

Closed. ronsoohyeong closed this issue 5 years ago.

ronsoohyeong commented 5 years ago

Hi, I encountered an error related to roidb when trying to train the relationship network using a VGG16 backbone as follows. Can you please advise on how to fix it? Thanks in advance.

Environment: PyTorch 0.4.0, CUDA 9, Python 3.6
Command: python tools/train_net_step_rel.py --dataset vg --cfg configs/e2e_relcnn_VGG16_8_epochs_vg_y_loss_only.yaml --nw 8 --use_tfboard

Result:

Called with args:
Namespace(batch_size=None, cfg_file='configs/e2e_relcnn_VGG16_8_epochs_vg_y_loss_only.yaml', cuda=True, dataset='vg', disp_interval=20, iter_size=1, load_ckpt=None, load_detectron=None, lr=None, lr_decay_gamma=None, no_save=False, num_workers=8, optimizer=None, resume=False, set_cfgs=[], start_step=0, use_tfboard=True)
effective_batch_size = batch_size * iter_size = 8 * 1
Adaptive config changes:
    effective_batch_size: 8 --> 8
    NUM_GPUS: 8 --> 1
    IMS_PER_BATCH: 1 --> 8
Adjust BASE_LR linearly according to batch_size change:
    BASE_LR: 0.01 --> 0.01
Adjust SOLVER.STEPS and SOLVER.MAX_ITER linearly based on effective_batch_size change:
    SOLVER.STEPS: [0, 83631, 111508] --> [0, 83631, 111508]
    SOLVER.MAX_ITER: 125446 --> 125446
Number of data loading threads: 8
loading annotations into memory...
Done (t=2.70s)
creating index...
index created!
INFO json_dataset_rel.py: 391: Loading cached gt_roidb from /path/Large-Scale-VRD.pytorch/data/cache/vg_train_rel_gt_roidb.pkl
INFO roidb_rel.py: 48: Appending horizontally-flipped training examples...
INFO roidb_rel.py: 50: Loaded dataset: vg_train
INFO roidb_rel.py: 148: Filtered 0 roidb entries: 125446 -> 125446
INFO roidb_rel.py: 67: Computing image aspect ratios and ordering the ratios...
INFO roidb_rel.py: 69: done
INFO roidb_rel.py: 73: Computing bounding-box regression targets...
INFO roidb_rel.py: 75: done
INFO train_net_step_rel.py: 232: 125446 roidb entries
INFO train_net_step_rel.py: 233: Takes 90.15 sec(s) to construct roidb
INFO utils_any2vec.py: 172: loading projection weights from /path/Large-Scale-VRD.pytorch/data/word2vec_model/GoogleNews-vectors-negative300.bin
/home/user/anaconda3/envs/pytorch0.4.0/lib/python3.6/site-packages/smart_open/smart_open_lib.py:398: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
INFO utils_any2vec.py: 234: loaded (3000000, 300) matrix from /path/Large-Scale-VRD.pytorch/data/word2vec_model/GoogleNews-vectors-negative300.bin
INFO model_builder_rel.py: 72: Model loaded.
INFO model_builder_rel.py: 78: Wiki words converted to lowercase.
INFO model_builder_rel.py: 102: Object label vectors loaded.
INFO model_builder_rel.py: 110: Predicate label vectors loaded.
INFO sparse_targets_rel.py: 48: Frequency bias tables loaded.
INFO model_builder_rel.py: 213: loading pretrained weights from detection_models/vg/VGG16/model_step479999.pth
INFO model_builder_rel.py: 191: loading prd pretrained weights from detection_models/vg/VGG16/model_step479999.pth
INFO train_net_step_rel.py: 384: Training starts !
INFO net.py: 129: Changing learning rate 0.000000 -> 0.003333
Traceback (most recent call last):
  File "tools/train_net_step_rel.py", line 463, in <module>
    main()
  File "tools/train_net_step_rel.py", line 433, in main
    net_outputs = maskRCNN(input_data)
  File "/home/user/anaconda3/envs/pytorch0.4.0/lib/python3.6/site-packages/torch/nn/modules/module.py", line 492, in __call__
    result = self.forward(*input, **kwargs)
  File "/path/Large-Scale-VRD.pytorch/lib/nn/parallel/data_parallel.py", line 108, in forward
    outputs = [self.module(*inputs[0], **kwargs[0])]
  File "/home/user/anaconda3/envs/pytorch0.4.0/lib/python3.6/site-packages/torch/nn/modules/module.py", line 492, in __call__
    result = self.forward(*input, **kwargs)
  File "/path/Large-Scale-VRD.pytorch/lib/modeling/model_builder_rel.py", line 232, in forward
    return self._forward(data, im_info, dataset_name, roidb, use_gt_labels, **rpn_kwargs)
  File "/path/Large-Scale-VRD.pytorch/lib/modeling/model_builder_rel.py", line 271, in _forward
    rel_ret = self.RelPN(det_rois, det_labels, det_scores, im_info, dataset_name, roidb)
  File "/home/user/anaconda3/envs/pytorch0.4.0/lib/python3.6/site-packages/torch/nn/modules/module.py", line 492, in __call__
    result = self.forward(*input, **kwargs)
  File "/path/Large-Scale-VRD.pytorch/lib/modeling/relpn_heads.py", line 58, in forward
    assert len(roidb) == 1
AssertionError

jz462 commented 5 years ago

Hi @ronsoohyeong,

From your outputs:

effective_batch_size = batch_size * iter_size = 8 * 1
Adaptive config changes:
    effective_batch_size: 8 --> 8
    NUM_GPUS: 8 --> 1
    IMS_PER_BATCH: 1 --> 8

It seems that you are running on only 1 GPU while keeping the effective batch size at 8. In that case all 8 images are fed to the single GPU, so len(roidb) is 8 instead of 1, which triggers the assertion error you are seeing. I strongly suggest using 8 GPUs: it not only resolves this error, but also keeps the training time tolerable.
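
To make the arithmetic explicit, here is a minimal sketch of what the "Adaptive config changes" in your log imply; the images_per_gpu helper is hypothetical and not a function in this repo, it only mirrors the printed config adjustment:

```python
# Hypothetical sketch (not the repo's code): how the effective batch size
# is spread across GPUs, mirroring the "Adaptive config changes" in the log.
def images_per_gpu(effective_batch_size, num_gpus):
    # IMS_PER_BATCH: how many images (and hence roidb entries) each GPU's
    # forward pass receives per step.
    return effective_batch_size // num_gpus

# Intended setup: 8 GPUs, 1 image each -> len(roidb) == 1, the assertion holds.
assert images_per_gpu(8, 8) == 1
# The single-GPU run above: len(roidb) == 8, so `assert len(roidb) == 1`
# in lib/modeling/relpn_heads.py raises AssertionError.
assert images_per_gpu(8, 1) == 8
```

In other words, the relation proposal head in relpn_heads.py is written to see exactly one image per GPU, so the effective batch size has to be spread across GPUs rather than stacked onto a single one.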

Hope this helps!

Ji

ronsoohyeong commented 5 years ago

Thanks. It worked with 8 GPUs.