jwyang / faster-rcnn.pytorch

A faster pytorch implementation of faster r-cnn
MIT License

Training fails with batch size > 1 #446

deepc94 opened this issue 5 years ago (Status: Open)

deepc94 commented 5 years ago

When I use batch size 1 with multiple TITAN X GPUs (4 GPUs, 12 GB each), it trains perfectly fine, but when I increase the batch size to 8, 12, or even 2, training fails with the following trace after loading the weights:

```
Called with args:
Namespace(batch_size=12, checkepoch=10, checkpoint=0, checkpoint_interval=10000,
          checksession=20, class_agnostic=False, cuda=True, dataset='kaist',
          disp_interval=100, large_scale=False, lr=0.001, lr_decay_gamma=0.1,
          lr_decay_step=15, mGPUs=True, max_epochs=45, net='res50', num_workers=0,
          optimizer='sgd', resume=False,
          save_dir='/mnt/nfs/scratch1/dchakraborty/infrared/models',
          session=10, start_epoch=1, use_tfboard=False)
Using config:
{'ANCHOR_RATIOS': [0.5, 1, 2],
 'ANCHOR_SCALES': [0.05, 0.1, 0.25, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4],
 'CROP_RESIZE_WITH_MAX_POOL': True,
 'CUDA': False,
 'DATA_DIR': '/mnt/nfs/scratch1/dchakraborty/Kaist',
 'DEDUP_BOXES': 0.0625,
 'EPS': 1e-14,
 'EXP_DIR': 'res50',
 'FEAT_STRIDE': [16],
 'GPU_ID': 0,
 'MATLAB': 'matlab',
 'MAX_NUM_GT_BOXES': 30,
 'MOBILENET': {'DEPTH_MULTIPLIER': 1.0, 'FIXED_LAYERS': 5, 'REGU_DEPTH': False,
               'WEIGHT_DECAY': 4e-05},
 'PIXEL_MEANS': array([[[102.9801, 115.9465, 122.7717]]]),
 'POOLING_MODE': 'crop',
 'POOLING_SIZE': 7,
 'RESNET': {'FIXED_BLOCKS': 1, 'MAX_POOL': False},
 'RNG_SEED': 3,
 'ROOT_DIR': '/mnt/nfs/scratch1/dchakraborty',
 'TEST': {'BBOX_REG': True, 'HAS_RPN': True, 'MAX_SIZE': 1000, 'MODE': 'nms', 'NMS': 0.3,
          'PROPOSAL_METHOD': 'gt', 'RPN_MIN_SIZE': 16, 'RPN_NMS_THRESH': 0.7,
          'RPN_POST_NMS_TOP_N': 300, 'RPN_PRE_NMS_TOP_N': 6000, 'RPN_TOP_N': 5000,
          'SCALES': [600], 'SVM': False},
 'TRAIN': {'ASPECT_GROUPING': False, 'BATCH_SIZE': 256,
           'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
           'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
           'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
           'BBOX_NORMALIZE_TARGETS': True, 'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
           'BBOX_REG': True, 'BBOX_THRESH': 0.5, 'BG_THRESH_HI': 0.5, 'BG_THRESH_LO': 0.0,
           'BIAS_DECAY': False, 'BN_TRAIN': False, 'DISPLAY': 20, 'DOUBLE_BIAS': False,
           'FG_FRACTION': 0.25, 'FG_THRESH': 0.5, 'GAMMA': 0.1, 'HAS_RPN': True,
           'IMS_PER_BATCH': 1, 'LEARNING_RATE': 0.001, 'MAX_SIZE': 1000, 'MOMENTUM': 0.9,
           'PROPOSAL_METHOD': 'gt', 'RPN_BATCHSIZE': 256,
           'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0], 'RPN_CLOBBER_POSITIVES': False,
           'RPN_FG_FRACTION': 0.5, 'RPN_MIN_SIZE': 8, 'RPN_NEGATIVE_OVERLAP': 0.3,
           'RPN_NMS_THRESH': 0.7, 'RPN_POSITIVE_OVERLAP': 0.7, 'RPN_POSITIVE_WEIGHT': -1.0,
           'RPN_POST_NMS_TOP_N': 2000, 'RPN_PRE_NMS_TOP_N': 12000, 'SCALES': [600],
           'SNAPSHOT_ITERS': 5000, 'SNAPSHOT_KEPT': 3, 'SNAPSHOT_PREFIX': 'res50_faster_rcnn',
           'STEPSIZE': [30000], 'SUMMARY_INTERVAL': 180, 'TRIM_HEIGHT': 600, 'TRIM_WIDTH': 600,
           'TRUNCATED': False, 'USE_ALL_GT': True, 'USE_FLIPPED': True, 'USE_GT': False,
           'WEIGHT_DECAY': 0.0001},
 'USE_GPU_NMS': True}
Loaded dataset combined_train for training
Appending horizontally-flipped training examples...
wrote gt roidb to /mnt/nfs/scratch1/dchakraborty/Kaist/cache/combined_train_gt_roidb.pkl
before filtering, there are 100368 images...
after filtering, there are 44316 images...
44316 roidb entries
Loading pretrained weights from /home/dchakraborty/.torch/models/resnet50-19c8e357.pth
Traceback (most recent call last):
  File "trainval_net.py", line 338, in <module>
    rois_label = fasterRCNN(im_data, im_info, gt_boxes, num_boxes)
  File "/home/dchakraborty/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dchakraborty/.local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 114, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/dchakraborty/.local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 124, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/dchakraborty/.local/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
    raise output
RuntimeError: bool value of Tensor with more than one value is ambiguous
```

Please help identify the issue if possible. Let me know if you need any more info.
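
For reference, PyTorch raises this particular error whenever a multi-element tensor is implicitly converted to a single Python bool, e.g. in an `if` condition inside a model's forward pass. A minimal sketch reproducing the message (an illustration only, not this repo's code):

```python
import torch

t = torch.tensor([1.0, -2.0])

# Implicitly reducing a multi-element tensor to one bool is ambiguous;
# both of these raise "RuntimeError: bool value of Tensor with more
# than one value is ambiguous":
#   if t > 0: ...
#   bool(t)

# The unambiguous forms reduce explicitly first:
if (t > 0).any():  # or .all(), depending on the intended semantics
    print("at least one element is positive")
```

One plausible reason batch size 1 works is that the offending comparison then involves a single-element tensor, for which the implicit bool() is allowed.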

kentaroy47 commented 5 years ago

Does it work if you train with a single GPU? Multi-GPU training sometimes does not work.

deepc94 commented 5 years ago

@kentaroy47 Yes, it works on a single GPU, and it also works on multiple GPUs when the batch size is 1 or equal to the number of GPUs used. However, it doesn't work with any batch size outside the set {1, nGPU}.

kentaroy47 commented 5 years ago

I think the batch size should be a multiple of the number of GPUs used, or it cannot be distributed.
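
For context, `nn.DataParallel` scatters the input along dim 0, so each replica sees roughly batch_size / nGPU samples per forward pass. A minimal sketch with a stand-in module (assuming 4 visible GPUs; any `nn.Module` is scattered the same way):

```python
import torch
import torch.nn as nn

# Stand-in for the detector; DataParallel treats any nn.Module alike.
model = nn.DataParallel(nn.Conv2d(3, 8, 3).cuda(), device_ids=[0, 1, 2, 3])

# A batch of 8 on 4 GPUs gives each replica a chunk of size 2 along dim 0;
# per-replica outputs are gathered back onto device_ids[0].
images = torch.randn(8, 3, 64, 64).cuda()
out = model(images)
print(out.shape)  # torch.Size([8, 8, 62, 62])
```

If the batch size is not divisible by the number of GPUs, the chunks are uneven, which code that assumes a fixed per-GPU batch size may not tolerate.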

deepc94 commented 5 years ago

I'm aware of that, but it doesn't work with any multiple of nGPU other than 1. That is, with 4 GPUs I can use a batch size of 1 or 4, but nothing else (8, 16, etc.).

AtriSaxena commented 5 years ago

I am also using a single GPU, and when I set the batch size greater than 1 it gives an error. Training with batch size 1 works fine, but it takes a lot of time.

kentaroy47 commented 5 years ago

@AtriSaxena You should be able to set a larger batch size (if your GPU has enough memory). What kind of error do you get?

AtriSaxena commented 5 years ago

I solved the problem. I was using CUDA 10 with PyTorch 1.1 and was getting a dimension error in the minibatch. When I switched to PyTorch 1.0.0, the problem was solved.
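
Since the fix here was pinning the framework version, a quick environment check before training can save debugging time; a small sketch:

```python
import torch

# This codebase is sensitive to the PyTorch version (see above: the
# minibatch dimension error disappeared after downgrading 1.1 -> 1.0.0).
print(torch.__version__)          # e.g. '1.0.0'
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.device_count())  # number of visible GPUs
```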

devendraswamy commented 4 years ago

When I ran training with batch size 2 on my own dataset, my system restarted. Any idea why? Please help.