ijkguo / mx-rcnn

Parallel Faster R-CNN implementation with MXNet.

MEM/HANG: rcnn demo.py error #38

Closed niluanwudidadi closed 7 years ago

niluanwudidadi commented 7 years ago

When I run `python demo.py --prefix final --epoch 0 --image myimage.jpg --gpu 0`, the error below occurs. Can you help me? Thank you! My setup: cuDNN v5.0 with CUDA 7.5, an NVIDIA GTX 980, and Ubuntu 14.04.

```
/home/yx/mxnet/dmlc-core/include/dmlc/logging.h:235: [23:33:31] src/operator/./convolution-inl.h:299: Check failed: (param_.workspace) >= (required_size) Minimum workspace size: 1228800000 Bytes Given: 1073741824 Bytes
[23:33:31] /home/yx/mxnet/dmlc-core/include/dmlc/logging.h:235: [23:33:31] src/engine/./threadedengine.h:306: [23:33:31] src/operator/./convolution-inl.h:299: Check failed: (param_.workspace) >= (required_size) Minimum workspace size: 1228800000 Bytes Given: 1073741824 Bytes
An fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
terminate called after throwing an instance of 'dmlc::Error'
  what():  [23:33:31] src/engine/./threadedengine.h:306: [23:33:31] src/operator/./convolution-inl.h:299: Check failed: (param_.workspace) >= (required_size) Minimum workspace size: 1228800000 Bytes Given: 1073741824 Bytes
An fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
```

alysazhang commented 7 years ago

1. Set an environment variable: `export MXNET_ENGINE_TYPE="NaiveEngine"`
2. Set the convolutional layer workspace: edit symbol_vgg.py and add `workspace=2048` to all convolutional layers (see the sketch below).
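For reference, suggestion (2) amounts to something like the following sketch. The layer names here are illustrative rather than the exact contents of symbol_vgg.py; the point is just that every `mx.symbol.Convolution` call gets an explicit `workspace` (in MB):

```python
import mxnet as mx

# Illustrative pattern only (not the exact file contents): give each
# convolution an explicit temporary-workspace budget in MB so the operator
# has enough scratch GPU memory for its convolution algorithm.
data = mx.symbol.Variable(name="data")
conv1_1 = mx.symbol.Convolution(
    data=data, kernel=(3, 3), pad=(1, 1), num_filter=64,
    workspace=2048, name="conv1_1")
relu1_1 = mx.symbol.Activation(data=conv1_1, act_type="relu", name="relu1_1")
# ...repeat workspace=2048 for conv1_2, conv2_1, and every other Convolution.
```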

niluanwudidadi commented 7 years ago

@alysazhang Thank you very much, that solved the problem. But I have hit another problem when running `python train_end2end.py`.

I have downsized the VOC2007 training set to 1096 images, and my GTX 980 has 4 GB of memory, but it still reports out of memory. How can I solve this? Thank you!

```
yx@yx-X8DTL:~/mxnet/example/mx-rcnn-master$ python train_end2end.py
Called with argument: Namespace(begin_epoch=0, dataset='PascalVOC', dataset_path='data/VOCdevkit', end_epoch=10, epoch=1, flip=True, frequent=20, gpus='0', image_set='2007_trainval', kvstore='device', lr=0.001, lr_step=50000, network='vgg', prefix='model/e2e', pretrained='model/vgg16', resume=False, root_path='data', work_load_list=None)
{'EPS': 1e-14, 'IMAGE_STRIDE': 0, 'PIXEL_MEANS': array([[[ 123.68 , 116.779, 103.939]]]), 'SCALES': [(600, 1000)],
 'TEST': {'BATCH_IMAGES': 1, 'HAS_RPN': False, 'NMS': 0.3, 'RPN_MIN_SIZE': 16, 'RPN_NMS_THRESH': 0.7, 'RPN_POST_NMS_TOP_N': 300, 'RPN_PRE_NMS_TOP_N': 6000},
 'TRAIN': {'BATCH_IMAGES': 1, 'BATCH_ROIS': 128, 'BBOX_MEANS': [0.0, 0.0, 0.0, 0.0], 'BBOX_NORMALIZATION_PRECOMPUTED': True, 'BBOX_REGRESSION_THRESH': 0.5, 'BBOX_STDS': [0.1, 0.1, 0.2, 0.2], 'BBOX_WEIGHTS': array([ 1., 1., 1., 1.]), 'BG_THRESH_HI': 0.5, 'BG_THRESH_LO': 0.0, 'END2END': True, 'FG_FRACTION': 0.25, 'FG_THRESH': 0.5, 'RPN_BATCH_SIZE': 256, 'RPN_BBOX_WEIGHTS': [1.0, 1.0, 1.0, 1.0], 'RPN_CLOBBER_POSITIVES': False, 'RPN_FG_FRACTION': 0.5, 'RPN_MIN_SIZE': 16, 'RPN_NEGATIVE_OVERLAP': 0.3, 'RPN_NMS_THRESH': 0.7, 'RPN_POSITIVE_OVERLAP': 0.7, 'RPN_POSITIVE_WEIGHT': -1.0, 'RPN_POST_NMS_TOP_N': 6000, 'RPN_PRE_NMS_TOP_N': 12000}}
num_images 1096
voc_2007_trainval gt roidb loaded from data/cache/voc_2007_trainval_gt_roidb.pkl
append flipped images to roidb
[20:41:11] src/engine/engine.cc:36: MXNet start using engine: NaiveEngine
providing maximum shape [('data', (1, 3, 1000, 1000)), ('gt_boxes', (1, 100, 5))] [('label', (1, 34596)), ('bbox_target', (1, 36, 62, 62)), ('bbox_weight', (1, 36, 62, 62))]
output shape {'bbox_loss_reshape_output': (1L, 128L, 84L), 'blockgrad0_output': (1L, 128L), 'cls_prob_reshape_output': (1L, 128L, 21L), 'rpn_bbox_loss_output': (1L, 36L, 37L, 50L), 'rpn_cls_prob_output': (1L, 2L, 333L, 50L)}
[20:41:15] /home/yx/mxnet/dmlc-core/include/dmlc/./logging.h:235: [20:41:15] src/storage/./pooled_storage_manager.h:79: cudaMalloc failed: out of memory
Traceback (most recent call last):
  File "train_end2end.py", line 185, in <module>
    main()
  File "train_end2end.py", line 182, in main
    lr=args.lr, lr_step=args.lr_step)
  File "train_end2end.py", line 133, in train_net
    arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.7.0-py2.7.egg/mxnet/module/base_module.py", line 379, in fit
    self.update()
  File "/home/yx/mxnet/example/mx-rcnn-master/rcnn/core/module.py", line 183, in update
    self._curr_module.update()
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.7.0-py2.7.egg/mxnet/module/module.py", line 419, in update
    kvstore=self._kvstore)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.7.0-py2.7.egg/mxnet/model.py", line 115, in _update_params
    updater(index*num_device+k, g, w)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.7.0-py2.7.egg/mxnet/optimizer.py", line 822, in updater
    optimizer.update(index, weight, grad, states[index])
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.7.0-py2.7.egg/mxnet/optimizer.py", line 298, in update
    grad = grad * self.rescale_grad
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.7.0-py2.7.egg/mxnet/ndarray.py", line 138, in __mul__
    return multiply(self, other)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.7.0-py2.7.egg/mxnet/ndarray.py", line 744, in multiply
    None)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.7.0-py2.7.egg/mxnet/ndarray.py", line 655, in _ufunc_helper
    return lfn_scalar(lhs, float(rhs))
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.7.0-py2.7.egg/mxnet/ndarray.py", line 1263, in generic_ndarray_function
    c_array(ctypes.c_char_p, [c_str(str(i)) for i in kwargs.values()])))
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.7.0-py2.7.egg/mxnet/base.py", line 77, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [20:41:15] src/storage/./pooled_storage_manager.h:79: cudaMalloc failed: out of memory
terminate called without an active exception
Aborted (core dumped)
```

ijkguo commented 7 years ago

6 GB of GPU memory is enough for all experiments, including Fast R-CNN; 4 GB is not. Please use the py-faster-rcnn e2e training scheme (the alternating scheme uses 11 GB).
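If 4 GB is all that is available, a possible (untested) workaround is to shrink the training scale and the number of proposals kept after NMS, which drive most of the activation memory in the log above. A minimal sketch, assuming the config object is the one printed at startup (imported here as `rcnn.config.config`; the exact values are illustrative and will cost some accuracy):

```python
# Hypothetical memory-saving tweaks. The log shows the defaults
# SCALES=[(600, 1000)] and TRAIN.RPN_POST_NMS_TOP_N=6000; lowering both
# reduces feature-map size and the number of RoIs held in GPU memory.
from rcnn.config import config

config.SCALES = [(400, 600)]             # shorter side 400 px, longer side capped at 600 px
config.TRAIN.RPN_POST_NMS_TOP_N = 2000   # keep fewer proposals after NMS during training
```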