msracver / FCIS

Fully Convolutional Instance-aware Semantic Segmentation
MIT License

mxnet.base.MXNetError: [16:31:18] src/operator/nn/./cudnn/cudnn_convolution-inl.h:449: Check failed: e == CUDNN_STATUS_SUCCESS (3 vs. 0) cuDNN: CUDNN_STATUS_BAD_PARAM #147

Open CPFelix opened 5 years ago

CPFelix commented 5 years ago

I built MXNet, and when I ran FCIS I got the following problem. I suppose it is caused by the cuDNN version (CUDA 8.0, cuDNN 6).

loading annotations into memory...
Done (t=0.52s)
creating index...
index created!
num_images 6048
prepare gt_sdsdb using 1.19558501244 seconds
generate cache_seg_inst using 0.203454017639 seconds
append flipped images to roidb
filtered 0 roidb entries: 12096 -> 12096
providing maximum shape [('data', (1, 3, 756, 1008)), ('gt_boxes', (1, 100, 5)), ('gt_masks', (1, 100, 756, 1008))] [('proposal_label', (1, 45360)), ('proposal_bbox_target', (1, 60, 48, 63)), ('proposal_bbox_weight', (1, 60, 48, 63))]
data shape: {'data': (1L, 3L, 756L, 1008L), 'gt_boxes': (1L, 9L, 5L), 'gt_masks': (9L, 756L, 1008L), 'im_info': (1L, 3L), 'proposal_bbox_target': (1L, 60L, 48L, 63L), 'proposal_bbox_weight': (1L, 60L, 48L, 63L), 'proposal_label': (1L, 45360L)}
lr 0.0001 lr_epoch_diff [5.0] lr_iters [60480]
Traceback (most recent call last):
  File "experiments/fcis/fcis_end2end_train_test.py", line 13, in <module>
    train_end2end.main()
  File "experiments/fcis/../../fcis/train_end2end.py", line 187, in main
    config.TRAIN.lr, config.TRAIN.lr_step)
  File "experiments/fcis/../../fcis/train_end2end.py", line 179, in train_net
    arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
  File "experiments/fcis/../../fcis/core/module.py", line 961, in fit
    self.update_metric(eval_metric, data_batch.label)
  File "experiments/fcis/../../fcis/core/module.py", line 1055, in update_metric
    self._curr_module.update_metric(eval_metric, labels)
  File "experiments/fcis/../../fcis/core/module.py", line 669, in update_metric
    self._exec_group.update_metric(eval_metric, labels)
  File "experiments/fcis/../../fcis/core/DataParallelExecutorGroup.py", line 529, in update_metric
    eval_metric.update(labels, texec.outputs)
  File "/home/scau2/anaconda2/lib/python2.7/site-packages/mxnet/metric.py", line 318, in update
    metric.update(labels, preds)
  File "experiments/fcis/../../fcis/core/metric.py", line 45, in update
    pred_label = mx.ndarray.argmax_channel(pred).asnumpy().astype('int32')
  File "/home/scau2/anaconda2/lib/python2.7/site-packages/mxnet/ndarray/ndarray.py", line 1980, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/scau2/anaconda2/lib/python2.7/site-packages/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [16:31:18] src/operator/nn/./cudnn/cudnn_convolution-inl.h:449: Check failed: e == CUDNN_STATUS_SUCCESS (3 vs. 0) cuDNN: CUDNN_STATUS_BAD_PARAM

Stack trace returned 10 entries:
[bt] (0) /home/scau2/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7fd55ea8c2bb]
[bt] (1) /home/scau2/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7fd55ea8ce28]
[bt] (2) /home/scau2/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(mxnet::op::CuDNNConvolutionOp::InitDescriptors(std::vector<nnvm::TShape, std::allocator > const&, std::vector<nnvm::TShape, std::allocator > const&, cudnnDataType_t, cudnnDataType_t)+0x149a) [0x7fd56393523a]
[bt] (3) /home/scau2/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x5bd0d7b) [0x7fd563912d7b]
[bt] (4) /home/scau2/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(void mxnet::op::ConvolutionCompute(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator > const&, std::vector<mxnet::OpReqType, std::allocator > const&, std::vector<mxnet::TBlob, std::allocator > const&)+0xe30) [0x7fd563917910]
[bt] (5) /home/scau2/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool)+0x59) [0x7fd561caa239]
[bt] (6) /home/scau2/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x3f32f60) [0x7fd561c74f60]
[bt] (7) /home/scau2/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock)+0x8f5) [0x7fd561bc39c5]
[bt] (8) /home/scau2/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>, std::shared_ptr const&)+0xeb) [0x7fd561bda10b]
[bt] (9) /home/scau2/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(std::_Function_handler<void (std::shared_ptr), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr&&)+0x4e) [0x7fd561bda37e]
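
Note that the error only surfaces at the `asnumpy()` call in `metric.py` because MXNet executes operators asynchronously; the convolution that actually hits `CUDNN_STATUS_BAD_PARAM` runs earlier in the graph. Below is a minimal diagnostic sketch, not part of the original report, assuming an MXNet 1.x build with the imperative NDArray API. It checks which MXNet build is being imported, forces synchronous execution so a traceback points at the failing operator, and runs a bare cuDNN convolution to verify that the CUDA 8.0 / cuDNN 6 setup works at all:

```python
import os

# Must be set before importing mxnet: run the engine synchronously so the
# error is raised at the operator that fails, not at a later asnumpy() call.
os.environ['MXNET_ENGINE_TYPE'] = 'NaiveEngine'
# Disable cuDNN convolution autotuning, a frequent source of descriptor
# problems on older cuDNN versions.
os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'

import mxnet as mx

# Confirm the custom-built MXNet (compiled against CUDA 8.0 / cuDNN 6) is the
# copy being imported, rather than a pip-installed one.
print(mx.__version__)
print(mx.__file__)

# Bare cuDNN convolution on the GPU; if this also raises
# CUDNN_STATUS_BAD_PARAM, the problem lies in the CUDA/cuDNN installation
# rather than in FCIS itself.
data = mx.nd.ones((1, 3, 224, 224), ctx=mx.gpu(0))
weight = mx.nd.ones((16, 3, 3, 3), ctx=mx.gpu(0))
out = mx.nd.Convolution(data=data, weight=weight, kernel=(3, 3),
                        num_filter=16, no_bias=True)
mx.nd.waitall()
print(out.shape)
```

The same two environment variables can also be exported in the shell before launching `experiments/fcis/fcis_end2end_train_test.py` to get a more precise location for the failure during training.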

ccxietao commented 5 years ago

Hello, I still cannot figure out why mAP = 0. I see you ran into the same problem; how did you solve it? How should the hyperparameters be adjusted?