msracver / Relation-Networks-for-Object-Detection

Relation Networks for Object Detection
MIT License

mxnet.base.MXNetError: NaiveEngine only support synchronize Push so far #13

Closed: Kongsea closed this issue 6 years ago

Kongsea commented 6 years ago

After training for several epochs, it raised the following error:

Epoch[7] Batch [2280]   Speed: 3.99 samples/sec Train-RPNAcc=0.995009,  RPNLogLoss=0.016979,    RPNL1Loss=0.034412, RCNNAcc=0.797916,   RCNNLogLoss=0.433694,   RCNNL1Loss=0.423676,    
Epoch[7] Batch [2300]   Speed: 3.96 samples/sec Train-RPNAcc=0.995029,  RPNLogLoss=0.016926,    RPNL1Loss=0.034388, RCNNAcc=0.797866,   RCNNLogLoss=0.433422,   RCNNL1Loss=0.423580,    
Epoch[7] Batch [2320]   Speed: 3.69 samples/sec Train-RPNAcc=0.995011,  RPNLogLoss=0.017027,    RPNL1Loss=0.034292, RCNNAcc=0.798346,   RCNNLogLoss=0.432565,   RCNNL1Loss=0.422573,
[17:51:53] /home/fallingdust/workspace/mxnet/dmlc-core/include/dmlc/./logging.h:308: [17:51:53] src/engine/naive_engine.cc:168: Check failed: this->req_completed_ NaiveEngine only support synchronize Push so far

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine11NaiveEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKc+0x3b3) [0x7f921cb635a3]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine11NaiveEngine4PushEPNS0_3OprENS_7ContextEib+0x8f) [0x7f921cb644af]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet4exec13GraphExecutor6RunOpsEbmm+0x724) [0x7f921cc08e84]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so(MXExecutorForward+0x11) [0x7f921cb9ab81]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f9230c02e40]
[bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f9230c028ab]
[bt] (6) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f9230e123df]
[bt] (7) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x11d82) [0x7f9230e16d82]
[bt] (8) python(PyObject_Call+0x43) [0x4b0c93]
[bt] (9) python(PyEval_EvalFrameEx+0x602f) [0x4c9f9f]

Traceback (most recent call last):
  File "experiments/relation_rcnn/rcnn_end2end_train_test.py", line 21, in <module>
    train_end2end.main()
  File "experiments/relation_rcnn/../../relation_rcnn/train_end2end.py", line 193, in main
    config.TRAIN.begin_epoch, config.TRAIN.end_epoch, config.TRAIN.lr, config.TRAIN.lr_step)
  File "experiments/relation_rcnn/../../relation_rcnn/train_end2end.py", line 186, in train_net
    arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
  File "experiments/relation_rcnn/../../relation_rcnn/core/module.py", line 999, in fit
    self.forward_backward(data_batch)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/module/base_module.py", line 191, in forward_backward
    self.forward(data_batch, is_train=True)
  File "experiments/relation_rcnn/../../relation_rcnn/core/module.py", line 1074, in forward
    self._curr_module.forward(data_batch, is_train=is_train)
  File "experiments/relation_rcnn/../../relation_rcnn/core/module.py", line 554, in forward
    self._exec_group.forward(data_batch, is_train)
  File "experiments/relation_rcnn/../../relation_rcnn/core/DataParallelExecutorGroup.py", line 360, in forward
    exec_.forward(is_train=is_train)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/executor.py", line 150, in forward
    ctypes.c_int(int(is_train))))
  File "/usr/local/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/base.py", line 146, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:51:53] src/engine/naive_engine.cc:168: Check failed: this->req_completed_ NaiveEngine only support synchronize Push so far

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine11NaiveEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKc+0x3b3) [0x7f921cb635a3]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine11NaiveEngine4PushEPNS0_3OprENS_7ContextEib+0x8f) [0x7f921cb644af]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet4exec13GraphExecutor6RunOpsEbmm+0x724) [0x7f921cc08e84]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so(MXExecutorForward+0x11) [0x7f921cb9ab81]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f9230c02e40]
[bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f9230c028ab]
[bt] (6) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f9230e123df]
[bt] (7) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x11d82) [0x7f9230e16d82]
[bt] (8) python(PyObject_Call+0x43) [0x4b0c93]
[bt] (9) python(PyEval_EvalFrameEx+0x602f) [0x4c9f9f]

[17:51:53] src/engine/naive_engine.cc:55: Engine shutdown

I trained the network on just one GPU by setting CUDA_VISIBLE_DEVICES=0, although I have two GPUs. Please give me some advice or help to fix it. Thank you.

chengdazhi commented 6 years ago

Hi, I can see that you are using mxnet 1.0.0. Our README, however, suggests using the official 1.1.0 version, or the newest 1.2.0 (no guarantees). If using a different version of mxnet causes too much trouble for you, I suggest that you do not use NaiveEngine, as we have never tested this engine type. Maybe you can try the default MXNET_ENGINE_TYPE, which is ThreadedEnginePerDevice.
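A minimal sketch of the engine suggestion above, assuming the training script is launched from Python and that the variable is set before `import mxnet` (the engine type is read when the library loads). The helper name `select_engine` is hypothetical and not part of this repository; `MXNET_ENGINE_TYPE` and the three engine names are MXNet's own.

```python
import os

# Engine names accepted by MXNET_ENGINE_TYPE.
KNOWN_ENGINES = ("NaiveEngine", "ThreadedEngine", "ThreadedEnginePerDevice")


def select_engine(engine_type="ThreadedEnginePerDevice"):
    """Hypothetical helper: choose the MXNet engine via the environment.

    Must run before `import mxnet`; changing the variable after the
    library has loaded has no effect on the running engine.
    """
    if engine_type not in KNOWN_ENGINES:
        raise ValueError("unknown engine type: %r" % (engine_type,))
    os.environ["MXNET_ENGINE_TYPE"] = engine_type
    return os.environ["MXNET_ENGINE_TYPE"]


# Pick the default threaded engine, then import mxnet afterwards.
select_engine()
# import mxnet as mx  # left commented so the sketch runs without mxnet
```

Equivalently, the variable can be left unset (or removed from the shell profile), since ThreadedEnginePerDevice is the default.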

Kongsea commented 6 years ago

After changing to the default MXNET_ENGINE_TYPE, or upgrading to mxnet 1.2.0, it works well now. Thank you.

Kongsea commented 6 years ago

After upgrading to mxnet 1.2.0, this error sometimes appears again when MXNET_ENGINE_TYPE is set to NaiveEngine.

Besides, it raised the following error occasionally:

mxnet.base.MXNetError: [12:01:59] src/operator/nn/./cudnn/cudnn_softmax_activation-inl.h:154: Check failed: e == CUDNN_STATUS_SUCCESS (3 vs. 0) cuDNN: CUDNN_STATUS_BAD_PARAM

But it sometimes works well again if I rerun the program, which seems very weird. Could you give me some help? Thank you. @chengdazhi

chengdazhi commented 6 years ago

Hi, it has also been found that mxnet 1.2.0 installed by pip has this error. I suggest you build it from source or install 1.1.0.

Again, I don't see why you need NaiveEngine. We have never tested our code on NaiveEngine before.
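The two pieces of advice in this thread (use mxnet 1.1.0, avoid NaiveEngine) could be checked before training starts. The sketch below is only an illustration: `preflight_warnings` and `SUGGESTED_VERSION` are hypothetical names, and treating every version other than 1.1.0 as worth a warning is an assumption of this sketch, not a rule from the repository.

```python
import os

# Version the maintainers suggest in this thread; other versions only
# trigger a warning here (an assumption of this sketch).
SUGGESTED_VERSION = "1.1.0"


def preflight_warnings(mxnet_version, environ=None):
    """Hypothetical check: return warnings for the given setup."""
    environ = os.environ if environ is None else environ
    warnings = []
    if mxnet_version != SUGGESTED_VERSION:
        warnings.append(
            "mxnet %s is not the suggested %s" % (mxnet_version, SUGGESTED_VERSION)
        )
    if environ.get("MXNET_ENGINE_TYPE") == "NaiveEngine":
        warnings.append("MXNET_ENGINE_TYPE=NaiveEngine is untested with this code")
    return warnings


# Example with the setup from the original report (mxnet 1.0.0, NaiveEngine).
for message in preflight_warnings("1.0.0", {"MXNET_ENGINE_TYPE": "NaiveEngine"}):
    print("warning:", message)
```

In a real script the first argument would come from `mxnet.__version__` after importing mxnet.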

Kongsea commented 6 years ago

I have downgraded to mxnet 1.1.0 and removed the NaiveEngine setting, and now it works well. Thank you.