Closed xiaoyongzhu closed 6 years ago
Thanks for reporting this. Both MXNet and MKL-DNN are currently under intensive development, and each MXNet release depends on a specific MKL-DNN version (for MXNet 1.2, it's commit f5218ff). Using the latest MKL-DNN with MXNet 1.2 is therefore not verified. FYI, we are working on integrating the MKL-DNN 0.15 release into the MXNet master branch.
Thanks @TaoLv - after changing to this specific commit, the previous problem went away. However, I ran into a new problem. It looks like something is wrong with the malloc process?
[18:33:58] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 37632 bytes with malloc directly
terminate called after throwing an instance of 'dmlc::Error'
what(): [18:33:58] src/engine/./threaded_engine.h:379: std::exception
A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
Stack trace returned 8 entries:
[bt] (0) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f12e64af2db]
[bt] (1) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f12e64b0318]
[bt] (2) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x13a3) [0x7f12e8ef7d23]
[bt] (3) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0xd9) [0x7f12e8f09e49]
[bt] (4) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7f12e8f08f3a]
[bt] (5) /opt/conda/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f132715ec5c]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x7494) [0x7f132eed4494]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f132e50bacf]
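The log above suggests setting MXNET_ENGINE_TYPE to NaiveEngine to get a synchronous backtrace. A minimal sketch of one way to do that from Python follows. It launches the failing script in a child process, so the variable is in place before MXNet loads and the parent shell's environment is left unchanged. The helper names naive_engine_env and run_with_naive_engine are hypothetical, not part of MXNet.

```python
import os
import subprocess
import sys


def naive_engine_env(base_env=None):
    """Return a copy of the environment with the synchronous engine selected.

    MXNET_ENGINE_TYPE and the value NaiveEngine come from the error message;
    everything else in this helper is illustrative.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MXNET_ENGINE_TYPE"] = "NaiveEngine"
    return env


def run_with_naive_engine(script_path, *args):
    """Run a script in a child process and return its exit code.

    The parent environment is not modified, so there is nothing to reset
    after debugging.
    """
    cmd = [sys.executable, script_path] + list(args)
    return subprocess.call(cmd, env=naive_engine_env())
```

For example, run_with_naive_engine("Deformable-ConvNets/fpn/demo_xview.py") would rerun the demo script from this thread with synchronous execution. To get the gdb backtrace the message mentions, the same environment could be passed to a gdb invocation instead.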
@xiaoyongzhu Does this error also exist in the MXNet 1.2 release? You can install it from pip:
pip install mxnet-mkl==1.2.0
There are some known issues in MXNet 1.2.0, and most of them are fixed in the 1.2.1 release. Would you mind trying it out? https://github.com/apache/incubator-mxnet/releases/tag/1.2.1
So I've followed the instructions above and the issue disappeared. It looks like there was an integration problem with MXNet.
However, there is a new exception. How should I deal with it? Is it because some custom operator is not supported by MKLDNN?
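Since the behaviour differs between 1.2.0 and 1.2.1, it can help to confirm which build is actually being imported before retesting. The sketch below is one possible check. The helper names are hypothetical, and the only MXNet attribute it relies on is mxnet.__version__.

```python
import re


def parse_version(text):
    """Turn a string such as '1.2.1' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def is_at_least(installed, minimum="1.2.1"):
    """Return True when the installed version is the minimum or newer."""
    return parse_version(installed) >= parse_version(minimum)


def installed_mxnet_version():
    """Return mxnet.__version__, or None when MXNet cannot be imported."""
    try:
        import mxnet
    except ImportError:
        return None
    return mxnet.__version__
```

Calling is_at_least(installed_mxnet_version()) after confirming the import succeeded shows whether the environment picked up the 1.2.1 fixes or is still using an older install.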
[18:30:44] src/operator/nn/mkldnn/mkldnn_base.cc:68: Allocate 191102976 bytes with malloc directly
Traceback (most recent call last):
File "Deformable-ConvNets/fpn/demo_xview.py", line 360, in <module>
main()
File "Deformable-ConvNets/fpn/demo_xview.py", line 307, in main
boxes, scores, classes = generate_detections(data, data_names, predictor, config, nms, image_list, num_preds)
File "Deformable-ConvNets/fpn/demo_xview.py", line 61, in generate_detections
scores, boxes, data_dict = im_detect(predictor, data_batch, data_names, scales, config)
File "/Deformable-ConvNets/fpn/core/tester.py", line 56, in im_detect
scores = output['cls_prob_reshape_output'].asnumpy()[0]
File "/opt/conda/lib/python2.7/site-packages/mxnet/ndarray/ndarray.py", line 1876, in asnumpy
ctypes.c_size_t(data.size)))
File "/opt/conda/lib/python2.7/site-packages/mxnet/base.py", line 149, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [18:30:58] src/ndarray/ndarray.cc:767: Check failed: !IsMKLDNNData() We can't generate TBlob for MKLDNN data. Please use Reorder2Default() to generate a new NDArray first
Stack trace returned 10 entries:
[bt] (0) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f2e326b515b]
[bt] (1) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f2e326b6198]
[bt] (2) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(mxnet::NDArray::SetTBlob() const+0x302) [0x7f2e34ca0472]
[bt] (3) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(mxnet::op::custom::AllocateNDArrayCopy(mxnet::NDArray**, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, unsigned long, int)+0x31d) [0x7f2e329a09bd]
[bt] (4) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(mxnet::op::custom::ForwardEx(mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x3e2) [0x7f2e3298dcb2]
[bt] (5) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2fa2a4a) [0x7f2e3510aa4a]
[bt] (6) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2fa2b01) [0x7f2e3510ab01]
[bt] (7) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0xcb5) [0x7f2e350696b5]
[bt] (8) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0xd9) [0x7f2e3507bec9]
[bt] (9) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7f2e3507afba]
Have you had a chance to try the MXNet 1.2.1 release? You can install it with:
pip install mxnet-mkl==1.2.1
If the problem is still there, could you kindly share a reproducible script for it? Then I can take a look. Thanks for your patience.
Closing due to lack of activity. Feel free to reopen with new data.
Hi team,
I ran into this problem when trying to use MKLDNN with MXNet. The code runs successfully with regular MKL, but with MKLDNN (the latest version) and MXNet 1.2, it fails with the following error:
It looks like something doesn't support MKLDNN? How should I deal with it?
Thanks!