oneapi-src / oneDNN

oneAPI Deep Neural Network Library (oneDNN)
https://uxlfoundation.org
Apache License 2.0

Unknown MKLDNN format for 4 dimensions: 35 #281

Closed xiaoyongzhu closed 6 years ago

xiaoyongzhu commented 6 years ago

Hi team,

I ran into this problem while trying to use MKLDNN with MXNet. The code runs successfully with regular MKL, but with MKLDNN (the latest version) and MXNet 1.2 it fails with the following error:

  File "demo.py", line 135, in <module>
    main()
  File "demo.py", line 122, in main
    all_detections.append(tester.get_detections(vis=False, evaluate=False, cache_name=None))
  File "lib/inference.py", line 233, in get_detections
    scores, boxes, data, im_ids = self.detect(batch, scales)
  File "lib/inference.py", line 106, in detect
    gpu_rois = gpu_out[self.rpn_output_names['rois']].asnumpy()
  File "/opt/conda/lib/python2.7/site-packages/mxnet/ndarray/ndarray.py", line 1876, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/opt/conda/lib/python2.7/site-packages/mxnet/base.py", line 149, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [00:09:58] src/operator/nn/mkldnn/mkldnn_base.cc:226: Unknown MKLDNN format for 4 dimensions: 35

Looks like there's something that doesn't support MKLDNN? How should I deal with it?

Thanks!

TaoLv commented 6 years ago

Thanks for reporting this. Currently, both MXNet and MKL-DNN are under intensive development. MXNet depends on a specific MKL-DNN version (for MXNet 1.2, it's commit f5218ff), so using the latest MKL-DNN with MXNet 1.2 is not verified. FYI, we are working on integrating the MKL-DNN 0.15 release into the MXNet master branch.

xiaoyongzhu commented 6 years ago

Thanks @TaoLv - after switching to that specific commit, the previous problem went away. However, I hit a new problem... looks like something is wrong with the malloc process?

[18:33:58] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 37632 bytes with malloc directly
terminate called after throwing an instance of 'dmlc::Error'
  what():  [18:33:58] src/engine/./threaded_engine.h:379: std::exception
A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 8 entries:
[bt] (0) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f12e64af2db]
[bt] (1) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f12e64b0318]
[bt] (2) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x13a3) [0x7f12e8ef7d23]
[bt] (3) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0xd9) [0x7f12e8f09e49]
[bt] (4) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7f12e8f08f3a]
[bt] (5) /opt/conda/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f132715ec5c]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x7494) [0x7f132eed4494]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f132e50bacf]
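The error text above suggests setting MXNET_ENGINE_TYPE to NaiveEngine so that operations run synchronously and the backtrace points at the failing call. A minimal sketch of one way to do that from Python follows. The helper names `naive_engine_env` and `run_synchronously` are hypothetical and not part of MXNet; only the environment variable name and value come from the log message.

```python
import os
import subprocess
import sys


def naive_engine_env(base_env=None):
    """Return a copy of the environment with the naive engine selected.

    Only the variable name and value are taken from the MXNet error text;
    the helper itself is a hypothetical convenience.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MXNET_ENGINE_TYPE"] = "NaiveEngine"
    return env


def run_synchronously(script_args):
    """Run `python <script_args...>` in a child process with the naive engine.

    The parent's environment is left untouched, so there is nothing to
    reset after debugging.
    """
    command = [sys.executable] + list(script_args)
    return subprocess.call(command, env=naive_engine_env())
```

Calling `run_synchronously(["demo.py"])` would start the script with the variable set only for that child process. Running the same command under gdb, as the message recommends, is a separate step that this sketch does not cover.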

TaoLv commented 6 years ago

@xiaoyongzhu Does this error also exist in mxnet 1.2 release? You can install it from pip:

pip install mxnet-mkl==1.2.0

There are some known issues in mxnet 1.2.0, and most of them are fixed in the 1.2.1 release. Would you mind trying it out? https://github.com/apache/incubator-mxnet/releases/tag/1.2.1
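Before retrying, it can help to confirm which mxnet-mkl build is actually installed. The following is a rough sketch, assuming Python 3.8 or newer for `importlib.metadata`. The helper names are hypothetical, and the 1.2.1 threshold is simply the release suggested in this reply.

```python
from importlib import metadata


def parse_version(text):
    """Turn '1.2.1' into (1, 2, 1). Suffixes such as 'rc1' are not handled."""
    return tuple(int(part) for part in text.split("."))


def installed_version(package="mxnet-mkl"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def has_suggested_fixes(version_text, minimum="1.2.1"):
    """True if version_text is at least the release suggested in the thread."""
    return parse_version(version_text) >= parse_version(minimum)
```

For example, `has_suggested_fixes("1.2.0")` is false and `has_suggested_fixes("1.2.1")` is true. Comparing tuples of integers rather than raw strings keeps a version such as 1.10.0 ordered after 1.2.1.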

xiaoyongzhu commented 6 years ago

So I've followed the instructions above and the issue disappeared. Looks like there was some integration problem with mxnet.

However... there is a new exception - how should I deal with this? Is it because some customized operator is not supported by MKLDNN?

[18:30:44] src/operator/nn/mkldnn/mkldnn_base.cc:68: Allocate 191102976 bytes with malloc directly
Traceback (most recent call last):
  File "Deformable-ConvNets/fpn/demo_xview.py", line 360, in <module>
    main()
  File "Deformable-ConvNets/fpn/demo_xview.py", line 307, in main
    boxes, scores, classes = generate_detections(data, data_names, predictor, config, nms, image_list, num_preds)
  File "Deformable-ConvNets/fpn/demo_xview.py", line 61, in generate_detections
    scores, boxes, data_dict = im_detect(predictor, data_batch, data_names, scales, config)
  File "/Deformable-ConvNets/fpn/core/tester.py", line 56, in im_detect
    scores = output['cls_prob_reshape_output'].asnumpy()[0]
  File "/opt/conda/lib/python2.7/site-packages/mxnet/ndarray/ndarray.py", line 1876, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/opt/conda/lib/python2.7/site-packages/mxnet/base.py", line 149, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [18:30:58] src/ndarray/ndarray.cc:767: Check failed: !IsMKLDNNData() We can't generate TBlob for MKLDNN data. Please use Reorder2Default() to generate a new NDArray first

Stack trace returned 10 entries:
[bt] (0) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f2e326b515b]
[bt] (1) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f2e326b6198]
[bt] (2) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(mxnet::NDArray::SetTBlob() const+0x302) [0x7f2e34ca0472]
[bt] (3) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(mxnet::op::custom::AllocateNDArrayCopy(mxnet::NDArray**, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, unsigned long, int)+0x31d) [0x7f2e329a09bd]
[bt] (4) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(mxnet::op::custom::ForwardEx(mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x3e2) [0x7f2e3298dcb2]
[bt] (5) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2fa2a4a) [0x7f2e3510aa4a]
[bt] (6) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2fa2b01) [0x7f2e3510ab01]
[bt] (7) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0xcb5) [0x7f2e350696b5]
[bt] (8) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0xd9) [0x7f2e3507bec9]
[bt] (9) /opt/conda/lib/python2.7/site-packages/mxnet/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7f2e3507afba]
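The failed check above says the array is still held in an MKLDNN-specific layout and must be reordered to the default layout before a plain buffer can be exposed. As a toy illustration of what such a reorder does, here is a pure-Python model of a channel-blocked layout (nChw8c, one of the blocked layouts MKL-DNN supports). This is not MXNet's Reorder2Default(). The thread does not say which format is involved, so nChw8c is an assumption made for illustration, and every function name is hypothetical.

```python
BLOCK = 8  # channel block size of the assumed nChw8c layout


def plain_offset(n, c, h, w, shape):
    """Flat offset of element (n, c, h, w) in a default NCHW buffer."""
    _, C, H, W = shape
    return ((n * C + c) * H + h) * W + w


def blocked_offset(n, c, h, w, shape):
    """Flat offset of the same element in an nChw8c buffer."""
    _, C, H, W = shape
    c_blocks = C // BLOCK
    cb, ci = divmod(c, BLOCK)  # which channel block, and position inside it
    return (((n * c_blocks + cb) * H + h) * W + w) * BLOCK + ci


def _copy(src, shape, src_offset, dst_offset):
    """Copy every element from one layout to the other."""
    N, C, H, W = shape
    if C % BLOCK != 0:
        raise ValueError("sketch assumes channels are a multiple of BLOCK")
    dst = [0] * (N * C * H * W)
    for n in range(N):
        for c in range(C):
            for h in range(H):
                for w in range(W):
                    idx = (n, c, h, w, shape)
                    dst[dst_offset(*idx)] = src[src_offset(*idx)]
    return dst


def to_blocked(plain, shape):
    """Pack an NCHW buffer into the blocked layout."""
    return _copy(plain, shape, plain_offset, blocked_offset)


def reorder_to_default(blocked, shape):
    """Unpack a blocked buffer back into plain NCHW order."""
    return _copy(blocked, shape, blocked_offset, plain_offset)
```

The same values sit at different flat offsets in the two layouts, so code that reads the blocked buffer as if it were NCHW would see scrambled data. That is consistent with the check refusing to hand out a TBlob until the data has been reordered.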

TaoLv commented 6 years ago

Have you had a chance to try the mxnet 1.2.1 release? You can simply install it with:

pip install mxnet-mkl==1.2.1

If the problem is still there, could you kindly share a reproducible script for it? Then I can take a look. Thanks for your patience.

vpirogov commented 6 years ago

Closing due to lack of activity. Feel free to reopen with new data.