xkszltl closed this issue 5 years ago
@snnn
I tried replacing the crashed command with our protobuf build on the same system and it works just fine.
Here's how we build protobuf: https://github.com/xkszltl/Roaster/blob/master/pkgs/protobuf.sh
And here's the script for onnxruntime: https://github.com/xkszltl/Roaster/blob/master/pkgs/ort.sh
Interesting... LTO does affect the result. With LTO (confirmed by searching for -flto in the actual command) on CentOS + gcc 8.2.1 there's no issue. With LTO on Ubuntu + gcc 8.2.0 it crashes. Without LTO it works on Ubuntu, and the build moves forward until it hits the next error:
/tmp/scratch/onnxruntime/onnxruntime/core/providers/mkldnn/nn/pool.cc:257:103: error: 'ceil_mode_' was not declared in this scope
std::vector<int64_t> y_dims = PoolBase::SetOutputSize(x_shape, x_shape[1], &pads, this->dilations_, ceil_mode_);
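The failing line qualifies `this->dilations_` but uses a bare `ceil_mode_`. One common cause of this diagnostic is C++ two-phase name lookup: inside a class template, members of a templated (dependent) base are not found by unqualified lookup. Whether that is the actual cause here is an assumption; the real fix is tracked in #903. A minimal sketch, with hypothetical `ToyPoolBase` and `ToyPool` types that are not the onnxruntime classes:

```cpp
#include <cstdint>

// Hypothetical stand-in for a templated base that owns the member.
template <typename T>
struct ToyPoolBase {
  std::int64_t ceil_mode_ = 1;
};

// Hypothetical derived template; its base depends on T.
template <typename T>
struct ToyPool : ToyPoolBase<T> {
  std::int64_t CeilMode() const {
    // return ceil_mode_;     // gcc: 'ceil_mode_' was not declared in this scope
    return this->ceil_mode_;  // lookup is deferred until instantiation, so this compiles
  }
};
```

Qualifying the name with `this->` (or `ToyPoolBase<T>::`) makes it a dependent name, which is why the neighbouring `this->dilations_` would compile while a bare `ceil_mode_` would not.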
On Apr 24, 2019, at 23:21, Changming Sun notifications@github.com wrote:
Could you please let us know how to reproduce this error? We'd like to help, but I believe it's not an onnxruntime bug, as protoc is from Google. onnxruntime depends on protoc, not vice versa. If protoc crashed, the only reason is that it was built in an unsupported environment or with unsupported compiler flags (e.g. LTO). Whatever the root cause is, we can't fix it. Only Google can.
@xkszltl, thanks for reporting this. As of now we don't support Ubuntu 18+gcc8. However, if this is a strong business requirement, I would strongly encourage you to contribute.
@pranavsharma
We would like to but, based on our tight schedule, we won't have enough time to contribute back shortly.
Since we also have Caffe2 code path we're not totally blocked yet.
And the ceil_mode_ error is a regression as of tonight.
I opened a new issue #903 for it.
Closing this, as I can confirm it's a gcc (or ld) bug. "The weak symbols in libstdc++.so might make the linker think it doesn't need the real ones from libpthread.so"
I'll submit a PR to bypass it.
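For context, one well-known way to hit `std::system_error` with "Unknown error -1" is `std::call_once` running in a process where the pthread symbols were never actually linked, which is consistent with the weak-symbol explanation quoted above. That this is the exact call that failed inside protoc is an assumption. The sketch below uses a hypothetical helper, `ProbeCallOnce`, to check whether the threading runtime is usable in the current binary:

```cpp
#include <mutex>
#include <system_error>

// Hypothetical probe: returns true if std::call_once executed the callable,
// false if the runtime threw std::system_error instead.
// `ran` reports whether the callable actually executed.
inline bool ProbeCallOnce(bool& ran) {
  std::once_flag flag;
  ran = false;
  try {
    std::call_once(flag, [&ran] { ran = true; });
    return true;
  } catch (const std::system_error&) {
    // On affected toolchains this is where "Unknown error -1" would surface.
    return false;
  }
}
```

On a toolchain where the probe returns false, compiling and linking with `-pthread` is the usual first thing to try; whether that alone is enough for the LTO case in this thread is not confirmed here.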
FYI I also tried gcc 8.3.0 in Ubuntu released recently and got the same error
Building with GCC 7.4.0 fails using
./build.sh --config RelWithDebInfo --build_wheel --use_mklml
with commit id
8a9c4cd9368c2157a3a4f1e59c834ff3d0fa3466
outputs:
terminate called after throwing an instance of 'std::system_error'
what(): Unknown error -1
Aborted (core dumped)
onnx/CMakeFiles/onnx_proto.dir/build.make:70: recipe for target 'onnx/onnx-operators-ml.pb.h' failed
make[2]: *** [onnx/onnx-operators-ml.pb.h] Error 134
CMakeFiles/Makefile2:1915: recipe for target 'onnx/CMakeFiles/onnx_proto.dir/all' failed
make[1]: *** [onnx/CMakeFiles/onnx_proto.dir/all] Error 2
Makefile:140: recipe for target 'all' failed
make: *** [all] Error 2
Traceback (most recent call last):
Please set onnxruntime_ENABLE_LTO to OFF to bypass the compiler bug.
@snnn Thank you. It works. But my model is not able to run with MKLDNN. Is there any special graph optimization for MKLDNN? It works for pip installed onnxruntime.
include/onnxruntime/core/graph/graph.h:915 onnxruntime::Node* onnxruntime::Graph::NodeAtIndexImpl(onnxruntime::NodeIndex) const node_index < nodes_.size() was false. Validating no unexpected access using an invalid node_index.
@jywu-msft could you help ?
Are we talking about mkl-dnn or mklml? The command line above has --use_mklml: ./build.sh --config RelWithDebInfo --build_wheel --use_mklml
There shouldn't be anything different for the mklml path with respect to the graph.
It's MKLDNN. If I use this:
./build.sh --config RelWithDebInfo --build_wheel --use_mklml
everything works fine, but I didn't see any performance gain. So I tried with MKLDNN ON.
Thanks for confirming. Is it possible to share the model? Can you please file a new issue for tracking? @sreekanth-yalachigere, can you take a look?
@jywu-msft yes. @Godricly, can you share the model?
How can I share it with you? It's about 100 MB.
Google Drive?
I'll post a new issue when I get a code sample.
@sreekanth-yalachigere Can you go over this part? graph_viewer.MaxNodeIndex() is 175 for my model, and somehow when temp_index == 175 the node is NULL. It loops up to 257 inside the while loop and raises an exception. https://github.com/microsoft/onnxruntime/blob/b43254282fb0699992bcfca63ceed7f74168a663/onnxruntime/core/providers/mkldnn/mkldnn_execution_provider.cc#L301-L323 After commenting out the assert at https://github.com/microsoft/onnxruntime/blob/b43254282fb0699992bcfca63ceed7f74168a663/include/onnxruntime/core/graph/graph.h#L915 the backtrace points to an error at https://github.com/microsoft/onnxruntime/blob/b43254282fb0699992bcfca63ceed7f74168a663/onnxruntime/core/providers/mkldnn/mkldnn_execution_provider.cc#L310
With ORT_MKLDNN_SUBGRAPH set, it works fine.
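The failure described above (a null node at an index equal to MaxNodeIndex(), followed by a scan that runs past the end) is the kind of problem a bounds-checked skip loop avoids. A minimal sketch, assuming a graph that stores nodes in an index-addressed array where removed nodes leave null slots; `ToyNode`, `ToyGraph`, and `NextLiveIndex` are hypothetical names, not the onnxruntime types or its actual fix:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-ins; these are NOT the onnxruntime classes.
struct ToyNode {
  int id;
};

struct ToyGraph {
  std::vector<std::unique_ptr<ToyNode>> nodes;  // removed nodes leave a null slot

  // One past the largest valid index, so valid indices are [0, MaxNodeIndex()).
  std::size_t MaxNodeIndex() const { return nodes.size(); }

  // Returns nullptr for removed nodes and for out-of-range indices.
  const ToyNode* GetNode(std::size_t i) const {
    return i < nodes.size() ? nodes[i].get() : nullptr;
  }
};

// Advances from `start` to the first index that holds a live node.
// Returns MaxNodeIndex() when no live node remains, so the caller can stop
// instead of indexing past the end.
inline std::size_t NextLiveIndex(const ToyGraph& g, std::size_t start) {
  std::size_t i = start;
  while (i < g.MaxNodeIndex() && g.GetNode(i) == nullptr) {
    ++i;
  }
  return i;
}
```

A caller would compare the result against `MaxNodeIndex()` before dereferencing, treating equality as "no more nodes" rather than as a valid index.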
Describe the bug
This works on CentOS-7 with gcc-8 (devtoolset-8). I tried building it on the latest Ubuntu Docker image (18.04.2) with gcc-8 and it doesn't work. Since the option for injecting our own protobuf was removed a while ago, could you help with this?
Here's the log
Note that this is not that complex because.....
And FYI here's the ldd:
System information