microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Crash on embedded protoc #902

Closed xkszltl closed 5 years ago

xkszltl commented 5 years ago

Describe the bug

This builds fine on CentOS 7 with gcc-8 (devtoolset-8). Building it on the latest Ubuntu docker image (18.04.2) with gcc-8 fails. Since the option for injecting our own protobuf was removed a while ago, could you help with this?

Here's the log

[254/691] cd /tmp/scratch/onnxruntime/build && /usr/local/bin/cmake -E make_directory /tmp/scratch/onnxruntime/build/CMakeFiles && /usr/local/bin/cmake -E touch /tmp/scratch/onnxruntime/build/CMakeFiles/project_mkldnn-complete && /usr/local/bin/cmake -E touch /tmp/scratch/onnxruntime/build/mkl-dnn/src/project_mkldnn-stamp/project_mkldnn-done
[255/691] : && /usr/bin/g++-8  -fdebug-prefix-map='/tmp/scratch'='/usr/local/src' -g -fopenmp -O3 -DNDEBUG -march=native -mtune=native -flto -fno-fat-lto-objects  -rdynamic external/protobuf/cmake/CMakeFiles/protoc.dir/__/src/google/protobuf/compiler/main.cc.o  -o external/protobuf/cmake/protoc  external/protobuf/cmake/libprotobuf.a external/protobuf/cmake/libprotoc.a external/protobuf/cmake/libprotobuf.a /usr/lib/x86_64-linux-gnu/libz.so && :
[256/691] cd /tmp/scratch/onnxruntime/build/onnx && /tmp/scratch/onnxruntime/build/external/protobuf/cmake/protoc --cpp_out /tmp/scratch/onnxruntime/build/onnx -I /tmp/scratch/onnxruntime/onnxruntime/core/protobuf /tmp/scratch/onnxruntime/onnxruntime/core/protobuf/onnx-operators-ml.proto
FAILED: onnx/onnx-operators-ml.pb.h onnx/onnx-operators-ml.pb.cc
cd /tmp/scratch/onnxruntime/build/onnx && /tmp/scratch/onnxruntime/build/external/protobuf/cmake/protoc --cpp_out /tmp/scratch/onnxruntime/build/onnx -I /tmp/scratch/onnxruntime/onnxruntime/core/protobuf /tmp/scratch/onnxruntime/onnxruntime/core/protobuf/onnx-operators-ml.proto
terminate called after throwing an instance of 'std::system_error'
  what():  Unknown error -1
Aborted (core dumped)
[257/691] cd /tmp/scratch/onnxruntime/build/onnx && /tmp/scratch/onnxruntime/build/external/protobuf/cmake/protoc --cpp_out /tmp/scratch/onnxruntime/build/onnx -I /tmp/scratch/onnxruntime/onnxruntime/core/protobuf /tmp/scratch/onnxruntime/onnxruntime/core/protobuf/onnx-ml.proto
FAILED: onnx/onnx-ml.pb.h onnx/onnx-ml.pb.cc
cd /tmp/scratch/onnxruntime/build/onnx && /tmp/scratch/onnxruntime/build/external/protobuf/cmake/protoc --cpp_out /tmp/scratch/onnxruntime/build/onnx -I /tmp/scratch/onnxruntime/onnxruntime/core/protobuf /tmp/scratch/onnxruntime/onnxruntime/core/protobuf/onnx-ml.proto
terminate called after throwing an instance of 'std::system_error'
  what():  Unknown error -1
Aborted (core dumped)
ninja: build stopped: subcommand failed.

Note that it is not that complicated to reproduce, because even --version crashes:

/tmp/scratch/onnxruntime/build/external/protobuf/cmake/protoc --version
terminate called after throwing an instance of 'std::system_error'
  what():  Unknown error -1
Aborted (core dumped)

And FYI here's the ldd:

ldd /tmp/scratch/onnxruntime/build/external/protobuf/cmake/protoc
        linux-vdso.so.1 (0x00007ffc01b2f000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fa124ee3000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fa124b5a000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fa124942000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fa125824000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fa1245a4000)

System information

xkszltl commented 5 years ago

@snnn

I tried running the failing command with the protoc from our own protobuf build on the same system, and it works just fine.

Here's how we build protobuf: https://github.com/xkszltl/Roaster/blob/master/pkgs/protobuf.sh

And here's the script for onnxruntime: https://github.com/xkszltl/Roaster/blob/master/pkgs/ort.sh

xkszltl commented 5 years ago

Interesting... LTO does affect the result. With LTO (confirmed by searching for -flto in the actual command) on CentOS + gcc 8.2.1 there's no issue. With LTO on Ubuntu + gcc 8.2.0 it crashes. Without LTO it works on Ubuntu, and the build moves forward until it hits the next error:

/tmp/scratch/onnxruntime/onnxruntime/core/providers/mkldnn/nn/pool.cc:257:103: error: 'ceil_mode_' was not declared in this scope
   std::vector<int64_t> y_dims = PoolBase::SetOutputSize(x_shape, x_shape[1], &pads, this->dilations_, ceil_mode_);

On Apr 24, 2019, at 23:21, Changming Sun notifications@github.com wrote:

Could you please let us know how to reproduce this error? We'd like to help, but I believe it's not an onnxruntime bug, as protoc is from Google. onnxruntime depends on protoc, not vice versa. If protoc crashed, the only reason is that it was built in an unsupported environment or with unsupported compiler flags (e.g. LTO). Whatever the root cause is, we can't fix it. Only Google can.

pranavsharma commented 5 years ago

@xkszltl, thanks for reporting this. As of now we don't support Ubuntu 18+gcc8. However, if this is a strong business requirement, I would strongly encourage you to contribute.

xkszltl commented 5 years ago

@pranavsharma We would like to, but given our tight schedule we won't have enough time to contribute back in the near term. Since we also have a Caffe2 code path, we're not totally blocked yet. The ceil_mode_ error is a regression from tonight. I opened a new issue, #903, for it.

snnn commented 5 years ago

Closing this, as I can confirm it's a gcc (or ld) bug. "The weak symbols in libstdc++.so might make the linker think it doesn't need the real ones from libpthread.so"

I'll submit a PR to bypass it.

xkszltl commented 5 years ago

FYI I also tried gcc 8.3.0, released recently for Ubuntu, and got the same error.

Godricly commented 5 years ago

Building with GCC 7.4.0 fails using

./build.sh --config RelWithDebInfo --build_wheel --use_mklml

with commit id

8a9c4cd9368c2157a3a4f1e59c834ff3d0fa3466

outputs:

terminate called after throwing an instance of 'std::system_error'
  what():  Unknown error -1
Aborted (core dumped)
onnx/CMakeFiles/onnx_proto.dir/build.make:70: recipe for target 'onnx/onnx-operators-ml.pb.h' failed
make[2]: *** [onnx/onnx-operators-ml.pb.h] Error 134
CMakeFiles/Makefile2:1915: recipe for target 'onnx/CMakeFiles/onnx_proto.dir/all' failed
make[1]: *** [onnx/CMakeFiles/onnx_proto.dir/all] Error 2
Makefile:140: recipe for target 'all' failed
make: *** [all] Error 2
Traceback (most recent call last):

snnn commented 5 years ago

Please set onnxruntime_ENABLE_LTO to OFF to bypass the compiler bug.

Godricly commented 5 years ago

@snnn Thank you. It works. But my model is not able to run with MKLDNN. Is there any special graph optimization for MKLDNN? It works for pip installed onnxruntime.

include/onnxruntime/core/graph/graph.h:915 onnxruntime::Node* onnxruntime::Graph::NodeAtIndexImpl(onnxruntime::NodeIndex) const node_index < nodes_.size() was false. Validating no unexpected access using an invalid node_index.

snnn commented 5 years ago

@jywu-msft could you help ?

jywu-msft commented 5 years ago

Are we talking about mkl-dnn or mklml? The command line above uses --use_mklml: ./build.sh --config RelWithDebInfo --build_wheel --use_mklml

There shouldn't be anything different for the mklml path with respect to the graph.

Godricly commented 5 years ago

It's MKLDNN. If I use this:

./build.sh --config RelWithDebInfo --build_wheel --use_mklml

Everything works fine, but I didn't see any performance gain. So I tried with MKLDNN ON.

jywu-msft commented 5 years ago

Thanks for confirming. Is it possible to share the model? Can you please file a new issue for tracking? @sreekanth-yalachigere, can you take a look?

sreekanth-yalachigere commented 5 years ago

@jywu-msft yes. @Godricly, can you share the model?

Godricly commented 5 years ago

How can I share it with you? It's about 100MB.

sreekanth-yalachigere commented 5 years ago

Google Drive?

Godricly commented 5 years ago

I'll post a new issue when I get a code sample.

Godricly commented 5 years ago

@sreekanth-yalachigere Can you go over this part? graph_viewer.MaxNodeIndex() is 175 for my model, and somehow when temp_index == 175 the node is NULL. The while loop then runs on to 257 and raises an exception:
https://github.com/microsoft/onnxruntime/blob/b43254282fb0699992bcfca63ceed7f74168a663/onnxruntime/core/providers/mkldnn/mkldnn_execution_provider.cc#L301-L323

After commenting out the assert at
https://github.com/microsoft/onnxruntime/blob/b43254282fb0699992bcfca63ceed7f74168a663/include/onnxruntime/core/graph/graph.h#L915
the backtrace shows an error at
https://github.com/microsoft/onnxruntime/blob/b43254282fb0699992bcfca63ceed7f74168a663/onnxruntime/core/providers/mkldnn/mkldnn_execution_provider.cc#L310

With ORT_MKLDNN_SUBGRAPH set, it works fine.