tensorflow / serving

A flexible, high-performance serving system for machine learning models
https://www.tensorflow.org/serving
Apache License 2.0

Fully static linked tensorflow_model_server gives seg fault #179

Closed Vesnica closed 6 years ago

Vesnica commented 8 years ago

Hi, community

I'm trying to build a fully statically linked tensorflow_model_server so it can run on Alpine Linux, which doesn't have glibc and some other necessary libraries.

I checked Bazel's docs (https://www.bazel.io/versions/master/docs/be/c-cpp.html) and found that linkopts = ["-static"] can produce a fully static binary, so I modified serving/tensorflow_serving/model_servers/BUILD as follows:

cc_binary(
    name = "tensorflow_model_server",
    linkopts = ["-static"],   # ADDED
    srcs = [
        "main.cc",
    ],
    visibility = ["//tensorflow_serving:internal"],
    deps = [
        ":server_core",
        "@protobuf//:cc_wkt_protos",
        "@org_tensorflow//tensorflow/core:lib",
        "@org_tensorflow//tensorflow/core/platform/cloud:gcs_file_system",
        "//tensorflow_serving/apis:prediction_service_proto",
        "//tensorflow_serving/core:servable_state_monitor",
        "@grpc//:grpc++",
    ] + TENSORFLOW_DEPS + SUPPORTED_TENSORFLOW_OPS,
)
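
For reference, an alternative that avoids editing the BUILD file is to pass the same flag on the Bazel command line (a sketch, assuming a Bazel version that supports --linkopt; not taken from this thread):

# pass -static to every link action instead of adding linkopts to the cc_binary
bazel build //tensorflow_serving/model_servers:tensorflow_model_server --linkopt=-static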

The build completed successfully, and ldd tensorflow_model_server reports: not a dynamic executable
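
A quick way to double-check the linkage (a sketch; the path assumes the default bazel-bin output location):

# both commands should confirm a fully static binary
ldd bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server
file bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server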

But it segfaults when trying to serve a model:

vesnica@vesnica:~/serving_base$ ./tensorflow_model_server
Usage: model_server [--port=8500] [--enable_batching] [--model_name=my_name] --model_base_path=/path/to/export
vesnica@vesnica:~/serving_base$ ./tensorflow_model_server --model_base_path=model
I tensorflow_serving/model_servers/main.cc:122] Building single TensorFlow model file config:  model_name: default model_base_path: model
I tensorflow_serving/core/basic_manager.cc:190] Using InlineExecutor for BasicManager.
I tensorflow_serving/model_servers/server_core.cc:128] Adding models to manager.
I tensorflow_serving/model_servers/server_core.cc:77]  Adding model: default
I tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:252] File-system polling update: Servable:{name: default version: 1}; Servable path: model/00000001; Polling frequency: 30
I tensorflow_serving/core/loader_harness.cc:70] Approving load for servable version {name: default version: 1}
I tensorflow_serving/core/loader_harness.cc:85] Loading servable version {name: default version: 1}
I external/org_tensorflow/tensorflow/contrib/session_bundle/session_bundle.cc:142] Attempting to load a SessionBundle from: model/00000001
I external/org_tensorflow/tensorflow/contrib/session_bundle/session_bundle.cc:143] Using RunOptions: 
I external/org_tensorflow/tensorflow/contrib/session_bundle/session_bundle.cc:106] Running restore op for SessionBundle
I external/org_tensorflow/tensorflow/contrib/session_bundle/session_bundle.cc:218] Done loading SessionBundle. Took 1 seconds.
I tensorflow_serving/core/loader_harness.cc:118] Successfully loaded servable version {name: default version: 1}
Segmentation fault (core dumped)

gdb information:

vesnica@vesnica:~/serving_base$ gdb tensorflow_model_server 
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from tensorflow_model_server...(no debugging symbols found)...done.
(gdb) run --model_base_path=model
Starting program: /home/vesnica/serving_base/tensorflow_model_server --model_base_path=model
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
I tensorflow_serving/model_servers/main.cc:122] Building single TensorFlow model file config:  model_name: default model_base_path: model
I tensorflow_serving/core/basic_manager.cc:190] Using InlineExecutor for BasicManager.
[New Thread 0x7ffff7ffa700 (LWP 18028)]
I tensorflow_serving/model_servers/server_core.cc:128] Adding models to manager.
I tensorflow_serving/model_servers/server_core.cc:77]  Adding model: default
[New Thread 0x7ffff77f9700 (LWP 18029)]
I tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:252] File-system polling update: Servable:{name: default version: 1}; Servable path: model/00000001; Polling frequency: 30
I tensorflow_serving/core/loader_harness.cc:70] Approving load for servable version {name: default version: 1}
I tensorflow_serving/core/loader_harness.cc:85] Loading servable version {name: default version: 1}
I external/org_tensorflow/tensorflow/contrib/session_bundle/session_bundle.cc:142] Attempting to load a SessionBundle from: model/00000001
I external/org_tensorflow/tensorflow/contrib/session_bundle/session_bundle.cc:143] Using RunOptions: 
[New Thread 0x7ffff6ff8700 (LWP 18030)]
[New Thread 0x7ffff67f7700 (LWP 18031)]
[New Thread 0x7ffff5ff6700 (LWP 18032)]
[New Thread 0x7ffff57f5700 (LWP 18033)]
[New Thread 0x7ffff4ff4700 (LWP 18034)]
[New Thread 0x7fffeffff700 (LWP 18035)]
[New Thread 0x7fffef7fe700 (LWP 18036)]
[New Thread 0x7fffeeffd700 (LWP 18037)]
I external/org_tensorflow/tensorflow/contrib/session_bundle/session_bundle.cc:106] Running restore op for SessionBundle
[New Thread 0x7fffee7fc700 (LWP 18038)]
I external/org_tensorflow/tensorflow/contrib/session_bundle/session_bundle.cc:218] Done loading SessionBundle. Took 1 seconds.
I tensorflow_serving/core/loader_harness.cc:118] Successfully loaded servable version {name: default version: 1}
[New Thread 0x7fffedffb700 (LWP 18039)]
[New Thread 0x7fffed7fa700 (LWP 18040)]
[New Thread 0x7fffecff9700 (LWP 18041)]
[New Thread 0x7fffe3fff700 (LWP 18042)]

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x0000000000545c23 in gpr_getenv ()
#2  0x0000000000545fd0 in gpr_log_verbosity_init ()
#3  0x000000000045d24f in do_basic_init ()
#4  0x0000000002fbe800 in pthread_once ()
#5  0x0000000000549e86 in gpr_once_init ()
#6  0x000000000045d57e in grpc_init ()
#7  0x000000000044aef7 in grpc::internal::GrpcLibrary::init() ()
#8  0x0000000000442b92 in grpc::GrpcLibraryCodegen::GrpcLibraryCodegen() ()
#9  0x0000000000448d81 in grpc::Server::Server(grpc::ThreadPoolInterface*, bool, int, grpc::ChannelArguments*) ()
#10 0x000000000044f7dc in grpc::ServerBuilder::BuildAndStart() ()
#11 0x0000000000403d3e in (anonymous namespace)::RunServer(int, std::unique_ptr<tensorflow::serving::ServerCore, std::default_delete<tensorflow::serving::ServerCore> >) ()
#12 0x00000000004043ba in main ()
(gdb) 
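
One hedged note on the backtrace: a jump to address 0x0 from inside gpr_once_init/pthread_once is a pattern sometimes seen when a fully static glibc link leaves weak libpthread symbols unresolved. A commonly tried workaround (only a sketch, not verified as the fix for this particular crash) is to force the whole pthread archive into the link:

# hypothetical workaround only -- not confirmed against this segfault
bazel build //tensorflow_serving/model_servers:tensorflow_model_server \
  --linkopt=-static \
  --linkopt=-Wl,--whole-archive --linkopt=-lpthread --linkopt=-Wl,--no-whole-archive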

Is there anything I've done wrong, or am I missing some critical step? Any help is greatly appreciated.

kirilg commented 8 years ago

I reproduced the steps you mentioned, loaded half_plus_two as a simple test, and was able to query it successfully with no segfault. Do you always get a segfault, or only with a specific model? Can you try half_plus_two if you haven't already?

The linkopts changes you made look right. I also confirmed ldd tensorflow_model_server returns not a dynamic executable on my machine.

Steps:

1. Export the model to /tmp/half_plus_two:

   rm /tmp/half_plus_two
   bazel run tensorflow_serving/servables/tensorflow/testdata:export_half_plus_two

2. Start a server:

   bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --model_base_path=/tmp/half_plus_two

3. Query it using a custom test client:

   bazel build tensorflow_serving/model_servers:tensorflow_model_server_test_client
   bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server_test_client

Vesnica commented 8 years ago

Unfortunately, serving half_plus_two gives the same result:

ubuntu@7ab04e2b2ec6:~/serving$ bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --model_base_path=/tmp/half_plus_two/
I tensorflow_serving/model_servers/main.cc:122] Building single TensorFlow model file config:  model_name: default model_base_path: /tmp/half_plus_two/
I tensorflow_serving/core/basic_manager.cc:190] Using InlineExecutor for BasicManager.
I tensorflow_serving/model_servers/server_core.cc:128] Adding models to manager.
I tensorflow_serving/model_servers/server_core.cc:77]  Adding model: default
I tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:252] File-system polling update: Servable:{name: default version: 123}; Servable path: /tmp/half_plus_two/00000123; Polling frequency: 30
I tensorflow_serving/core/loader_harness.cc:70] Approving load for servable version {name: default version: 123}
I tensorflow_serving/core/loader_harness.cc:85] Loading servable version {name: default version: 123}
I external/org_tensorflow/tensorflow/contrib/session_bundle/session_bundle.cc:142] Attempting to load a SessionBundle from: /tmp/half_plus_two/00000123
I external/org_tensorflow/tensorflow/contrib/session_bundle/session_bundle.cc:143] Using RunOptions:
I external/org_tensorflow/tensorflow/contrib/session_bundle/session_bundle.cc:106] Running restore op for SessionBundle
I external/org_tensorflow/tensorflow/contrib/session_bundle/session_bundle.cc:218] Done loading SessionBundle. Took 0 seconds.
I tensorflow_serving/core/loader_harness.cc:118] Successfully loaded servable version {name: default version: 123}
Segmentation fault (core dumped)

What bothers me most is that the statically linked binary is even smaller than the dynamically linked one (223M vs. 224M), which I thought should never happen.

I'll pull the master branch today and try again, in the hope that the commits from the last two weeks fix this problem.

Vesnica commented 8 years ago

Sorry for the delay, but I have some good news: a binary compiled from the current repo (https://github.com/tensorflow/serving/commit/e9d01c00aba8f843a20afb9117c1347a9f4b3b2f) just works!

Its size grew from 224M to 259M, but it runs on Alpine Linux without any dependencies, which lets me cut the deployment package size by half! :tada:

Thanks for the help!

sendit2me commented 7 years ago

Hi @Vesnica, do you have a Dockerfile for your Alpine TensorFlow Serving solution?

Thanks

Vesnica commented 7 years ago

I'll put together an up-to-date one, stay tuned.
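
In the meantime, a minimal sketch of what such an Alpine packaging could look like (the directory layout, image tag, and Alpine version are placeholders, and this is not Vesnica's eventual Dockerfile):

# copy the statically linked binary next to a tiny Dockerfile and build an image
mkdir -p alpine_serving
cp bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server alpine_serving/
cat > alpine_serving/Dockerfile <<'EOF'
FROM alpine:3.6
COPY tensorflow_model_server /usr/local/bin/tensorflow_model_server
ENTRYPOINT ["/usr/local/bin/tensorflow_model_server"]
EOF
docker build -t tf-serving-alpine alpine_serving/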

Vesnica commented 7 years ago

Bad news: the current statically linked tensorflow_model_server segfaults again, while the dynamically linked (default settings) binary works fine.

Reproduce steps:

  1. Modify serving/tensorflow_serving/model_servers/BUILD to add linkopts = ["-static"]
  2. cd serving && bazel build tensorflow_serving/model_servers:tensorflow_model_server
  3. bazel run tensorflow_serving/servables/tensorflow/testdata:export_half_plus_two
  4. bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --model_base_path=/tmp/half_plus_two
  5. Segfault

Logs are shown below:

ubuntu@NIV-AI:~/docker/serving$ bazel-bin/tensorflow_serving/tensorflow_model_server --model_base_path="/tmp/half_plus_two/"
I tensorflow_serving/model_servers/main.cc:118] Building single TensorFlow model file config:  model_name: default model_base_path: /tmp/half_plus_two/ model_version_policy: 0
I tensorflow_serving/model_servers/server_core.cc:337] Adding/updating models.
I tensorflow_serving/model_servers/server_core.cc:383]  (Re-)adding model: default
I tensorflow_serving/core/basic_manager.cc:693] Successfully reserved resources to load servable {name: default version: 123}
I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: default version: 123}
I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: default version: 123}
I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:291] Attempting to up-convert SessionBundle to SavedModelBundle in bundle-shim from: /tmp/half_plus_two/00000123
I external/org_tensorflow/tensorflow/contrib/session_bundle/session_bundle.cc:161] Attempting to load a SessionBundle from: /tmp/half_plus_two/00000123
I external/org_tensorflow/tensorflow/contrib/session_bundle/session_bundle.cc:162] Using RunOptions:
W external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I external/org_tensorflow/tensorflow/contrib/session_bundle/session_bundle.cc:135] Running restore op for SessionBundle: save/restore_all, save/Const:0
Segmentation fault

My computer's specs (produced by lshw): spec.zip

uname -a:

Linux b0f38963d70a 4.4.0-46-generic #67-Ubuntu SMP Thu Oct 20 15:05:12 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
gautamvasudevan commented 6 years ago

Is this still a problem?