tensorflow / tfx

TFX is an end-to-end platform for deploying production ML pipelines
https://tensorflow.org/tfx
Apache License 2.0
2.11k stars 708 forks

tensorflow_model_server not running when running the TensorFlow Serving example code on Databricks? #2222

Closed umusa closed 3 years ago

umusa commented 4 years ago

Hi, I am trying to run the TFX example code at https://www.tensorflow.org/tfx/tutorials/serving/rest_simple on a Databricks GPU cluster with env:

  7.1 ML Spark 3.0.0 Scala 2.12 GPU

For the cell:

 %sh
 echo "deb http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | tee /etc/apt/sources.list.d/tensorflow-serving.list && \
 curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | apt-key add -
 apt update

I got:

```
deb http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
Warning: apt-key output should not be parsed (stdout is not a terminal)
100  2943  100  2943    0     0  22127      0 --:--:-- --:--:-- --:--:-- 22127
OK

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Hit:1 http://storage.googleapis.com/tensorflow-serving-apt stable InRelease
Hit:2 http://security.ubuntu.com/ubuntu bionic-security InRelease
Hit:3 http://archive.ubuntu.com/ubuntu bionic InRelease
Hit:4 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
Hit:5 http://archive.ubuntu.com/ubuntu bionic-backports InRelease
Reading package lists...
Building dependency tree...
Reading state information...
20 packages can be upgraded. Run 'apt list --upgradable' to see them.
```

For the cell:

       %%bash --bg
       nohup tensorflow_model_server \
         --rest_api_port=8501 \
         --model_name=fashion_model \
         --model_base_path="${MODEL_DIR}" >server.log 2>&1

I got no output.
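Since the background cell swallows any startup error, a sanity check I can run first from a `%sh` cell, to see whether something on the driver is already listening on the target port (my own bash-only sketch using `/dev/tcp`; nothing here is TF-specific):

```shell
# Check whether anything is already bound to the REST port on this
# machine (the Databricks driver) before launching tensorflow_model_server.
# Uses bash's built-in /dev/tcp pseudo-device, so no extra tools are needed.
PORT=8501
if (exec 3<>"/dev/tcp/127.0.0.1/${PORT}") 2>/dev/null; then
  exec 3>&-   # close the probe connection
  echo "port ${PORT} is already in use; try another, e.g. 8502"
else
  echo "port ${PORT} is free"
fi
```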

For the cell:

        %sh
       tail server.log

I got:

```
2020-07-26 18:09:08.235722: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:234] Restoring SavedModel bundle.
2020-07-26 18:09:08.262389: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:183] Running initialization op on SavedModel bundle at path: /tmp/1
2020-07-26 18:09:08.268744: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:364] SavedModel load for tags { serve }; Status: success: OK. Took 58757 microseconds.
2020-07-26 18:09:08.269224: I tensorflow_serving/servables/tensorflow/saved_model_warmup.cc:105] No warmup data file found at /tmp/1/assets.extra/tf_serving_warmup_requests
2020-07-26 18:09:08.269580: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: fashion_model version: 1}
2020-07-26 18:09:08.271348: I tensorflow_serving/model_servers/server.cc:355] Running gRPC ModelServer at 0.0.0.0:8500 ...
[evhttp_server.cc : 223] NET_LOG: Couldn't bind to port 8501
[evhttp_server.cc : 63] NET_LOG: Server has not been terminated. Force termination now.
[evhttp_server.cc : 258] NET_LOG: Server is not running ...
2020-07-26 18:09:08.272867: E tensorflow_serving/model_servers/server.cc:377] Failed to start HTTP Server at localhost:8501
```



It seems that "tensorflow_model_server" is not bound to port 8501.

When I run the same cells on Colab, for the cell:

       !echo "deb http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | tee /etc/apt/sources.list.d/tensorflow-serving.list && \
       curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | apt-key add -
       !apt update

I got:

       deb http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2943  100  2943    0     0  10740      0 --:--:-- --:--:-- --:--:-- 10740
OK
Get:1 http://storage.googleapis.com/tensorflow-serving-apt stable InRelease [3,012 B]
Get:2 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease [21.3 kB]
Get:3 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Hit:4 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:5 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:6 http://ppa.launchpad.net/marutter/c2d4u3.5/ubuntu bionic InRelease [15.4 kB]
Get:7 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:8 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/ InRelease [3,626 B]
Ign:9 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Ign:10 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Get:11 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release [697 B]
Hit:12 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:13 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release.gpg [836 B]
Get:14 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic/main amd64 Packages [43.3 kB]
Get:15 http://storage.googleapis.com/tensorflow-serving-apt stable/tensorflow-model-server amd64 Packages [341 B]
Get:16 http://storage.googleapis.com/tensorflow-serving-apt stable/tensorflow-model-server-universal amd64 Packages [349 B]
Get:17 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [870 kB]
Get:18 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [1,023 kB]
Get:19 http://security.ubuntu.com/ubuntu bionic-security/restricted amd64 Packages [98.3 kB]
Get:20 http://security.ubuntu.com/ubuntu bionic-security/multiverse amd64 Packages [9,279 B]
Get:21 http://ppa.launchpad.net/marutter/c2d4u3.5/ubuntu bionic/main Sources [1,851 kB]
Get:22 http://ppa.launchpad.net/marutter/c2d4u3.5/ubuntu bionic/main amd64 Packages [893 kB]
Get:23 http://archive.ubuntu.com/ubuntu bionic-updates/multiverse amd64 Packages [13.6 kB]
Get:24 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages [1,322 kB]
Get:25 http://archive.ubuntu.com/ubuntu bionic-updates/restricted amd64 Packages [111 kB]
Get:26 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages [1,409 kB]
Get:27 http://archive.ubuntu.com/ubuntu bionic-backports/universe amd64 Packages [8,432 B]
Ign:29 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages
Get:29 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages [232 kB]
Fetched 8,182 kB in 2s (3,337 kB/s)
Reading package lists... Done
Building dependency tree       
Reading state information... Done
111 packages can be upgraded. Run 'apt list --upgradable' to see them.

For the cell:
      !tail server.log

I got:

      2020-07-26 17:29:33.452466: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-26 17:29:33.468274: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:234] Restoring SavedModel bundle.
2020-07-26 17:29:33.490536: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:183] Running initialization op on SavedModel bundle at path: /tmp/1
2020-07-26 17:29:33.495508: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:364] SavedModel load for tags { serve }; Status: success: OK. Took 44529 microseconds.
2020-07-26 17:29:33.496013: I tensorflow_serving/servables/tensorflow/saved_model_warmup.cc:105] No warmup data file found at /tmp/1/assets.extra/tf_serving_warmup_requests
2020-07-26 17:29:33.496125: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: fashion_model version: 1}
2020-07-26 17:29:33.497246: I tensorflow_serving/model_servers/server.cc:355] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2020-07-26 17:29:33.498036: I tensorflow_serving/model_servers/server.cc:375] Exporting HTTP/REST API at:localhost:8501 ...
[evhttp_server.cc : 238] NET_LOG: Entering the event loop ...

--------------------

When I tried to access 

         localhost:8501

from my laptop,
I got:
            This site can’t be reached 
            localhost refused to connect.

I am confused: where is this localhost located?

Also, why is server.log different when running the same code on Colab and on Databricks? Why do I get

      NET_LOG: Server is not running

when I run it on Databricks, while on Colab it succeeds with:

      Exporting HTTP/REST API at:localhost:8501 ...
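
If I understand it right, localhost in the notebook is the notebook's own VM (the Colab VM or the Databricks driver), not my laptop. Assuming TF Serving's standard model-status endpoint (`/v1/models/<name>`), a check from a cell inside the same notebook would be something like:

```shell
# Query TF Serving's model-status endpoint from the machine the notebook
# kernel runs on; from my laptop, "localhost" is a different machine,
# which is why the browser there gets "connection refused".
curl -s http://localhost:8501/v1/models/fashion_model \
  || echo "no server listening on port 8501 of this machine"
# A healthy server responds with JSON like:
#   {"model_version_status": [{"version": "1", "state": "AVAILABLE", ...}]}
```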

thanks
zhitaoli commented 4 years ago

This seems like a problem with TensorFlow Serving. Can you file an issue with that team instead?

Please note that the TensorFlow org may not have much experience with the Databricks environment, if this is related to that.

cc @misterpeddy to see what's the best way to report issues for TF serving.

misterpeddy commented 4 years ago

This issue is indeed best suited to the TF Serving GitHub repo. However, looking at the specifics here, I don't think folks there will be able to help much. The issue you're running into is being unable to bind to a certain port and start up a server in your environment. Databricks is not a supported environment for TF Serving, so folks will not be able to troubleshoot for you.

My suggestion would be to figure out how you'd bring up a simple server (not TF Serving) in the Databricks environment (I'd imagine they have guides/tutorials) and, once you're successful, apply the same settings to TF Serving (maybe you need to run it on a different port than 8501, or turn on some sort of proxy/firewall setting in your Databricks environment?). Hope that helps.
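For example (purely illustrative, not a tested Databricks recipe; MODEL_DIR, the port number, and the log path are placeholders), trying an alternate REST port might look like:

```shell
# Illustrative sketch: launch TF Serving with the REST API on 8502
# instead of 8501, which failed to bind on the Databricks driver.
# MODEL_DIR and the port number are assumptions for this environment.
export MODEL_DIR=/tmp
nohup tensorflow_model_server \
  --rest_api_port=8502 \
  --model_name=fashion_model \
  --model_base_path="${MODEL_DIR}" >server.log 2>&1 &
sleep 2
# On success the log should end with something like:
#   "Exporting HTTP/REST API at:localhost:8502 ..."
tail server.log
```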

umusa commented 4 years ago

It seems that https://github.com/tensorflow/tfx/issues/github.com/tesnorflow/serving cannot be accessed. thanks

umusa commented 4 years ago

TF and TFX are wonderful tools that may help us with the whole job of building an ML platform.

Databricks' backend is Apache Hadoop, Apache Spark, Ubuntu, and Python, all of which are popular tools and packages. Databricks also supports TF by installing it in their environment; they have updated TF to 2.2 in their latest release.

Through the Databricks UI, we access AWS EC2 instances, and we can install Python and Java/Scala libraries and packages on Databricks. So our code will ultimately run on AWS.

Could you please let me know which cloud systems, besides Google Cloud, TFX can work well on?

thanks

zhitaoli commented 4 years ago

It seems that https://github.com/tensorflow/tfx/issues/github.com/tesnorflow/serving cannot be accessed. thanks

The link should be https://github.com/tensorflow/serving/issues

Taking a shot at the rest:

TF and TFX are wonderful tools that may help us with the whole job of building an ML platform.

Databricks' backend is Apache Hadoop, Apache Spark, Ubuntu, and Python. All of them are popular tools and packages.

TFX runs on Apache Beam, which provides a runner for Apache Spark. However, we don't have real experience with using or testing this runner yet, so there could be very rough edges, especially when it comes to low-level deployment of the workers.

We have one intern testing the same runner on top of Apache Flink, but that work should be viewed as exploratory at this point.

Databricks also supports TF by installing it in their environment; they have updated TF to 2.2 in their latest release.

Through the Databricks UI, we access AWS EC2 instances, and we can install Python and Java/Scala libraries and packages on Databricks. So our code will ultimately run on AWS.

Could you please let me know which cloud systems, besides Google Cloud, TFX can work well on?

Given our limited bandwidth, we are only targeting the following combinations:

If there is a significant partner possibility, we are happy to work with the external community to expand TFX's reach, but that is not something we can take on by ourselves right now.

Hope this is helpful.

thanks

google-ml-butler[bot] commented 3 years ago

Are you satisfied with the resolution of your issue?