tensorflow / serving

A flexible, high-performance serving system for machine learning models
https://www.tensorflow.org/serving
Apache License 2.0

Can't specify single gpu on multi-box GPU #779

Closed TonyChouZJU closed 5 years ago

TonyChouZJU commented 6 years ago

I successfully built TF Serving with

bazel build -c opt --config=cuda tensorflow_serving/...

But when I launch a TF Serving model, all 4 GPUs allocate all of their available memory.

 bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server  --port=9000 --model_name=inception_gpu --model_base_path=./tf_servables/inception/inception_gpu 

Log

2018-02-23 16:58:28.822617: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1208] Found device 3 with properties: 
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:83:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-02-23 16:58:28.825720: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1223] Device peer to peer matrix
2018-02-23 16:58:28.825818: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1229] DMA: 0 1 2 3 
2018-02-23 16:58:28.825830: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1239] 0:   Y Y N N 
2018-02-23 16:58:28.825838: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1239] 1:   Y Y N N 
2018-02-23 16:58:28.825846: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1239] 2:   N N Y Y 
2018-02-23 16:58:28.825853: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1239] 3:   N N Y Y 
2018-02-23 16:58:28.825866: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1308] Adding visible gpu devices: 0, 1, 2, 3
2018-02-23 16:58:31.573712: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15130 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:02:00.0, compute capability: 6.0)
2018-02-23 16:58:31.881263: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 8561 MB memory) -> physical GPU (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:03:00.0, compute capability: 6.0)
2018-02-23 16:58:32.083473: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 15130 MB memory) -> physical GPU (device: 2, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0, compute capability: 6.0)
2018-02-23 16:58:32.374781: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 15130 MB memory) -> physical GPU (device: 3, name: Tesla P100-PCIE-16GB, pci bus id: 0000:83:00.0, compute capability: 6.0)
2018-02-23 16:58:32.763353: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:159] Restoring SavedModel bundle.
2018-02-23 16:58:33.022217: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:194] Running LegacyInitOp on SavedModel bundle.
2018-02-23 16:58:33.105502: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:289] SavedModel load for tags { serve }; Status: success. Took 9150846 microseconds.
2018-02-23 16:58:33.105750: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: inception_gpu version: 1}
2018-02-23 16:58:33.122060: I tensorflow_serving/model_servers/main.cc:280] Running ModelServer at 0.0.0.0:9000 ...

And I checked it with nvidia-smi:

Fri Feb 23 17:04:47 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   30C    P0    30W / 250W |  15481MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:03:00.0 Off |                    0 |
| N/A   31C    P0    31W / 250W |  15826MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 00000000:82:00.0 Off |                    0 |
| N/A   30C    P0    31W / 250W |  15481MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  Off  | 00000000:83:00.0 Off |                    0 |
| N/A   30C    P0    32W / 250W |  15481MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     18242      C   ...g/model_servers/tensorflow_model_server 15471MiB |
|    1     15809      C   python                                      6915MiB |
|    1     18242      C   ...g/model_servers/tensorflow_model_server  8901MiB |
|    2     18242      C   ...g/model_servers/tensorflow_model_server 15471MiB |
|    3     18242      C   ...g/model_servers/tensorflow_model_server 15471MiB |
+-----------------------------------------------------------------------------+

(PID 15809 can be ignored; it belongs to an unrelated process.)

I exported the TF Serving model following the Inception instructions and tried to restrict it to a single GPU, e.g. the 4th one, via gpu_options:

# Restore variables from training checkpoint.
variable_averages = tf.train.ExponentialMovingAverage(
    inception_model.MOVING_AVERAGE_DECAY)
variables_to_restore = variable_averages.variables_to_restore()
saver = tf.train.Saver(variables_to_restore)
config = tf.ConfigProto(
    device_count={
        'GPU': 1
    },
    gpu_options={
        'allow_growth': 1,
        # 'per_process_gpu_memory_fraction': 0.01
        'visible_device_list': "3"
    },
    allow_soft_placement=True,
    log_device_placement=False
)
with tf.Session(config=config) as sess:
    # Restore variables from training checkpoints.
    ckpt = tf.train.get_checkpoint_state(FLAGS.checkpoint_dir)

Log

2018-02-23 17:31:29.760189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:83:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-02-23 17:31:29.760217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 3, name: Tesla P100-PCIE-16GB, pci bus id: 0000:83:00.0, compute capability: 6.0)
Successfully loaded model from /home/tonychou.zyb/Tensorflow/ModelZoo/tf_checkpoints/inception/20160301/model.ckpt-157585 at step=157585.

I also tried

import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

Neither works.

Could anyone help me figure this out? I need to be able to specify both which GPU is used and how much GPU memory is allocated. Thanks in advance!

vitalyli commented 6 years ago

The only way at the moment is to launch multiple Docker containers, one for each GPU, and to limit the scope so that each container only sees one GPU. Here is an example for a machine with 4 GPUs:

NV_GPU=0 nvidia-docker run -d -v /tmp/data:/data -p 9100:9100 --name tensorflow_serving_gpu0 -it c0a8204d1ead
NV_GPU=1 nvidia-docker run -d -v /tmp/data:/data -p 9101:9100 --name tensorflow_serving_gpu1 -it c0a8204d1ead
NV_GPU=2 nvidia-docker run -d -v /tmp/data:/data -p 9102:9100 --name tensorflow_serving_gpu2 -it c0a8204d1ead
NV_GPU=3 nvidia-docker run -d -v /tmp/data:/data -p 9103:9100 --name tensorflow_serving_gpu3 -it c0a8204d1ead

TonyChouZJU commented 6 years ago

@vitalyli I think that approach is not very flexible. Thanks anyway.

Li-Shu14 commented 6 years ago

@TonyChouZJU When you want to restrict GPU usage, you should set flags and write the configuration for tensorflow_model_server itself. I recommend you look at #836, which may answer your question. By the way, if you want to specify which GPU to serve on, you can simply prefix your command with CUDA_VISIBLE_DEVICES=0. Options like per_process_gpu_memory_fraction, allow_growth and allow_soft_placement should be written in a platform_config_file. You are currently setting these options in the export script, so they only take effect during export, not during serving. The final command to run the server should look like this: CUDA_VISIBLE_DEVICES=0 bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9000 --model_name=inception_gpu --model_base_path=./tf_servables/inception/inception_gpu --platform_config_file=platform_session.conf
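
For reference, a minimal platform_session.conf could look roughly like the sketch below. This is only a guess at the text-format PlatformConfigMap that --platform_config_file expects (see #836 for the authoritative discussion), and the memory fraction is a placeholder you would tune for your model:

# Hypothetical platform_session.conf (text-format PlatformConfigMap); values are placeholders.
platform_configs {
  key: "tensorflow"
  value {
    source_adapter_config {
      [type.googleapis.com/tensorflow.serving.SavedModelBundleSourceAdapterConfig] {
        legacy_config {
          session_config {
            gpu_options {
              # Limit the per-process GPU memory and allocate lazily.
              per_process_gpu_memory_fraction: 0.3
              allow_growth: true
            }
            allow_soft_placement: true
          }
        }
      }
    }
  }
}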

sugartom commented 6 years ago

@Li-Shu14, hi, I would like to ask whether you have any experience running a separate TensorFlow Serving server for each GPU.

I have a machine with two 1080 Tis. My TF Serving build correctly identifies both GPUs when CUDA_VISIBLE_DEVICES is not set. I was trying to run one TF Serving server per GPU, so I opened two terminals and ran the following two commands:

CUDA_VISIBLE_DEVICES=0 bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9000 --model_name=inception --model_base_path=/path/to/inception_model

CUDA_VISIBLE_DEVICES=1 bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9001 --model_name=mnist --model_base_path=/path/to/mnist_model

The first command runs fine, but the second one fails with the following error:

terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
[1] 4021 abort (core dumped)  CUDA_VISIBLE_DEVICES=1 --port=9001 --model_name=mnist

Any ideas or suggestions? Thanks in advance!

Li-Shu14 commented 6 years ago

@sugartom Running a separate server for each GPU is feasible in my experience, but I have never encountered the error you describe. I wonder whether the error also occurs when the two commands are run in the other order. If it does, meaning you can only run one server at a time, it is likely a hardware or system-level problem. Maybe "Resource temporarily unavailable" has something to do with the number of CPU cores (or some setting that limits their use)? Sorry, I cannot think of more ideas on your problem.
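
As a side note, std::system_error with "Resource temporarily unavailable" usually means a thread or process could not be created (EAGAIN), so one guess worth checking is the per-user limits of the shell that launches the second server, e.g.:

# Inspect the limits in the shell that starts the second model server
ulimit -u    # max user processes/threads
ulimit -n    # max open file descriptors

# Optionally raise them for this shell before launching (example values)
ulimit -u 65535
ulimit -n 65535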

sugartom commented 6 years ago

@Li-Shu14, hi, thanks for your reply! I did try running the two commands in the other order, namely starting one server on gpu:1 first and then another server on gpu:0. Again, the first server (on gpu:1) runs normally, but the second server (on gpu:0) fails, so I think the order is not the reason for the failure. I will check whether the number of CPU cores is the reason. Thanks again for your suggestions! :-)

Harshini-Gadige commented 5 years ago

@TonyChouZJU - Hi, is this still an issue? If not, please feel free to close this. Thanks!

echan00 commented 5 years ago

I have tried both approaches from this document to expose only one GPU to Docker, but somehow all 6 of my GPUs are being used:

docker run --runtime=nvidia -p 8500:8500 \
  --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 \
  --mount type=bind,source=/home/eee/T2T_Model/t2t_train/translate_enzh_wmt32k/transformer-transformer_base/export/,target=/models/my_model \
  --mount type=bind,source=/home/eee/T2T_Model/batching.conf,target=/models/batching.conf \
  -e MODEL_NAME=my_model -t tensorflow/serving:latest-gpu --batching_parameters_file=/models/batching.conf --enable_batching

NV_GPU=0 docker run --runtime=nvidia -p 8500:8500 \
  --mount type=bind,source=/home/eee/T2T_Model/t2t_train/translate_enzh_wmt32k/transformer-transformer_base/export/,target=/models/my_model \
  --mount type=bind,source=/home/eee/T2T_Model/batching.conf,target=/models/batching.conf \
  -e MODEL_NAME=my_model -t tensorflow/serving:latest-gpu --batching_parameters_file=/models/batching.conf --enable_batching

Am I missing something?

SOLVED: This command solved my problem:

docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -p 8500:8500 \
  --mount type=bind,source=/home/eee/T2T_Model/t2t_train/translate_enzh_wmt32k/transformer-transformer_base/export/,target=/models/my_model \
  --mount type=bind,source=/home/eee/T2T_Model/batching.conf,target=/models/batching.conf \
  -e MODEL_NAME=my_model -t tensorflow/serving:latest-gpu --batching_parameters_file=/models/batching.conf --enable_batching

echan00 commented 5 years ago

The only way at the moment is to launch multiple Docker containers, one for each GPU, and to limit the scope so that each container only sees one GPU. Here is an example for a machine with 4 GPUs:

NV_GPU=0 nvidia-docker run -d -v /tmp/data:/data -p 9100:9100 --name tensorflow_serving_gpu0 -it c0a8204d1ead
NV_GPU=1 nvidia-docker run -d -v /tmp/data:/data -p 9101:9100 --name tensorflow_serving_gpu1 -it c0a8204d1ead
NV_GPU=2 nvidia-docker run -d -v /tmp/data:/data -p 9102:9100 --name tensorflow_serving_gpu2 -it c0a8204d1ead
NV_GPU=3 nvidia-docker run -d -v /tmp/data:/data -p 9103:9100 --name tensorflow_serving_gpu3 -it c0a8204d1ead

@vitalyli I'm trying to do the same thing as you by launching multiple Docker containers of the same image. However, I am only able to get one Docker container to do inference; the rest of my clients give me this error:

error: <_Rendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "OS Error"
        debug_error_string = "{"created":"@1547479281.497664676","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1036,"grpc_message":"OS Error","grpc_status":14}"

These are the commands I use to launch the Docker containers:

docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -p 8500:8500 \
  --mount type=bind,source=/T2T_Model/t2t_train/translate_enzh_wmt32k/transformer-transformer_base/export/,target=/models/my_model \
  --mount type=bind,source=/T2T_Model/batching.conf,target=/models/batching.conf \
  -e MODEL_NAME=my_model -t tensorflow/serving:latest-gpu --batching_parameters_file=/models/batching.conf --enable_batching

docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1 -p 8510:8510 \
  --mount type=bind,source=/T2T_Model/t2t_train/translate_enzh_wmt32k/transformer-transformer_base/export/,target=/models/my_model \
  --mount type=bind,source=/T2T_Model/batching.conf,target=/models/batching.conf \
  -e MODEL_NAME=my_model -t tensorflow/serving:latest-gpu --batching_parameters_file=/models/batching.conf --enable_batching

These are the commands I use to launch my clients for inference:

t2t-query-server \
  --server=0.0.0.0:8500 \
  --servable_name=my_model \
  --problem=translate_enzh_wmt32k \
  --data_dir=/T2T_Model/t2t_data/1 \
  --timeout_secs=30 \

t2t-query-server \
  --server=0.0.0.0:8510 \
  --servable_name=my_model \
  --problem=translate_enzh_wmt32k \
  --data_dir=/T2T_Model/t2t_data/1 \
  --timeout_secs=30

What could I possibly be doing wrong?

UPDATE: The problem seems to be that gRPC listens on 0.0.0.0:8500 inside every container, regardless of the host port mapping I pass with docker run -p 8510:8510: tensorflow_serving/model_servers/server.cc:286] Running gRPC ModelServer at 0.0.0.0:8500 ...

SOLVED: map the host port to the container's fixed port 8500 instead: docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -p 8510:8500
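
For completeness, applying that fix to the two launch commands above would look something like this (same mounts and flags as before; each host port now forwards to the fixed container port 8500):

docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -p 8500:8500 \
  --mount type=bind,source=/T2T_Model/t2t_train/translate_enzh_wmt32k/transformer-transformer_base/export/,target=/models/my_model \
  --mount type=bind,source=/T2T_Model/batching.conf,target=/models/batching.conf \
  -e MODEL_NAME=my_model -t tensorflow/serving:latest-gpu --batching_parameters_file=/models/batching.conf --enable_batching

docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1 -p 8510:8500 \
  --mount type=bind,source=/T2T_Model/t2t_train/translate_enzh_wmt32k/transformer-transformer_base/export/,target=/models/my_model \
  --mount type=bind,source=/T2T_Model/batching.conf,target=/models/batching.conf \
  -e MODEL_NAME=my_model -t tensorflow/serving:latest-gpu --batching_parameters_file=/models/batching.conf --enable_batching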

Harshini-Gadige commented 5 years ago

Closing this as it has been in "awaiting response" status for more than a week. Feel free to add comments (if any) and we will reopen the issue. Thanks!