tensorflow / serving

A flexible, high-performance serving system for machine learning models
Apache License 2.0

There is a problem with the official serving Docker image GPU version #1474

Closed CoinCheung closed 5 years ago

CoinCheung commented 5 years ago

To reproduce the problem, I first downloaded the model from the model zoo:

wget -c http://download.tensorflow.org/models/object_detection/faster_rcnn_resnet101_coco_2018_01_28.tar.gz
tar -zxvf *tar.gz
mkdir -p faster/1
mv faster_rcnn_resnet101_coco_2018_01_28/saved_model/* faster/1/
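
TensorFlow Serving looks for integer-named version subdirectories under the model directory, so it can be worth confirming the layout before launching the container. The following is only a minimal sketch: the helper name `find_saved_model_versions` is hypothetical, and the demo builds a throwaway directory that mimics `faster/1/` rather than touching the real download.

```python
import os
import tempfile


def find_saved_model_versions(model_dir):
    """Return the numeric version subdirectories that contain a saved_model.pb."""
    versions = []
    for name in sorted(os.listdir(model_dir)):
        path = os.path.join(model_dir, name)
        # Only integer-named subdirectories are treated as model versions
        if name.isdigit() and os.path.isfile(os.path.join(path, "saved_model.pb")):
            versions.append(int(name))
    return versions


# Demo on a temporary directory that mimics the faster/1/ layout
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "faster", "1"))
    open(os.path.join(tmp, "faster", "1", "saved_model.pb"), "wb").close()
    demo_versions = find_saved_model_versions(os.path.join(tmp, "faster"))
    print(demo_versions)
```

On the real directory, the call would be `find_saved_model_versions("faster")`; an empty list would suggest the `mv` step did not put `saved_model.pb` where the server expects it.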

Then I launched the service:

nvidia-docker run -p 8500:8500 -p 8501:8501 --name faster-serving --mount type=bind,source=`pwd`/faster,target=/models/faster -e MODEL_NAME=faster -t tensorflow/serving:latest-gpu
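
Because the launch command is long, here is a rough sketch that assembles the same arguments piece by piece, which makes it easier to see what each flag does and to swap the image tag. The function name `build_serving_command` is hypothetical; the flags and values are the ones from the command above.

```python
import os
import shlex


def build_serving_command(model_name, model_dir, image="tensorflow/serving:latest-gpu"):
    """Build the argument list used to launch the serving container."""
    return [
        "nvidia-docker", "run",
        "-p", "8500:8500",  # gRPC port
        "-p", "8501:8501",  # REST port
        "--name", f"{model_name}-serving",
        # Bind-mount the local model directory to /models/<name> in the container
        "--mount", f"type=bind,source={model_dir},target=/models/{model_name}",
        "-e", f"MODEL_NAME={model_name}",
        "-t", image,
    ]


cmd = build_serving_command("faster", os.path.join(os.getcwd(), "faster"))
# Print a shell-safe version of the command instead of running it
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing `image="tensorflow/serving:latest"` gives the CPU variant of the same command, which is handy for comparing the two images.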

I call the service like this:

import grpc
import cv2
import tensorflow as tf

from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

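def main():
    server = 'localhost:8500'
    channel = grpc.insecure_channel(server)
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
    impth = 'example.jpg'  # placeholder path; the original post does not define impth
    data = cv2.imread(impth)  # HWC uint8 array in BGR order
    shape = [1, ] + list(data.shape)  # prepend the batch dimension

    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'faster'
    request.model_spec.signature_name = 'serving_default'
    request.model_spec.version.value = 1

    # The opening of this call was lost in the post; 'inputs' is the assumed input key
    request.inputs['inputs'].CopyFrom(
        tf.make_tensor_proto(data, shape=shape))
    result = stub.Predict(request, 10.)  # 10-second timeout

if __name__ == "__main__":
    main()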
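
Before sending a gRPC request, it can help to confirm over the REST port (8501, published in the launch command) that the model actually loaded, and to look up the signature's input names. The sketch below uses only the standard library; the helper names are hypothetical, while the `/v1/models/<name>` and `/v1/models/<name>/metadata` paths are TensorFlow Serving's REST status and metadata endpoints.

```python
import json
import urllib.request


def model_status_url(model_name, host="localhost", rest_port=8501):
    """URL of the model status endpoint."""
    return f"http://{host}:{rest_port}/v1/models/{model_name}"


def model_metadata_url(model_name, host="localhost", rest_port=8501):
    """URL of the model metadata endpoint (includes the signature definitions)."""
    return model_status_url(model_name, host, rest_port) + "/metadata"


def fetch_json(url, timeout=5.0):
    """GET a URL and decode the JSON body; raises if the server is unreachable."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


print(model_status_url("faster"))
print(model_metadata_url("faster"))
# With the container running, the status could be fetched like this:
# status = fetch_json(model_status_url("faster"))
```

The metadata response is also a way to check which input key the `serving_default` signature expects before building the `PredictRequest`.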

And I got this error message:

Traceback (most recent call last):
  File "client.py", line 38, in <module>
    main()
  File "client.py", line 33, in main
    result = stub.Predict(request, 10.)
  File "/miniconda3/envs/py36/lib/python3.6/site-packages/grpc/_channel.py", line 565, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/miniconda3/envs/py36/lib/python3.6/site-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
    status = StatusCode.UNKNOWN
    details = "2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/conv1/Conv2D}}]]
     [[SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/Switch_5/_1117]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/conv1/Conv2D}}]]
0 successful operations.
0 derived errors ignored."
    debug_error_string = "{"created":"@1572596191.210766189","description":"Error received from peer ipv6:[::1]:8500","file":"src/core/lib/surface/call.cc","file_line":1052,"grpc_message":"2 root error(s) found.\n (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.\n\t [[{{node FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/conv1/Conv2D}}]]\n\t [[SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/Switch_5/_1117]]\n (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.\n\t [[{{node FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/conv1/Conv2D}}]]\n0 successful operations.\n0 derived errors ignored.","grpc_status":2}"
>

On the server side, I also got this error message:

2019-11-01 08:14:38.127754: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-11-01 08:14:38.143970: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-11-01 08:16:31.206414: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-11-01 08:16:31.210188: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

My environment:

OS: CentOS Linux 7 (Core)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
CMake version: version 3.15.0-rc4

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
GPU 3: GeForce RTX 2080 Ti
GPU 4: GeForce RTX 2080 Ti
GPU 5: GeForce RTX 2080 Ti
GPU 6: GeForce RTX 2080 Ti
GPU 7: GeForce RTX 2080 Ti

Nvidia driver version: 418.74
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.17.2
[conda] blas 1.0 mkl defaults
[conda] mkl 2019.4 243 defaults
[conda] mkl-service 2.3.0 py36he904b0f_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl_fft 1.0.12 py36ha843d7b_0 defaults
[conda] mkl_random 1.0.2 py36hd81dba3_0 defaults

CoinCheung commented 5 years ago

There is one thing that I did not mention:

If I switch the docker image from tensorflow/serving:latest-gpu to tensorflow/serving:latest, the problem no longer occurs.

gowthamkpr commented 5 years ago

@CoinCheung Please provide the details below:

  1. TensorFlow version:
  2. TensorFlow Serving version:
  3. CUDA version:
  4. cuDNN version:

CoinCheung commented 5 years ago

I actually used the pretrained model from the official TensorFlow model zoo; the download link is here. From the README of that repository, I know that the TensorFlow version is 1.12.0.

As for TensorFlow Serving, I pulled the official docker image tensorflow/serving:latest-gpu. The tf-serving version is 2.0.0, the CUDA version is 10.0.130, and the cuDNN version is 7.4.1.

gowthamkpr commented 5 years ago

@CoinCheung This is primarily a Tensorflow compatibility issue with your GPU as Tensorflow-gpu is only supported by cuDNN 7 and CUDA 9. Please upgrade your tensorflow version to 1.14 or 1.15 and it should fix this issue. You can find more detailed information about version compatibility here.
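
As a rough illustration of the check implied here, the sketch below compares the TensorFlow version a model was exported with against a minimum version. The helper names are hypothetical, and the default threshold of 1.14 is simply the version suggested in this reply, not an official compatibility rule.

```python
def parse_version(text):
    """Turn a version string such as '1.12.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def exported_before(export_version, minimum="1.14"):
    """Return True if the exporting TensorFlow version is older than `minimum`."""
    return parse_version(export_version) < parse_version(minimum)


# 1.12.0 is the export version reported for the model zoo checkpoint in this thread
print(exported_before("1.12.0"))
print(exported_before("1.15.0"))
```

Comparing tuples of integers avoids the trap of comparing version strings directly, where "1.9" would sort after "1.14".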

wdirons commented 5 years ago

@CoinCheung, can you check the version of nvidia-docker you are using? For CUDA 10.0, you need to use nvidia-docker 2. (https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0))

CoinCheung commented 5 years ago

The problem was caused by the model having been exported from an old version of TensorFlow. Using a newer version solves the problem. I am closing this; thanks for the support!