Closed: CoinCheung closed this issue 5 years ago
There is one thing that I did not mention: if I replace the docker image tensorflow/serving:latest-gpu with tensorflow/serving:latest, the problem no longer exists.
@CoinCheung Please provide the details below:
I actually used a pretrained model from the official TensorFlow model zoo; the download link is here. From the README of that repository, I know the TensorFlow version is 1.12.0.
As for TensorFlow Serving, I pulled the official docker image tensorflow/serving:latest-gpu.
The TF Serving version is 2.0.0, the CUDA version is 10.0.130, and the cuDNN version is 7.4.1.
@CoinCheung This is primarily a TensorFlow compatibility issue with your GPU, as TensorFlow-GPU is only supported with cuDNN 7 and CUDA 9. Please upgrade your TensorFlow version to 1.14 or 1.15 and it should fix this issue. You can find more detailed information about version compatibility here.
@CoinCheung, can you check the version of nvidia-docker you are using? For CUDA 10.0, you need to use nvidia-docker 2 (https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)).
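For reference, one way to check (a sketch assuming an rpm-based CentOS 7 install and the standard nvidia runtime registration; the CUDA image tag is an assumption):
# Check which nvidia-docker package is installed
rpm -qa | grep -i nvidia-docker
# Confirm the nvidia runtime is registered with the Docker daemon
docker info | grep -i runtime
# Smoke-test GPU access through the runtime
docker run --runtime=nvidia --rm nvidia/cuda:10.0-base nvidia-smi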
The problem is caused by the model being exported with an old version of TensorFlow. Re-exporting it with a newer version solves the problem. I am closing this, thanks for the support!
In order to reproduce the problem, I first downloaded the model from the model zoo:
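Roughly like this (the exact archive was not preserved in this thread; faster_rcnn_resnet101_coco and the local paths below are assumptions, inferred from the resnet_v1_101 nodes in the error further down):
# Hypothetical archive name and paths
wget http://download.tensorflow.org/models/object_detection/faster_rcnn_resnet101_coco_2018_01_28.tar.gz
tar -xzf faster_rcnn_resnet101_coco_2018_01_28.tar.gz
# TF Serving expects a numeric version directory under the model base path
mkdir -p /models/faster_rcnn/1
cp -r faster_rcnn_resnet101_coco_2018_01_28/saved_model/* /models/faster_rcnn/1/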
Then I launched the service:
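Roughly like this (the model name, mount paths, and flags are assumptions; port 8500 is the gRPC port the client later connects to):
docker run --runtime=nvidia --rm -p 8500:8500 \
    --mount type=bind,source=/models/faster_rcnn,target=/models/faster_rcnn \
    -e MODEL_NAME=faster_rcnn \
    -t tensorflow/serving:latest-gpu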
I call the service like this:
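The original client.py is not reproduced here, but based on the traceback below it looks roughly like this sketch (model name, signature, input key, and image handling are assumptions):
import grpc
import numpy as np
import tensorflow as tf
from PIL import Image
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

def main():
    # Connect to the gRPC endpoint exposed by the serving container
    channel = grpc.insecure_channel('localhost:8500')
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'faster_rcnn'               # assumed model name
    request.model_spec.signature_name = 'serving_default'

    # The zoo detection models take a uint8 image tensor of shape [1, H, W, 3];
    # the test image path and input key 'inputs' are assumptions.
    image = np.array(Image.open('test.jpg').convert('RGB'), dtype=np.uint8)
    request.inputs['inputs'].CopyFrom(
        tf.make_tensor_proto(image[np.newaxis, ...]))

    result = stub.Predict(request, 10.)  # 10-second deadline, as in the traceback
    print(result)

if __name__ == '__main__':
    main()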
And I got this error message:
Traceback (most recent call last):
  File "client.py", line 38, in <module>
    main()
  File "client.py", line 33, in main
    result = stub.Predict(request, 10.)
  File "/miniconda3/envs/py36/lib/python3.6/site-packages/grpc/_channel.py", line 565, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/miniconda3/envs/py36/lib/python3.6/site-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/conv1/Conv2D}}]]
[[SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/Switch_5/_1117]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/conv1/Conv2D}}]]
0 successful operations.
0 derived errors ignored."
debug_error_string = "{"created":"@1572596191.210766189","description":"Error received from peer ipv6:[::1]:8500","file":"src/core/lib/surface/call.cc","file_line":1052,"grpc_message":"2 root error(s) found.\n (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.\n\t [[{{node FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/conv1/Conv2D}}]]\n\t [[SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/Switch_5/_1117]]\n (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.\n\t [[{{node FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/conv1/Conv2D}}]]\n0 successful operations.\n0 derived errors ignored.","grpc_status":2}"
Besides, on the server side, I got this error message:
2019-11-01 08:14:38.127754: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-11-01 08:14:38.143970: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-11-01 08:16:31.206414: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-11-01 08:16:31.210188: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
My environment:
OS: CentOS Linux 7 (Core)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
CMake version: 3.15.0-rc4
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
GPU 3: GeForce RTX 2080 Ti
GPU 4: GeForce RTX 2080 Ti
GPU 5: GeForce RTX 2080 Ti
GPU 6: GeForce RTX 2080 Ti
GPU 7: GeForce RTX 2080 Ti
Nvidia driver version: 418.74
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.17.2
[conda] blas 1.0 mkl defaults
[conda] mkl 2019.4 243 defaults
[conda] mkl-service 2.3.0 py36he904b0f_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl_fft 1.0.12 py36ha843d7b_0 defaults
[conda] mkl_random 1.0.2 py36hd81dba3_0 defaults