triton-inference-server / fastertransformer_backend

BSD 3-Clause "New" or "Revised" License

E0315 1107 server.cc:201] Failed to finalize CUDA memory manager: CNMEM_STATUS_CUDA_ERROR #103

Open WangYizhang01 opened 1 year ago

WangYizhang01 commented 1 year ago

Description

branch: dev/t5_gptj_blog
triton version: 22.03
GPU: A100-40G

Reproduced Steps

I followed the steps in https://github.com/triton-inference-server/fastertransformer_backend/blob/dev/t5_gptj_blog/notebooks/GPT-J_and_T5_inference.ipynb. The Triton version I am using is 22.03, because the driver version on my machine is 510. Everything went well until starting tritonserver:

CUDA_VISIBLE_DEVICES=0,1 /opt/tritonserver/bin/tritonserver  --model-repository=./triton-model-store/gptj/. 

I get the error `E0315 1107 server.cc:201] Failed to finalize CUDA memory manager: CNMEM_STATUS_CUDA_ERROR`, and the server then hangs. The output is as follows:

```text
I0315 09:22:49.889551 1475 libtorch.cc:1309] TRITONBACKEND_Initialize: pytorch
I0315 09:22:49.889660 1475 libtorch.cc:1319] Triton TRITONBACKEND API version: 1.8
I0315 09:22:49.889668 1475 libtorch.cc:1325] 'pytorch' TRITONBACKEND API version: 1.8
2023-03-15 09:22:50.139130: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-03-15 09:22:50.183355: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0315 09:22:50.183496 1475 tensorflow.cc:2176] TRITONBACKEND_Initialize: tensorflow
I0315 09:22:50.183525 1475 tensorflow.cc:2186] Triton TRITONBACKEND API version: 1.8
I0315 09:22:50.183533 1475 tensorflow.cc:2192] 'tensorflow' TRITONBACKEND API version: 1.8
I0315 09:22:50.183544 1475 tensorflow.cc:2216] backend configuration:
{}
I0315 09:22:50.208311 1475 onnxruntime.cc:2319] TRITONBACKEND_Initialize: onnxruntime
I0315 09:22:50.208333 1475 onnxruntime.cc:2329] Triton TRITONBACKEND API version: 1.8
I0315 09:22:50.208341 1475 onnxruntime.cc:2335] 'onnxruntime' TRITONBACKEND API version: 1.8
I0315 09:22:50.208352 1475 onnxruntime.cc:2365] backend configuration:
{}
I0315 09:22:50.232132 1475 openvino.cc:1207] TRITONBACKEND_Initialize: openvino
I0315 09:22:50.232170 1475 openvino.cc:1217] Triton TRITONBACKEND API version: 1.8
I0315 09:22:50.232251 1475 openvino.cc:1223] 'openvino' TRITONBACKEND API version: 1.8
I0315 09:22:50.709292 1475 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fed74000000' with size 268435456
I0315 09:22:50.733248 1475 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0315 09:22:50.733263 1475 cuda_memory_manager.cc:105] CUDA memory pool is created on device 1 with size 67108864
E0315 09:22:51.549144 1475 server.cc:201] Failed to finalize CUDA memory manager: CNMEM_STATUS_CUDA_ERROR
W0315 09:22:52.705603 1475 server.cc:208] failed to enable peer access for some device pairs
I0315 09:22:52.758174 1475 model_repository_manager.cc:997] loading: preprocessing:1
I0315 09:22:52.859095 1475 model_repository_manager.cc:997] loading: postprocessing:1
I0315 09:22:52.879054 1475 python.cc:1903] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0 (CPU device 0)
I0315 09:22:52.962396 1475 model_repository_manager.cc:997] loading: fastertransformer:1
I0315 09:22:54.029058 1475 model_repository_manager.cc:1152] successfully loaded 'preprocessing' version 1
I0315 09:22:55.514393 1475 libfastertransformer.cc:1226] TRITONBACKEND_Initialize: fastertransformer
I0315 09:22:55.514437 1475 libfastertransformer.cc:1236] Triton TRITONBACKEND API version: 1.8
I0315 09:22:55.514526 1475 libfastertransformer.cc:1242] 'fastertransformer' TRITONBACKEND API version: 1.8
I0315 09:22:55.514582 1475 python.cc:1903] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0 (CPU device 0)
I0315 09:22:56.579728 1475 libfastertransformer.cc:1274] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
W0315 09:22:56.580620 1475 libfastertransformer.cc:149] model configuration:
{
omitted here
}
I0315 09:22:56.583496 1475 model_repository_manager.cc:1152] successfully loaded 'postprocessing' version 1
I0315 09:22:56.591632 1475 libfastertransformer.cc:1320] TRITONBACKEND_ModelInstanceInitialize: fastertransformer_0 (device 0)
W0315 09:22:56.591658 1475 libfastertransformer.cc:453] Faster transformer model instance is created at GPU '0'
W0315 09:22:56.591665 1475 libfastertransformer.cc:459] Model name gpt-j-6b
W0315 09:22:56.591677 1475 libfastertransformer.cc:578] Get input name: input_ids, type: TYPE_UINT32, shape: [-1]
W0315 09:22:56.591686 1475 libfastertransformer.cc:578] Get input name: start_id, type: TYPE_UINT32, shape: [1]
W0315 09:22:56.591694 1475 libfastertransformer.cc:578] Get input name: end_id, type: TYPE_UINT32, shape: [1]
W0315 09:22:56.591701 1475 libfastertransformer.cc:578] Get input name: input_lengths, type: TYPE_UINT32, shape: [1]
W0315 09:22:56.591709 1475 libfastertransformer.cc:578] Get input name: request_output_len, type: TYPE_UINT32, shape: [-1]
W0315 09:22:56.591717 1475 libfastertransformer.cc:578] Get input name: runtime_top_k, type: TYPE_UINT32, shape: [1]
W0315 09:22:56.591725 1475 libfastertransformer.cc:578] Get input name: runtime_top_p, type: TYPE_FP32, shape: [1]
W0315 09:22:56.591733 1475 libfastertransformer.cc:578] Get input name: beam_search_diversity_rate, type: TYPE_FP32, shape: [1]
W0315 09:22:56.591741 1475 libfastertransformer.cc:578] Get input name: temperature, type: TYPE_FP32, shape: [1]
W0315 09:22:56.591748 1475 libfastertransformer.cc:578] Get input name: len_penalty, type: TYPE_FP32, shape: [1]
W0315 09:22:56.591756 1475 libfastertransformer.cc:578] Get input name: repetition_penalty, type: TYPE_FP32, shape: [1]
W0315 09:22:56.591764 1475 libfastertransformer.cc:578] Get input name: random_seed, type: TYPE_INT32, shape: [1]
W0315 09:22:56.591771 1475 libfastertransformer.cc:578] Get input name: is_return_log_probs, type: TYPE_BOOL, shape: [1]
W0315 09:22:56.591779 1475 libfastertransformer.cc:578] Get input name: beam_width, type: TYPE_UINT32, shape: [1]
W0315 09:22:56.591787 1475 libfastertransformer.cc:578] Get input name: bad_words_list, type: TYPE_INT32, shape: [2, -1]
W0315 09:22:56.591796 1475 libfastertransformer.cc:578] Get input name: stop_words_list, type: TYPE_INT32, shape: [2, -1]
W0315 09:22:56.591807 1475 libfastertransformer.cc:620] Get output name: output_ids, type: TYPE_UINT32, shape: [-1, -1]
W0315 09:22:56.591815 1475 libfastertransformer.cc:620] Get output name: sequence_length, type: TYPE_UINT32, shape: [-1]
W0315 09:22:56.591823 1475 libfastertransformer.cc:620] Get output name: cum_log_probs, type: TYPE_FP32, shape: [-1]
W0315 09:22:56.591831 1475 libfastertransformer.cc:620] Get output name: output_log_probs, type: TYPE_FP32, shape: [-1, -1]

It is stuck at this point.
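For what it's worth, one workaround sometimes suggested for CNMeM finalize errors is to shrink or disable Triton's per-device CUDA memory pool via the `--cuda-memory-pool-byte-size` flag (a standard tritonserver option). Whether it helps in this exact setup is untested; this is just a sketch of the invocation:

```shell
# Hypothetical workaround (untested for this issue): set a zero-byte CUDA
# memory pool on devices 0 and 1 so Triton does not create (and later
# finalize) the CNMeM arena at startup. The flag format is <device>:<bytes>
# and it may be repeated once per device.
CUDA_VISIBLE_DEVICES=0,1 /opt/tritonserver/bin/tritonserver \
  --model-repository=./triton-model-store/gptj/. \
  --cuda-memory-pool-byte-size=0:0 \
  --cuda-memory-pool-byte-size=1:0
```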