CUDA: Operation Not Supported

### Description

Hi, I'm trying to run triton:22.03 / FasterTransformer inside a Kubernetes pod.

Running

```
CUDA_VISIBLE_DEVICES=0 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=${WORKSPACE}/all_models/gptj/
```

gives me this error:

```
  what():  [FT][ERROR] CUDA runtime error: operation not supported /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/allocator.h:160
```

The failing call, at `allocator.h:160`, is:

```cpp
check_cuda_error(cudaDeviceGetDefaultMemPool(&mempool, device_id));
```

Full error log:

```
root@triton-deployment:/workspace/build/fastertransformer_backend/all_models/gptj/fastertransformer# CUDA_VISIBLE_DEVICES=0 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=${WORKSPACE}/all_models/gptj/
I0504 19:10:00.078200 3296 libtorch.cc:1309] TRITONBACKEND_Initialize: pytorch
I0504 19:10:00.078309 3296 libtorch.cc:1319] Triton TRITONBACKEND API version: 1.8
I0504 19:10:00.078314 3296 libtorch.cc:1325] 'pytorch' TRITONBACKEND API version: 1.8
2023-05-04 19:10:00.248359: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-05-04 19:10:00.281753: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0504 19:10:00.281830 3296 tensorflow.cc:2176] TRITONBACKEND_Initialize: tensorflow
I0504 19:10:00.281850 3296 tensorflow.cc:2186] Triton TRITONBACKEND API version: 1.8
I0504 19:10:00.281854 3296 tensorflow.cc:2192] 'tensorflow' TRITONBACKEND API version: 1.8
I0504 19:10:00.281858 3296 tensorflow.cc:2216] backend configuration: {}
I0504 19:10:00.283495 3296 onnxruntime.cc:2319] TRITONBACKEND_Initialize: onnxruntime
I0504 19:10:00.283521 3296 onnxruntime.cc:2329] Triton TRITONBACKEND API version: 1.8
I0504 19:10:00.283526 3296 onnxruntime.cc:2335] 'onnxruntime' TRITONBACKEND API version: 1.8
I0504 19:10:00.283529 3296 onnxruntime.cc:2365] backend configuration: {}
I0504 19:10:00.299472 3296 openvino.cc:1207] TRITONBACKEND_Initialize: openvino
I0504 19:10:00.299491 3296 openvino.cc:1217] Triton TRITONBACKEND API version: 1.8
I0504 19:10:00.299496 3296 openvino.cc:1223] 'openvino' TRITONBACKEND API version: 1.8
I0504 19:10:00.588906 3296 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x10018000000' with size 268435456
I0504 19:10:00.589474 3296 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
E0504 19:10:00.591299 3296 model_repository_manager.cc:1927] Poll failed for model directory 'ensemble': ensemble input 'runtime_top_k' is optional, optional ensemble input is not currently supported
I0504 19:10:00.595117 3296 model_repository_manager.cc:997] loading: preprocessing:1
I0504 19:10:00.695602 3296 model_repository_manager.cc:997] loading: postprocessing:1
I0504 19:10:00.703348 3296 python.cc:1903] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0 (CPU device 0)
I0504 19:10:00.796223 3296 model_repository_manager.cc:997] loading: fastertransformer:1
I0504 19:10:02.929660 3296 model_repository_manager.cc:1152] successfully loaded 'preprocessing' version 1
I0504 19:10:03.063112 3296 libfastertransformer.cc:1828] TRITONBACKEND_Initialize: fastertransformer
I0504 19:10:03.063213 3296 libfastertransformer.cc:1838] Triton TRITONBACKEND API version: 1.8
I0504 19:10:03.063328 3296 libfastertransformer.cc:1844] 'fastertransformer' TRITONBACKEND API version: 1.8
I0504 19:10:03.063415 3296 libfastertransformer.cc:1876] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
I0504 19:10:03.064111 3296 libfastertransformer.cc:372] Instance group type: KIND_CPU count: 1
I0504 19:10:03.064170 3296 libfastertransformer.cc:402] Sequence Batching: disabled
I0504 19:10:03.064207 3296 libfastertransformer.cc:412] Dynamic Batching: disabled
I0504 19:10:03.064357 3296 libfastertransformer.cc:438] Before Loading Weights:
after allocation : free: 28.85 GB, total: 32.00 GB, used: 3.15 GB
I0504 19:10:16.380581 3296 libfastertransformer.cc:448] After Loading Weights:
after allocation : free: 17.58 GB, total: 32.00 GB, used: 14.42 GB
I0504 19:10:16.381466 3296 python.cc:1903] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0 (CPU device 0)
I0504 19:10:17.839193 3296 libfastertransformer.cc:472] Before Loading Model:
I0504 19:10:17.839390 3296 model_repository_manager.cc:1152] successfully loaded 'postprocessing' version 1
after allocation : free: 17.58 GB, total: 32.00 GB, used: 14.42 GB
terminate called after throwing an instance of 'std::runtime_error'
  what():  [FT][ERROR] CUDA runtime error: operation not supported /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/allocator.h:160
[triton-deployment:03296] *** Process received signal ***
[triton-deployment:03296] Signal: Aborted (6)
[triton-deployment:03296] Signal code: (-6)
[triton-deployment:03296] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f4715572420]
[triton-deployment:03296] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f4714cf800b]
[triton-deployment:03296] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f4714cd7859]
[triton-deployment:03296] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7f47150b1911]
[triton-deployment:03296] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7f47150bd38c]
[triton-deployment:03296] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7f47150bd3f7]
[triton-deployment:03296] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7f47150bd6a9]
[triton-deployment:03296] [ 7] /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so(_ZN17fastertransformer5checkI9cudaErrorEEvT_PKcS4_i+0x219)[0x7f459d04ebe9]
[triton-deployment:03296] [ 8] /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so(_ZN17fastertransformer9AllocatorILNS_13AllocatorTypeE0EEC1Ei+0x123)[0x7f459d08b813]
[triton-deployment:03296] [ 9] /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so(_ZN15GptJTritonModelI6__halfE19createModelInstanceEiiP11CUstream_stSt4pairISt6vectorIN17fastertransformer9NcclParamESaIS7_EES9_ESt10shared_ptrINS6_18AbstractCustomCommEE+0xad)[0x7f459d14a68d]
[triton-deployment:03296] [10] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x19a38)[0x7f46402d6a38]
[triton-deployment:03296] [11] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x1a323)[0x7f46402d7323]
[triton-deployment:03296] [12] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x3c11e)[0x7f46402f911e]
[triton-deployment:03296] [13] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f47150e9de4]
[triton-deployment:03296] [14] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f4715566609]
[triton-deployment:03296] [15] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f4714dd4133]
[triton-deployment:03296] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
0504 19:10:18.364089 3359 pb_stub.cc:821] Non-graceful termination detected.
0504 19:10:18.512814 3300 pb_stub.cc:821] Non-graceful termination detected.
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node triton-deployment exited on signal 6 (Aborted).
```

I've gotten this same error with both GPT-J and T5. It's likely a CUDA problem, but as far as I know I have the correct versions.

Here is my `nvidia-smi` output:

```
root@triton-deployment:/workspace/build/fastertransformer_backend/all_models/gptj/fastertransformer# nvidia-smi
Thu May  4 19:15:20 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID V100D-32C      On   | 00000000:02:00.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

and `nvcc --version`:

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Thu_Feb_10_18:23:41_PST_2022
Cuda compilation tools, release 11.6, V11.6.112
Build cuda_11.6.r11.6/compiler.30978841_0
```

Help would be appreciated, thanks!


### Reproduced Steps

```
CUDA_VISIBLE_DEVICES=0 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=${WORKSPACE}/all_models/gptj/
```

triton-inference-server / fastertransformer_backend, issue #127