Closed zoltan-fedor closed 1 year ago
I thought the issue might be the GPU being a Quadro P3200, which is a Pascal GPU with compute capability 6.1, so I recompiled the Triton server with `-D SM=61` set
(https://github.com/triton-inference-server/fastertransformer_backend/blob/main/docker/create_dockerfile_and_build.py#L105):
`create_dockerfile_and_build.py`:

```
...
RUN cmake \\
      -D SM=61 \\
      -D CMAKE_EXPORT_COMPILE_COMMANDS=1 \\
      -D CMAKE_BUILD_TYPE=Release \\
      -D CMAKE_INSTALL_PREFIX=/opt/tritonserver \\
      -D TRITON_COMMON_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \\
      -D TRITON_CORE_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \\
      -D TRITON_BACKEND_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \\
      ..
RUN make -j"$(grep -c ^processor /proc/cpuinfo)" install
...
```
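For reference, the `SM` value passed to cmake is just the GPU's CUDA compute capability with the dot removed (6.1 becomes 61). A minimal sketch of that mapping — the helper name is my own, and on a live machine the capability tuple could be obtained with e.g. `torch.cuda.get_device_capability()` rather than hard-coded:

```python
# Hypothetical helper: derive the cmake -D SM= value from a GPU's
# CUDA compute capability, given as a (major, minor) tuple.
def sm_flag(capability):
    major, minor = capability
    return f"{major}{minor}"

# Quadro P3200 (Pascal) has compute capability 6.1 -> SM=61
print(sm_flag((6, 1)))  # -> 61
```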
That has solved the issue!
### Description
Is it possible that this is an issue because of the Pascal GPU? But it is strange that the Triton server seems to load fine, with no errors; it just maxes out the CPU at inference time and never completes.
### Reproduced Steps
I am trying to reproduce https://github.com/triton-inference-server/fastertransformer_backend/issues/95 and follow https://github.com/triton-inference-server/fastertransformer_backend/blob/main/docs/t5_guide.md#run-t5-v11flan-t5mt5
Build the Triton image:
Start the Triton server:
The Triton server's startup logs:
Then I run the summarization example against it:
What I observe is that the Triton server maxes out a single CPU core and shows no utilization of the GPU:
While it maxes out the CPU:
And this is going on for 20+ minutes. No observable GPU utilization at all.
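As a side note on how I checked this: GPU utilization can be read with `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader` and parsed programmatically. A sketch of the parsing, using a hard-coded sample string (hypothetical values) in place of the real command output, which on a live machine would come from `subprocess.check_output(...)`:

```python
# Sketch: parse the output of
#   nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader
# One line per GPU, e.g. "0 %".
sample_output = "0 %\n"  # hypothetical idle-GPU reading

def parse_utilization(text):
    # Strip the trailing " %" unit and convert each line to an int
    return [int(line.rstrip(" %")) for line in text.strip().splitlines()]

print(parse_utilization(sample_output))  # -> [0]
```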
Any ideas why it would not be utilizing the GPU at all, while the Triton server startup log above shows that the model was loaded onto the GPU?