CPU maxed out, no GPU utilization, inference never completing

Description

Branch: main
Docker Version: 20.10.21
GPU Type: Quadro P3200 with Max-Q Design
Triton Docker Image: triton_with_ft:22.12

Is it possible that this is an issue because of the Pascal GPU? But it is strange that the Triton server seems to load, no error, just maxes out the CPU at inference time, never completing.

Reproduced Steps

I am trying to reproduce https://github.com/triton-inference-server/fastertransformer_backend/issues/95 and https://github.com/triton-inference-server/fastertransformer_backend/blob/main/docs/t5_guide.md#run-t5-v11flan-t5mt5

sudo apt-get install git-lfs
git lfs install
git lfs clone https://huggingface.co/google/flan-t5-small

python3 ./build/_deps/repo-ft-src/examples/pytorch/t5/utils/huggingface_t5_ckpt_convert.py \
        -saved_dir flan-t5-small/c-models \
        -in_file flan-t5-small/ \
        -inference_tensor_para_size 1 \
        -weight_data_type fp32

Build the Triton image:

$ git clone https://github.com/triton-inference-server/fastertransformer_backend
$ cd fastertransformer_backend
$ python docker/create_dockerfile_and_build.py --triton-version 22.12

Start the Triton server:

$ sudo docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${PWD}/models:/models -v ${PWD}/fastertransformer_backend/all_models/t5:/t5-models tritonserver_with_ft tritonserver --model-repository=/t5-models

The Triton server's startup logs:

=============================
== Triton Inference Server ==
=============================

NVIDIA Release 22.12 (build 50109463)
Triton Server Version 2.29.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 11.8 driver version 520.61.05 with kernel driver version 470.161.03.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

I0304 00:42:40.902896 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fe464000000' with size 268435456
I0304 00:42:40.903284 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0304 00:42:40.905394 1 model_lifecycle.cc:459] loading: fastertransformer:1
I0304 00:42:41.084832 1 libfastertransformer.cc:1828] TRITONBACKEND_Initialize: fastertransformer
I0304 00:42:41.084861 1 libfastertransformer.cc:1838] Triton TRITONBACKEND API version: 1.10
I0304 00:42:41.084867 1 libfastertransformer.cc:1844] 'fastertransformer' TRITONBACKEND API version: 1.10
I0304 00:42:41.085178 1 libfastertransformer.cc:1876] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
I0304 00:42:41.086185 1 libfastertransformer.cc:372] Instance group type: KIND_CPU count: 1
I0304 00:42:41.086196 1 libfastertransformer.cc:402] Sequence Batching: disabled
I0304 00:42:41.086201 1 libfastertransformer.cc:412] Dynamic Batching: disabled
I0304 00:42:41.086420 1 libfastertransformer.cc:438] Before Loading Weights:
after allocation    : free:  4.35 GB, total:  5.93 GB, used:  1.57 GB
after allocation    : free:  4.17 GB, total:  5.93 GB, used:  1.76 GB
I0304 00:42:41.290559 1 libfastertransformer.cc:448] After Loading Weights:
W0304 00:42:41.290611 1 libfastertransformer.cc:572] skipping model configuration auto-complete for 'fastertransformer': not supported for fastertransformer backend
I0304 00:42:41.291280 1 libfastertransformer.cc:472] Before Loading Model:
after allocation    : free:  4.17 GB, total:  5.93 GB, used:  1.76 GB
[WARNING] gemm_config.in is not found; using default GEMM algo
I0304 00:42:41.522746 1 libfastertransformer.cc:489] After Loading Model:
I0304 00:42:41.522853 1 libfastertransformer.cc:824] Model instance is created on GPU [ 0 ]
I0304 00:42:41.522874 1 libfastertransformer.cc:1940] TRITONBACKEND_ModelInstanceInitialize: fastertransformer_0 (count 1) (instance_id 0)
after allocation    : free:  4.07 GB, total:  5.93 GB, used:  1.86 GB
I0304 00:42:41.523061 1 model_lifecycle.cc:694] successfully loaded 'fastertransformer' version 1
I0304 00:42:41.523155 1 server.cc:563] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0304 00:42:41.523234 1 server.cc:590] 
+-------------------+-----------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend           | Path                                                                        | Config                                                                                                                                                        |
+-------------------+-----------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| fastertransformer | /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}} |
+-------------------+-----------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0304 00:42:41.523301 1 server.cc:633] 
+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| fastertransformer | 1       | READY  |
+-------------------+---------+--------+

I0304 00:42:41.577945 1 metrics.cc:864] Collecting metrics for GPU 0: Quadro P3200 with Max-Q Design
I0304 00:42:41.578342 1 metrics.cc:757] Collecting CPU metrics
I0304 00:42:41.578586 1 tritonserver.cc:2264] 
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                               |
| server_version                   | 2.29.0                                                                                                                                                                                               |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace logging |
| model_repository_path[0]         | /t5-models                                                                                                                                                                                           |
| model_control_mode               | MODE_NONE                                                                                                                                                                                            |
| strict_model_config              | 0                                                                                                                                                                                                    |
| rate_limit                       | OFF                                                                                                                                                                                                  |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                            |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                             |
| response_cache_byte_size         | 0                                                                                                                                                                                                    |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                  |
| strict_readiness                 | 1                                                                                                                                                                                                    |
| exit_timeout                     | 30                                                                                                                                                                                                   |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0304 00:42:41.579751 1 grpc_server.cc:4819] Started GRPCInferenceService at 0.0.0.0:8001
I0304 00:42:41.580033 1 http_server.cc:3477] Started HTTPService at 0.0.0.0:8000
I0304 00:42:41.621799 1 http_server.cc:184] Started Metrics Service at 0.0.0.0:8002
W0304 00:42:42.582581 1 metrics.cc:603] Unable to get power limit for GPU 0. Status:Success, value:0.000000
W0304 00:42:43.583204 1 metrics.cc:603] Unable to get power limit for GPU 0. Status:Success, value:0.000000
W0304 00:42:44.586661 1 metrics.cc:603] Unable to get power limit for GPU 0. Status:Success, value:0.000000

Then I run the summarization example against it:

$ python fastertransformer_backend/tools/t5_utils/summarization.py --ft_model_location models/flan-t5-small/c-models/1-gpu/ \
                                        --hf_model_location models/flan-t5-small/ \
                                        --test_ft \
                                        --test_hf \
                                        --cache_path /tmp/workdir/datasets/ccdv/ \
                                        --data_type fp16 \
                                        --protocol grpc

What I observe that the Triton server maxes out a single CPU core and show no utilization of the GPU:

root@538e8ab8af60:/workspace# nvidia-smi
Sat Mar  4 00:53:52 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P3200 wi...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   55C    P0    23W /  N/A |   2618MiB /  6069MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

While it maxes out the CPU:

root@538e8ab8af60:/workspace# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1 96.8  1.9 26509836 940144 ?     Ssl  00:42  11:18 tritonserver --model-repository=/t5-models
root         109  0.0  0.0 174776 17008 ?        Ssl  00:42   0:00 orted --hnp --set-sid --report-uri 27 --singleton-died-pipe 28 -mca state_novm_select 1 -mca ess hnp -mca pmix ^s1,s2,cray,isola
root         164  0.0  0.0   4248  3536 pts/0    Ss   00:53   0:00 bash
root         189  0.0  0.0   5900  2840 pts/0    R+   00:54   0:00 ps aux

And this is going on for 20+ minutes. No observable GPU utilization at all.

Any ideas why it would not utilizing the GPU at all, while the Triton server startup log above shows that the model got loaded onto the GPU?

triton-inference-server / fastertransformer_backend

CPU maxed out, no GPU utilization, inference never completing #100

Description

Reproduced Steps