triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Is the TRT-LLM backend supported on MIG-enabled node pool? #334

Open kelkarn opened 7 months ago

kelkarn commented 7 months ago

TRT-LLM version: 0.5.0
Triton server version: 23.10
GPU type: A100 80GB, with MIG enabled (20 GB of GPU memory per slice, 3 slices per node).

I am trying to run a Falcon-7B model with TRT-LLM on a MIG-enabled node pool in AKS, but I run into this error:

│ ias-sdep-1 I0206 01:52:07.209989 1 server.cc:674]                                                                                                                                          │
│ ias-sdep-1 +-------+---------+--------+                                                                                                                                                    │
│ ias-sdep-1 | Model | Version | Status |                                                                                                                                                    │
│ ias-sdep-1 +-------+---------+--------+                                                                                                                                                    │
│ ias-sdep-1 +-------+---------+--------+                                                                                                                                                    │
│ ias-sdep-1                                                                                                                                                                                 │
│ ias-sdep-1 CacheManager Init Failed. Error: -29                                                                                                                                            │
│ ias-sdep-1 W0206 01:52:07.225109 1 metrics.cc:731] DCGM unable to start: DCGM initialization error                                                                                         │
│ ias-sdep-1 I0206 01:52:07.225340 1 metrics.cc:703] Collecting CPU metrics                                                                                                                  │
│ ias-sdep-1 I0206 01:52:07.225466 1 tritonserver.cc:2435]                                                                                                                                   │
│ ias-sdep-1 +----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------- │
│ ias-sdep-1 | Option                           | Value                                                                                                                                      │
│ ias-sdep-1 +----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------- │
│ ias-sdep-1 | server_id                        | triton                                                                                                                                     │
│ ias-sdep-1 | server_version                   | 2.37.0                                                                                                                                     │
│ ias-sdep-1 | server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda │
│ ias-sdep-1 | model_repository_path[0]         | /mnt/cache/ias-models/ias-sdep-1                                                                                                           │
│ ias-sdep-1 | model_control_mode               | MODE_NONE                                                                                                                                  │
│ ias-sdep-1 | strict_model_config              | 0                                                                                                                                          │
│ ias-sdep-1 | rate_limit                       | OFF                                                                                                                                        │
│ ias-sdep-1 | pinned_memory_pool_byte_size     | 268435456                                                                                                                                  │
│ ias-sdep-1 | cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                   │
│ ias-sdep-1 | min_supported_compute_capability | 6.0                                                                                                                                        │
│ ias-sdep-1 | strict_readiness                 | 1                                                                                                                                          │
│ ias-sdep-1 | exit_timeout                     | 30                                                                                                                                         │
│ ias-sdep-1 | cache_enabled                    | 0                                                                                                                                          │
│ ias-sdep-1 +----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------- │
│ ias-sdep-1                                                                                                                                                                                 │
│ ias-sdep-1 I0206 01:52:07.227076 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:9500                                                                                       │
│ ias-sdep-1 I0206 01:52:07.227260 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:9000                                                                                                │
│ ias-sdep-1 I0206 01:52:07.268460 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002

Strangely enough, my Triton server pod shows the 'Running' state in Kubernetes, but the logs above show that no models are loaded, along with the CacheManager error.

Does TRT-LLM work with MIG-enabled GPUs?
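For what it's worth, one way to confirm which MIG instance a container actually sees is to list devices with `nvidia-smi -L` and pin `CUDA_VISIBLE_DEVICES` to a single MIG UUID (a CUDA process can only use one MIG instance at a time). A minimal sketch — the UUIDs below are made up for illustration, and on Kubernetes the NVIDIA device plugin normally handles this for you when the pod requests an `nvidia.com/mig-*` resource:

```shell
#!/bin/sh
# Illustrative output of `nvidia-smi -L` on a MIG-enabled A100; the UUIDs are
# made up. In a real pod you would run `nvidia-smi -L` directly instead.
sample='GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-11111111-2222-3333-4444-555555555555)
  MIG 2g.20gb     Device  0: (UUID: MIG-8c5a6f2e-1111-2222-3333-444444444444)'

# Pull out the first MIG instance UUID; this is the value CUDA_VISIBLE_DEVICES
# expects when pinning a process to one MIG slice.
mig_uuid=$(printf '%s\n' "$sample" | grep -o 'MIG-[0-9a-f-]*' | head -n 1)
echo "$mig_uuid"

# Pin the Triton process to that single MIG instance.
export CUDA_VISIBLE_DEVICES="$mig_uuid"
```

If the pod was scheduled with the whole GPU exposed rather than a single slice, the backend may see a different amount of memory than you provisioned per slice, which is worth ruling out here.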

kelkarn commented 7 months ago

As an additional data point, I tried running a Triton server with an ONNX backend and deploying an ONNX model on the MIG-enabled node pool, and that worked. I was able to see the Triton server transition to 'Running' state in Kubernetes, and the logs showed that all the ONNX models were in READY state.

kelkarn commented 5 months ago

CC @byshiue here to re-surface this issue.

Note that the error:

│ ias-sdep-1 CacheManager Init Failed. Error: -29                                                                                                                                            │
│ ias-sdep-1 W0206 01:52:07.225109 1 metrics.cc:731] DCGM unable to start: DCGM initialization error                                                                                         

is, I think, only about DCGM being unable to publish Prometheus metrics (GPU utilization, etc.), which is fine for me and not a blocker. The main issue is why the model does not show up as 'READY' in the model status table that is printed out.
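If the DCGM warning is indeed just noise for this setup, Triton can be launched with GPU metrics disabled so DCGM is never initialized, which would at least separate the metrics warning from the model-loading problem. A sketch of the launch flags, using the model repository path from the log above (ports and other options omitted):

```shell
# Disable GPU (DCGM-backed) metrics while keeping CPU metrics and the
# rest of the metrics endpoint.
tritonserver \
  --model-repository=/mnt/cache/ias-models/ias-sdep-1 \
  --allow-gpu-metrics=false
```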

kelkarn commented 4 months ago

CC @schetlur-nv - any updates or notes on this, Sharan?

schetlur-nv commented 2 months ago

TRT-LLM should not be affected by the use of MIG, and I am not sure how TRT-LLM would cause a DCGM startup failure. If you are still seeing this, please share the full server logs.