Open kelkarn opened 7 months ago
As an additional data point, I tried running a Triton server with the ONNX backend and deploying an ONNX model on the MIG-enabled node pool, and that worked: the Triton server transitioned to the 'Running' state in Kubernetes, and the logs showed all the ONNX models in the READY state.
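For comparison, model readiness can also be checked directly against Triton's KServe-v2 HTTP endpoints rather than by scanning the logs. A sketch, assuming the default HTTP port 8000 and a hypothetical model name falcon7b:

```
# Server-level readiness: prints 200 only once the server reports ready
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready

# Per-model readiness ('falcon7b' is a placeholder model name)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/models/falcon7b/ready
```

These need to be run against the live server (e.g. via kubectl port-forward to the pod).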
CC @byshiue here to re-surface this issue.
Note that the error:

ias-sdep-1 CacheManager Init Failed. Error: -29
ias-sdep-1 W0206 01:52:07.225109 1 metrics.cc:731] DCGM unable to start: DCGM initialization error
I think this is only about DCGM failing to publish Prometheus metrics (GPU utilization etc.), which is fine for me and not a blocker; the main issue is why the model never shows up as READY in the model status table that is printed out.
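If the DCGM warning is indeed just noise under MIG, Triton's metrics reporting can be switched off at startup so the DCGM initialization path is never taken. A sketch using the tritonserver --allow-metrics and --allow-gpu-metrics flags (model repository path is a placeholder):

```
# Disable all Prometheus metrics, including the DCGM-backed GPU metrics
tritonserver --model-repository=/models --allow-metrics=false

# Or keep CPU/request metrics and drop only the GPU (DCGM) ones
tritonserver --model-repository=/models --allow-gpu-metrics=false
```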
CC @schetlur-nv - any updates/notes on this, Sharan?
TRT-LLM should not be affected by the use of MIG. I am not sure how TRT-LLM would be causing a DCGM startup failure. If it is still being seen, please share the server logs.
TRT-LLM version: 0.5.0
Triton server version: 23.10
GPU type: A100, 80GB, with MIG enabled (20 GB GPU memory per split, 3 splits per node).
I am trying to run a Falcon-7B model with TRT-LLM on a MIG-enabled node pool in AKS, but I run into the CacheManager Init Failed error quoted above. Strangely enough, my Triton server pod shows the 'Running' state in Kubernetes, but the logs show that no models are loaded and the CacheManager error appears. Does TRT-LLM work with MIG-enabled GPUs?
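As a side note for anyone debugging MIG placement: an individual MIG slice is addressed by its MIG UUID rather than a plain GPU index. A minimal sketch that pulls the UUIDs out of `nvidia-smi -L` output; the sample output below is illustrative, not taken from this cluster:

```python
import re

def mig_uuids(listing: str) -> list[str]:
    """Extract MIG device UUIDs from `nvidia-smi -L`-style output."""
    # MIG UUIDs carry a "MIG-" prefix, which distinguishes them from
    # the parent GPU's "GPU-..." UUID on the first line.
    return re.findall(r"UUID:\s*(MIG-[0-9a-fA-F-]+)", listing)

# Illustrative listing for an A100 split into 3g.20gb instances;
# real UUIDs will differ on every node. Live output would come from
# e.g. subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).
sample = """\
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-11111111-2222-3333-4444-555555555555)
  MIG 3g.20gb     Device  0: (UUID: MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)
  MIG 3g.20gb     Device  1: (UUID: MIG-ffffffff-0000-1111-2222-333333333333)
"""

print(mig_uuids(sample))
# A single slice can then be targeted via, e.g.:
#   CUDA_VISIBLE_DEVICES=<MIG UUID> tritonserver ...
```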