triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Enhancement Request: Additional GPU Information in Prometheus Metrics #6384

Open levipereira opened 1 year ago

levipereira commented 1 year ago

Is your feature request related to a problem? Please describe.

No.

Currently, Triton server exposes GPU utilization metrics in Prometheus format, identified only by GPU UUID, like so:

# HELP nv_gpu_utilization GPU utilization rate [0.0 - 1.0)
# TYPE nv_gpu_utilization gauge
nv_gpu_utilization{gpu_uuid="GPU-3fed825f-252b-32ea-e3d7-266c45b62ce7"} 0

I would like to request that these metrics also carry the GPU number and GPU name, similar to what nvidia-smi -L reports. With that information, Grafana dashboards could be built dynamically without having to look up each UUID on the physical host.

Example output of nvidia-smi -L:

GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-c8a1aa60-c24c-5ce2-fc43-068d14542d00)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-04727ce0-d35e-c535-9a43-b989af8d016f)

Including the GPU number and GPU name in the Prometheus metrics would improve the user experience and ease the dynamic creation of monitoring dashboards.
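To illustrate the mapping being requested, here is a minimal Python sketch that joins `nvidia-smi -L` output to a metric sample by UUID. The label names `gpu_index` and `gpu_name` and both helper functions are hypothetical, chosen only for this example; they are not part of Triton's current metrics output, and the sample line assumes the metric keeps its existing `gpu_uuid` label.

```python
import re

# One line of `nvidia-smi -L`, e.g.
# "GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-c8a1...)"
SMI_LINE = re.compile(
    r"^GPU (?P<index>\d+): (?P<name>.+) \(UUID: (?P<uuid>GPU-[0-9a-f-]+)\)$"
)

# A metric sample that carries only a gpu_uuid label.
METRIC_LINE = re.compile(
    r'^(?P<metric>\w+)\{gpu_uuid="(?P<uuid>[^"]+)"\} (?P<value>\S+)$'
)


def parse_smi_list(text):
    """Map GPU UUID -> (index, name) from `nvidia-smi -L` output."""
    gpus = {}
    for line in text.splitlines():
        m = SMI_LINE.match(line.strip())
        if m:
            gpus[m.group("uuid")] = (int(m.group("index")), m.group("name"))
    return gpus


def add_gpu_labels(metric_line, gpus):
    """Add the hypothetical gpu_index / gpu_name labels to one sample."""
    m = METRIC_LINE.match(metric_line.strip())
    if not m or m.group("uuid") not in gpus:
        # Leave HELP/TYPE comments and unknown GPUs untouched.
        return metric_line
    metric, uuid, value = m.group("metric"), m.group("uuid"), m.group("value")
    index, name = gpus[uuid]
    return (
        f'{metric}{{gpu_uuid="{uuid}",gpu_index="{index}",'
        f'gpu_name="{name}"}} {value}'
    )


smi_output = """\
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-c8a1aa60-c24c-5ce2-fc43-068d14542d00)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-04727ce0-d35e-c535-9a43-b989af8d016f)
"""
gpus = parse_smi_list(smi_output)
sample = 'nv_gpu_utilization{gpu_uuid="GPU-04727ce0-d35e-c535-9a43-b989af8d016f"} 0'
print(add_gpu_labels(sample, gpus))
```

With labels like these on the metric itself, a Grafana panel could group or title series by GPU number and name directly instead of by UUID.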

Thank you for considering this enhancement request.

Best regards, Levi Pereira

ClifHouck commented 10 months ago

I'm going to take a crack at this.

dyastremsky commented 8 months ago

@rmccorm4, what are your thoughts on this feature request? Let me know if you would like me to open a ticket.

dyastremsky commented 8 months ago

@ClifHouck, did you have success with this enhancement? Thanks for working on this!

ClifHouck commented 8 months ago

@dyastremsky Yes, but I ran into this bug: https://github.com/triton-inference-server/server/issues/6815

I've opened a PR to address it: https://github.com/triton-inference-server/core/pull/321

I was waiting for that to be resolved before opening another PR to address this issue.

dyastremsky commented 8 months ago

Thanks for letting me know, Clif. I'll take a look.

stefansedich commented 3 months ago

@ClifHouck Did your efforts on this stall? We have just set up Triton, and it would be great to have the GPU metrics tagged with the GPU number.