Thank you for your dedication to developing a GPU memory oversubscription solution, which has been immensely beneficial to our work.
I've run some local tests involving multiple processes; however, the GPU utilization data obtained via nvidia-smi appears rather coarse-grained. Reviewing the README, I didn't find a finer-grained monitoring approach, such as Prometheus metrics.
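For context, this is roughly how I've been pulling the numbers locally. A minimal sketch, assuming the pynvml bindings are installed and the driver exposes per-process accounting; it only reports device-level utilization plus per-process memory, with no per-pod attribution:

```python
# Minimal sketch: device utilization and per-process GPU memory via NVML.
# Assumes the pynvml package is installed (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"GPU {i}: util={util.gpu}% mem_util={util.memory}%")
        # Per-process memory; usedGpuMemory may be None when accounting
        # is unavailable (e.g. inside some container/virtualization setups).
        for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
            used_mib = (proc.usedGpuMemory or 0) // (1024 * 1024)
            print(f"  pid={proc.pid} used_memory={used_mib} MiB")
finally:
    pynvml.nvmlShutdown()
```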
Could you offer some suggestions for monitoring GPU usage per pod and per process?