Closed by ivelichkovich 2 months ago
After deploying GPUd to our clusters, we observed an 80% reduction in human intervention and a 50% decrease in GPU unavailability
We will be open-sourcing more internal tooling in this repo in the coming weeks, but to give you some ideas:
80% reduction in human intervention
Most of the time is saved in root-causing GPU issues (dmesg parsing, Xid events, clock events). There is some automated remediation, for example: Xid event detection -> restart the NVIDIA device plugin (for k8s) until the unhealthy GPU comes back. But we find most GPU issues are non-actionable, so reporting or root-causing them sooner definitely helps.
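To make the "Xid event detection -> restart device plugin" flow a bit more concrete, here is a rough Python sketch. It is not GPUd's actual code. It assumes the kernel log line looks like `NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, ...`, and the DaemonSet name and namespace in `restart_command` are hypothetical placeholders you would swap for your own.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed log shape: "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."
XID_RE = re.compile(r"NVRM: Xid \((?:PCI:)?(?P<pci>[0-9a-fA-F:.]+)\): (?P<code>\d+)")


class XidEvent(NamedTuple):
    pci: str   # PCI address of the GPU that reported the event
    code: int  # numeric Xid code


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return the Xid event found in one dmesg line, or None if there is none."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return XidEvent(pci=match.group("pci"), code=int(match.group("code")))


def find_xid_events(lines: Iterable[str]) -> List[XidEvent]:
    """Collect every Xid event from an iterable of dmesg lines."""
    return [event for event in map(parse_xid, lines) if event is not None]


def restart_command(
    daemonset: str = "nvidia-device-plugin-daemonset",  # hypothetical name
    namespace: str = "kube-system",                     # hypothetical namespace
) -> List[str]:
    """Build (but do not run) a kubectl command that restarts the device plugin."""
    return ["kubectl", "rollout", "restart", f"daemonset/{daemonset}", "-n", namespace]
```

You could feed `find_xid_events` the output of `dmesg` split into lines and, if it returns anything, hand `restart_command()` to `subprocess.run`. Since most GPU issues are non-actionable, a real setup would likely only restart for a short allowlist of Xid codes and just report the rest.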
Hello,
I read the article here: https://blog.lepton.ai/introducing-gpud-the-missing-gpu-management-for-ai-0f0d026337e3, which sent me to this repo. The article says:
After deploying GPUd to our clusters, we observed an 80% reduction in human intervention and a 50% decrease in GPU unavailability.
This seems to imply that GPUd includes some automatic remediation. Is that the case? I can't seem to find how it's doing that.