leptonai / gpud

Apache License 2.0

Question Regarding Remediation #75

Closed: ivelichkovich closed this issue 2 months ago

ivelichkovich commented 2 months ago

Hello,

I read this article, which sent me to this repo: https://blog.lepton.ai/introducing-gpud-the-missing-gpu-management-for-ai-0f0d026337e3 The article says, "After deploying GPUd to our clusters, we observed an 80% reduction in human intervention and a 50% decrease in GPU unavailability." This seems to imply that GPUd includes some automatic remediation. Is that the case? I can't seem to find how it does that.

gyuho commented 2 months ago

"After deploying GPUd to our clusters, we observed an 80% reduction in human intervention and a 50% decrease in GPU unavailability"

We will be open-sourcing more internal tooling in this repo in the coming weeks, but to give you some ideas:

"80% reduction in human intervention"

Most of the time saved is in root-causing GPU issues (dmesg parsing, Xid events, clock events). Some automated remediation exists, e.g. Xid event detection -> restart the NVIDIA device plugin (for k8s) until the unhealthy GPU comes back. But we find most GPU issues non-actionable, so reporting or root-causing them sooner definitely helps.
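
To make the "Xid event detection -> restart device plugin" flow concrete, here is a rough sketch in Python. It is not GPUd's actual code. The names `parse_xid_events`, `remediate`, `restart_plugin`, and `gpu_healthy` are hypothetical, and the restart and health check are passed in as callables because how you do them depends on your cluster. The regex assumes the usual NVIDIA driver kernel log format (`NVRM: Xid (PCI:...): <code>, ...`).

```python
import re
import time
from typing import Callable, Iterable, List, NamedTuple

# Typical NVIDIA driver kernel log line (as seen in dmesg):
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+)")


class XidEvent(NamedTuple):
    pci: str   # PCI address of the GPU that reported the event
    code: int  # Xid error code


def parse_xid_events(lines: Iterable[str]) -> List[XidEvent]:
    """Extract Xid events from kernel log lines; non-matching lines are skipped."""
    events = []
    for line in lines:
        m = XID_RE.search(line)
        if m:
            events.append(XidEvent(m.group("pci"), int(m.group("code"))))
    return events


def remediate(
    events: List[XidEvent],
    restart_plugin: Callable[[], None],  # hypothetical hook: restart the device plugin
    gpu_healthy: Callable[[], bool],     # hypothetical hook: is the GPU back?
    max_attempts: int = 3,
    wait_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Restart the device plugin until the GPU is healthy or attempts run out.

    Returns True if nothing needed fixing or the GPU recovered, and False if
    the issue should be escalated to a human.
    """
    if not events:
        return True
    for _ in range(max_attempts):
        restart_plugin()
        sleep(wait_seconds)  # give the plugin time to re-register the GPU
        if gpu_healthy():
            return True
    return False
```

In practice you would feed `parse_xid_events` the output of `dmesg` (or a kernel log stream) and report when `remediate` returns False, since many GPU issues cannot be fixed by a restart.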