Question Regarding Remediation

leptonai / gpud

Apache License 2.0

215 stars, 14 forks

After deploying GPUd to our clusters, we observed an 80% reduction in human intervention and a 50% decrease in GPU unavailability

We will be open-sourcing more of our internal tooling in this repo in the coming weeks, but to give you some idea:

80% reduction in human intervention

Most of the time saved comes from root-causing GPU issues (dmesg parsing, Xid events, clock events). Some remediation is automated, for example: Xid event detected -> restart the NVIDIA device plugin (for k8s) until the unhealthy GPU comes back. But we find most GPU issues are non-actionable, so reporting or root-causing them sooner definitely helps.
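To make the "dmesg parsing -> Xid detection" step concrete, here is a minimal Python sketch. It is not GPUd's implementation. It assumes the kernel log line format the NVIDIA driver typically emits (`NVRM: Xid (PCI:<address>): <code>, ...`), and the names `parse_xid` and `needs_plugin_restart`, as well as the set of codes passed in, are hypothetical and only for illustration.

```python
import re
from typing import NamedTuple, Optional, Set

# Matches NVIDIA driver kernel-log lines such as:
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
# The "PCI:" prefix is optional because older drivers omit it.
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?:PCI:)?([0-9a-fA-F:.]+)\): (\d+),?\s*(.*)"
)


class XidEvent(NamedTuple):
    pci_address: str  # PCI address of the GPU that reported the event
    code: int         # numeric Xid code
    detail: str       # remainder of the message, kept for reporting


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return an XidEvent if the dmesg line carries an Xid message, else None."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    return XidEvent(match.group(1), int(match.group(2)), match.group(3).strip())


def needs_plugin_restart(event: XidEvent, restart_codes: Set[int]) -> bool:
    """Hypothetical policy check: restart only for codes the operator opted into.

    Which codes belong in restart_codes is an assumption left to the caller;
    everything else would just be reported.
    """
    return event.code in restart_codes
```

In this sketch, a caller would feed each new dmesg line to `parse_xid`, report every event it returns, and trigger the device plugin restart only when `needs_plugin_restart` says so. Keeping the policy as an explicit set reflects the point above that most events are report-only.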

Question Regarding Remediation #75