kubecost / features-bugs

A public repository for filing Kubecost feature requests and bugs. Please read the issue guidelines before filing an issue here.

Clusters using aws-virtual-gpu-device-plugin have negative GPU idle cost #15

Closed keithhand closed 6 months ago

keithhand commented 1 year ago

Describe the bug
A user reported negative GPU idle costs on their cluster. After looking into the details of their environment, we noticed that they use the aws-virtual-gpu-device-plugin to manage the GPU devices on their cluster. I was able to reproduce the same issue by deploying an AWS GPU-supported node and then deploying the controller. Before deploying the controller DaemonSet, valid values were displayed for the underlying node GPU cost, but after deploying it, the node_gpu_count metric emitted from /model/metrics is 0.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy an AWS cluster with a GPU-supported device (g4dn.xlarge)
  2. Deploy Kubecost and aws-virtual-gpu-device-plugin
  3. Deploy an example application using a GPU
  4. See that the node associated with the GPU has 0 GPUs, and the deployment has a negative GPU cost.
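One way to confirm step 4 is to inspect the Prometheus text-format output from Kubecost's /model/metrics endpoint and check the node_gpu_count samples. Below is a minimal sketch of such a check; the parser and the sample payload are illustrative (the node names are made up, not captured from a real cluster):

```python
# Minimal sketch: parse Prometheus text-format metrics (as served by
# Kubecost's /model/metrics endpoint) and report node_gpu_count per node.
# The sample payload below is illustrative, not from a real cluster.

def parse_gauge(metrics_text: str, metric_name: str) -> dict:
    """Return {name_with_labels: value} for every sample of metric_name."""
    samples = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if not line.startswith(metric_name):
            continue
        # A sample line looks like: name{labels} value
        name_and_labels, _, value = line.rpartition(" ")
        samples[name_and_labels] = float(value)
    return samples

sample = """\
# TYPE node_gpu_count gauge
node_gpu_count{node="ip-10-0-1-23.ec2.internal"} 0
node_gpu_count{node="ip-10-0-2-45.ec2.internal"} 1
"""

counts = parse_gauge(sample, "node_gpu_count")
zero_gpu_nodes = [k for k, v in counts.items() if v == 0]
print(zero_gpu_nodes)
```

A node that actually carries a GPU but appears in `zero_gpu_nodes` matches the symptom described in this issue.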

Expected behavior
Idle cost should still be associated with the node so that total cost is attributed to allocations correctly.
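The negative idle reported here is consistent with simple idle arithmetic: idle is roughly the node's resource cost minus the cost allocated to workloads on it, so if node_gpu_count drops to 0 (zero node GPU cost) while pods still carry a priced GPU allocation, idle goes negative. A toy illustration of that arithmetic (this is not Kubecost's actual code, and the hourly rate is made up):

```python
# Toy illustration of how node_gpu_count = 0 produces negative GPU idle.
# Idle = node resource cost - cost allocated to workloads on the node.

def gpu_idle_cost(node_gpu_count: int, gpu_hourly_rate: float,
                  allocated_gpu_cost: float) -> float:
    node_gpu_cost = node_gpu_count * gpu_hourly_rate
    return node_gpu_cost - allocated_gpu_cost

# Healthy case: node reports 1 GPU, one pod is allocated the full GPU.
print(gpu_idle_cost(1, 0.50, 0.50))   # → 0.0

# Bug case: the device plugin masks the GPU so node_gpu_count is 0,
# but the pod's GPU allocation is still priced.
print(gpu_idle_cost(0, 0.50, 0.50))   # → -0.5
```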

Screenshots
Namespace with GPU related to negative idle:

(screenshot)

Node with no controller deployed:

(screenshot)

Node deployed, then controller added:

(screenshot)

Prometheus metrics corresponding with toggling the label for the controller DaemonSet:

(screenshot)

Issue is synchronized with this Jira Task by Unito

github-actions[bot] commented 1 year ago

This issue has been marked as stale because it has been open for 360 days with no activity. Please remove the stale label or comment, or this issue will be closed in 5 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 365 days.

AjayTripathy commented 1 year ago

@keithhand @kaelanspatel do you know if this is still an issue?

keithhand commented 1 year ago

I haven't tested since the initial finding, so I'm not sure.

chipzoller commented 1 year ago

Transferred to features-bugs as this is not Helm chart specific.

AjayTripathy commented 1 year ago

@thomasvn seems mid-priority, but let's try to reproduce and potentially set up a QA cluster here to make sure it doesn't regress.

Historically it has been tough for us to set up QA clusters with GPUs because of the high cost of keeping them running, but maybe there's a chance to spin up/spin down here and run the test suite now.

chipzoller commented 6 months ago

If this issue only impacts the AWS vGPU plugin, I think we can probably disregard it, as it's fairly clear (although not openly stated anywhere I can find) that the project has been put out to pasture. The more modern approach here is one of several GPU-sharing strategies/technologies officially developed by NVIDIA. If this issue is more general to GPUs in nature, we'd want to check it out.
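For reference, one of the NVIDIA-supported sharing strategies mentioned above is time-slicing via the NVIDIA k8s-device-plugin, which is configured through a ConfigMap. A rough sketch of such a config (based on the plugin's documented format; the replica count is an arbitrary example):

```yaml
# Sketch of an NVIDIA k8s-device-plugin time-slicing config.
# Each physical GPU is advertised as 4 schedulable nvidia.com/gpu replicas.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```

Unlike the AWS vGPU plugin, this keeps the standard nvidia.com/gpu resource name, which is likely friendlier to tools that count node GPUs from that resource.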

AjayTripathy commented 6 months ago

I'm fairly confident we've actually fixed the issue. I'm going to close this for now.