Closed keithhand closed 6 months ago
This issue has been marked as stale because it has been open for 360 days with no activity. Please remove the stale label or comment or this issue will be closed in 5 days.
This issue was closed because it has been inactive for 365 days with no activity.
@keithhand @kaelanspatel do you know if this is still an issue?
I haven't tested since the initial finding, so I'm not sure
Transferred to features-bugs as this is not Helm chart specific.
@thomasvn seems mid priority but let's look to repro and potentially set up a QA cluster here to make sure it doesn't regress.
Historically it has been tough for us to set up QA clusters with GPUs because of the high cost of keeping them running, but maybe there's a chance to spin up/spin down here and run the test suite now.
If this issue only impacts the AWS vGPU plugin, I think we can probably disregard it, as it's fairly clear (although not openly stated that I can find) that this project has been put out to pasture. The more modern approach here is one of several GPU-sharing strategies/technologies officially developed by NVIDIA. If this issue is more general to GPUs in nature, we'd want to check it out.
I'm fairly confident we've actually fixed the issue. I'm going to close this for now.
Describe the bug
A user reported they noticed GPU costs on their cluster. After looking into the details of their environment, we noticed that they use the aws-virtual-gpu-device-plugin to manage their GPU devices on their cluster. I was able to reproduce the same issue by deploying an AWS GPU-supported node and deploying the controller. Before deploying the DaemonSet controller, I had valid values displaying for the underlying node GPU cost, but after deploying, my node_gpu_count metric emitted from /model/metrics is 0.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Idle should still be associated with the node to attribute total cost to an allocation correctly.
Screenshots
Namespace with GPU related to Negative Idle:
Node with no controller deployed:
Node deployed, then controller added:
Prometheus metrics corresponding with toggling the label for the controller DaemonSet:
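For anyone trying to repro without the screenshots, here is a rough sketch of how the node_gpu_count check described in the report could be scripted against a saved copy of the /model/metrics output. It only parses Prometheus text-format lines. The `node` label name, the sample node names, and the helper `zero_gpu_nodes` are assumptions made for illustration, not something confirmed by this issue.

```python
import re

# One Prometheus text-format sample: name{labels} value [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def zero_gpu_nodes(metrics_text, metric="node_gpu_count", node_label="node"):
    """Return the node names whose `metric` sample equals 0.

    `node_label` is an assumed label name; adjust it to whatever
    label the exporter actually uses to identify the node.
    """
    zero = []
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        m = SAMPLE_RE.match(line)
        if not m or m.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        if float(m.group("value")) == 0:
            zero.append(labels.get(node_label, "<unknown>"))
    return zero


# Hypothetical scrape output, made up for this example.
SAMPLE = """\
# TYPE node_gpu_count gauge
node_gpu_count{node="gpu-node-a"} 1
node_gpu_count{node="gpu-node-b"} 0
"""

print(zero_gpu_nodes(SAMPLE))  # prints ['gpu-node-b']
```

Running this before and after adding the controller DaemonSet should show whether the GPU node moves into the zero list, which is the behavior reported above.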