nickbabcock / OhmGraphite

Expose hardware sensor data to Graphite / InfluxDB / Prometheus / Postgres / Timescaledb

Wrong GPU total memory reported #371

Closed nitroxis closed 1 month ago

nitroxis commented 1 year ago

Hi, I just noticed that OhmGraphite reports an incorrect total GPU memory size when there are multiple GPUs.

OhmGraphite's prometheus endpoint reports:

ohm_gpunvidia_bytes{hardware="NVIDIA GeForce GTX 1060 6GB",sensor="GPU Memory Total",hw_instance="0"} 11811160064
ohm_gpunvidia_bytes{hardware="NVIDIA GeForce RTX 2080 Ti",sensor="GPU Memory Total",hw_instance="1"} 11811160064

whereas the "1060 6GB" should have 6 GB, as the name implies. The value shows up correctly in LibreHardwareMonitor itself, so LibreHardwareMonitor does not appear to be the cause: [screenshot]

nickbabcock commented 1 year ago

Thanks for the bug report! Couple of questions to help narrow in on the problem:

nitroxis commented 1 year ago

The screenshot was made with the current release version from their GitHub (v0.9.2). I've checked again - it is indeed all three GPU Memory ... metrics that are the same. Here is the full list of ohm_gpunvidia_bytes:

# HELP ohm_gpunvidia_bytes Metric reported by open hardware sensor
# TYPE ohm_gpunvidia_bytes gauge
ohm_gpunvidia_bytes{hardware="NVIDIA GeForce RTX 2080 Ti",sensor="GPU Memory Free",hw_instance="0"} 11092885504
ohm_gpunvidia_bytes{hardware="NVIDIA GeForce GTX 1060 6GB",sensor="D3D Shared Memory Used",hw_instance="1"} 155197440
ohm_gpunvidia_bytes{hardware="NVIDIA GeForce GTX 1060 6GB",sensor="GPU Memory Used",hw_instance="1"} 717225984
ohm_gpunvidia_bytes{hardware="NVIDIA GeForce GTX 1060 6GB",sensor="GPU Memory Total",hw_instance="1"} 11811160064
ohm_gpunvidia_bytes{hardware="NVIDIA GeForce RTX 2080 Ti",sensor="GPU Memory Total",hw_instance="0"} 11811160064
ohm_gpunvidia_bytes{hardware="NVIDIA GeForce RTX 2080 Ti",sensor="GPU Memory Used",hw_instance="0"} 717225984
ohm_gpunvidia_bytes{hardware="NVIDIA GeForce GTX 1060 6GB",sensor="GPU Memory Free",hw_instance="1"} 11092885504
ohm_gpunvidia_bytes{hardware="NVIDIA GeForce GTX 1060 6GB",sensor="D3D Dedicated Memory Used",hw_instance="1"} 1205383168
ohm_gpunvidia_bytes{hardware="NVIDIA GeForce RTX 2080 Ti",sensor="D3D Dedicated Memory Used",hw_instance="0"} 489439232
ohm_gpunvidia_bytes{hardware="NVIDIA GeForce RTX 2080 Ti",sensor="D3D Shared Memory Used",hw_instance="0"} 110133248
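
A quick way to spot this symptom in a scrape is to group samples by sensor and flag any sensor whose value is identical across hardware instances. The following is only a minimal Python sketch and not part of OhmGraphite: the names `parse_samples` and `find_duplicated_sensors` are made up for illustration, and the regex handles just the simple labelled lines shown above, not the full Prometheus exposition format.

```python
import re
from collections import defaultdict

# Matches simple sample lines such as: name{k="v",k2="v2"} 123
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Four lines copied from the output above.
SAMPLE = '''
ohm_gpunvidia_bytes{hardware="NVIDIA GeForce GTX 1060 6GB",sensor="D3D Shared Memory Used",hw_instance="1"} 155197440
ohm_gpunvidia_bytes{hardware="NVIDIA GeForce GTX 1060 6GB",sensor="GPU Memory Total",hw_instance="1"} 11811160064
ohm_gpunvidia_bytes{hardware="NVIDIA GeForce RTX 2080 Ti",sensor="GPU Memory Total",hw_instance="0"} 11811160064
ohm_gpunvidia_bytes{hardware="NVIDIA GeForce RTX 2080 Ti",sensor="D3D Shared Memory Used",hw_instance="0"} 110133248
'''


def parse_samples(text):
    """Yield (metric_name, labels, value) for each labelled sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simple regex cannot handle
        labels = dict(LABEL_RE.findall(match.group("labels")))
        yield match.group("name"), labels, float(match.group("value"))


def find_duplicated_sensors(text):
    """Return {(metric, sensor): value} where every hardware instance
    (two or more) reports exactly the same value."""
    by_sensor = defaultdict(dict)
    for name, labels, value in parse_samples(text):
        by_sensor[(name, labels.get("sensor"))][labels.get("hw_instance")] = value
    return {
        key: next(iter(values.values()))
        for key, values in by_sensor.items()
        if len(values) > 1 and len(set(values.values())) == 1
    }


if __name__ == "__main__":
    for (metric, sensor), value in find_duplicated_sensors(SAMPLE).items():
        print(f"{metric} / {sensor}: identical across GPUs ({value:.0f})")
```

Identical values are only a hint: two GPUs of the same model would legitimately report the same total, so the check is most useful for mixed setups like this one.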

nitroxis commented 1 year ago

The other ohm_gpunvidia_... metrics appear to be working correctly.

nickbabcock commented 1 year ago

One thing you can try is the nightly build of OhmGraphite built with LibreHardwareMonitor 0.9.2 (https://github.com/nickbabcock/OhmGraphite/suites/11729082221/artifacts/610719590)

If that doesn't fix things, are other sensors like load, wattage, and fans duplicated too?

nitroxis commented 1 year ago

The nightly build still has this issue.

nitroxis commented 1 year ago

Strange, if I compile it myself and launch it in the debugger, it works fine.

nickbabcock commented 1 year ago

Strange, if I compile it myself and launch it in the debugger, it works fine.

When you compile and run OhmGraphite yourself, it works!? 😨

That completely stumps me.

Copied below is a bit of an investigation that I went on, but if compiling it yourself works, then it can be ignored.


My best guess is that there's a difference in how LibreHardwareMonitor and OhmGraphite refresh sensors. OhmGraphite refreshes all hardware whenever it needs to send out new metrics. If LibreHardwareMonitor instead batches the refresh and UI update for each hardware component before moving on to the next component, it would sidestep the possibility of a hardware sensor relying on a global value.

I feel like this is partially corroborated by the fact that it is only the memory sensors that use a display handle instead of a physical handle: https://github.com/LibreHardwareMonitor/LibreHardwareMonitor/blob/6066b1a79737bb7e23217f0d2bb1b14fab04b9aa/LibreHardwareMonitorLib/Hardware/Gpu/NvidiaGpu.cs#L967
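
To make the guess concrete, here is a toy Python model of that failure mode. It is not LibreHardwareMonitor or OhmGraphite code: `ToyGpu`, its shared class attribute, and both helper functions are hypothetical, and the 6 GiB figure for the 1060 is an assumed value. It only shows how a sensor that reads shared state can report the last-refreshed device's number when everything is refreshed before anything is read.

```python
class ToyGpu:
    """Toy GPU whose memory sensor reads state shared by all instances."""

    # Hypothetical "global value": one slot shared by every ToyGpu.
    _last_queried_total = None

    def __init__(self, name, real_total_bytes):
        self.name = name
        self.real_total_bytes = real_total_bytes

    def update(self):
        # Refreshing overwrites the shared slot with this GPU's number.
        ToyGpu._last_queried_total = self.real_total_bytes

    def read_total(self):
        # Reading returns whatever the most recent update() left behind.
        return ToyGpu._last_queried_total


def refresh_all_then_read(gpus):
    """Refresh every device first, then read every sensor."""
    for gpu in gpus:
        gpu.update()
    return {gpu.name: gpu.read_total() for gpu in gpus}


def refresh_and_read_each(gpus):
    """Refresh one device and read it before moving to the next."""
    readings = {}
    for gpu in gpus:
        gpu.update()
        readings[gpu.name] = gpu.read_total()
    return readings


if __name__ == "__main__":
    gpus = [
        ToyGpu("GTX 1060 6GB", 6 * 1024**3),  # assumed real total
        ToyGpu("RTX 2080 Ti", 11811160064),   # value from the report
    ]
    print(refresh_all_then_read(gpus))   # both entries show the 2080 Ti total
    print(refresh_and_read_each(gpus))   # each entry shows its own total
```

In this model only the sensors backed by the shared slot are affected, which would be consistent with just the memory metrics being wrong, but it remains a guess.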

nickbabcock commented 1 year ago

I wonder whether the problem also shows up if you execute:

dotnet publish -c Release .\OhmGraphite\

and run the resulting zip.

nitroxis commented 1 year ago

I've looked into it a bit more and it appears that it is related to whether the program runs as a normal process or as a service. Running it with OhmGraphite.exe run yields correct results, running it as a service (e.g. OhmGraphite.exe start) yields the wrong results.

nickbabcock commented 1 year ago

Thanks for looking into it further. This issue looks like a variant of #153. There are various possible solutions within that thread (like https://github.com/nickbabcock/OhmGraphite/issues/153#issuecomment-674433563), though the user ultimately went with the workaround in https://github.com/nickbabcock/OhmGraphite/issues/153#issuecomment-706311993. Their issue involved an AMD GPU, not Nvidia, yet seems eerily similar.

nitroxis commented 1 year ago

It might be related, though it is strange that all other NVIDIA metrics appear to be working fine and only those 3 are wrong. If it were some kind of permission/session problem, I would have expected either all metrics to work or none (like in the linked issue). Why only the memory metrics, and only for one GPU? Checking the "Interact with desktop" checkbox makes no difference for me. I don't really know how to investigate this further.

roy-spark commented 1 year ago

Are these problems still persisting in 0.3x? (The issues are not closed.)

What are the workarounds in that case?

roy-spark commented 1 year ago

I switched from running OhmGraphite as a service to "OhmGraphite run", and it finally reported GPU load percentage. (It was constantly zero when running in service mode.)

nickbabcock commented 1 year ago

Thanks for confirming. Looks like this issue is decently widespread. I'm not sure what causes the issue or what the fix is. Now that OhmGraphite recently started targeting .NET 6, it looks like there is an easy and official way to create Windows services that doesn't rely on a third-party library: https://learn.microsoft.com/en-us/dotnet/core/extensions/windows-service?pivots=dotnet-6-0

I might poke at it and see if it's viable and fixes issues.

nickbabcock commented 9 months ago

Since OhmGraphite v0.31, the old Windows service library has been replaced with the newer, official Microsoft implementation. Let me know if this fixes the situation.

nickbabcock commented 1 month ago

Let me know if this is still an issue.