hexagon-oss / openhardwaremonitor

Open Hardware Monitor - a tool for monitoring hardware performance. Includes support for various temperature sensors, disk I/O ratings and power consumption.
Mozilla Public License 2.0
188 stars 21 forks source link

Fixed multi-GPU NVIDIA memory reporting, fixed null ref in NVMeGeneric #28

Closed msawyer91 closed 2 years ago

msawyer91 commented 2 years ago

Fixed NVIDIA GPU Memory Stats Misreporting #1386 where multiple NVIDIA cards did not show the correct memory. Only the first GPU's memory was shown correctly; all other GPUs simply repeated the exact same details as the first.

This was done by referring to the NVDIA NVAPI and changing the method call from NvAPI_GetDisplayDriverMemoryInfo to NvAPI_GPU_GetMemoryInfo. The former requires a display handle, and in a computer with multiple GPUs that do not have attached displays (VERY common on mining rigs), the display handle value, while not null, is 0x00000000. Therefore the NVAPI was returning the same values as the first GPU. By calling NvAPI_GPU_GetMemoryInfo instead, which takes the GPU hardware handle as a parameter, the correct memory values are returned.

While testing this, I noticed a null reference exception being thrown while trying to generate a report. In NVMeGeneric, there is a line which prints out the info.IEEE array values. However, if info.IEEE is null, there was no null check and a NullReferenceException was thrown and the app would crash.

Below images show the issue as it exists in the alpa-2 build (as well as the 0.9.6 release of Open Hardware monitor, as well as the corrected "0.10.0-alpha3" build in this PR. Values are now represented as expected on the RTX 3060 (12GB) and RTX 2060 (6GB) for memory load, memory free, memory used, memory total.

0.10.0-alpha2/0.9.6 - GPU MEMORY DUPLICATION ISSUE PRESENT image

PULL REQUEST 0.10.0-alpha3 - GPU MEMORY SHOWN IS CORRECT image

msawyer91 commented 2 years ago

Hi @pgrawehr appreciate the feedback. I created the readme.md for the purpose of giving some context to my msawyer91/openhardwaremonitor repo. Unfortunately it was picked up in my pull request and I can see that being confusing. I deleted the readme.md.

pgrawehr commented 2 years ago

Thanks, this is looking good now. I'll merge it when I have time to give it a quick test (on existing hardware, I don't have any such powerful systems with multiple GPUs)

msawyer91 commented 2 years ago

No problem. Let me know if there's anything I can do to help. I am going to be on holiday for the next few days, however. But if you need someone to run test cases, and need screen recording or something similar to verify the test results, we can discuss that next week if you'd like.