nickbabcock / OhmGraphite

Expose hardware sensor data to Graphite / InfluxDB / Prometheus / Postgres / Timescaledb
Other
426 stars 38 forks source link

Aqua Computer Quadro data read inconsistently #440

Open sgade opened 8 months ago

sgade commented 8 months ago

I added an Aqua Computer Quadro to my setup and noticed that both its temperature reading and reported fan RPM values were incorrect in Grafana. I checked with Libre Hardware Monitor and FanControl as well as HWInfo64, all report correct values while OhmGraphite seems incorrect.

The observed behavior is very confusing since the reported value seems to lag behind its actual value. Example: The water temperature starts out correct at ~22 degrees C and stays at that level for a few minutes while the water is actually already heating up to say >30° C. Then, while all other programs are reporting the water to cool down to ~28°, it is reported at >30° in Grafana. At some point, it catches up but only keeps it at 30° for a while, while it actually already is at 27-28°. Similar behavior can be seen for the RPM reporting. Sometimes, values randomly jump or drop to zero.

In the following image, you can see the water pump's speed (yellow) and the reported flow (blue). Other tools have it at zero for a few minutes already (which might be a problem on my side), but OhmGraphite still reports a different value for a while, then jumps up and drops to zero.

image

Now, since other programs are agreeing on the same values and restarting the OhmGraphite service does trigger jumps to "correct" values, it seems like some sort of wrong reading of the data? Maybe not reading enough of some buffer? Since it's using LibreHardwareMonitor under the hood, I'm not sure how that would be when their app works fine.

Happy to test out things and report back!

sgade commented 8 months ago

Another thing I just noticed after a fresh boot: While the water was apparently at around 23° C, OhmGraphite was reporting 32° C. Might this be a wrong reading of the bits, maybe because of an old version or something similar?

nickbabcock commented 8 months ago

Can you try out the build from #441: https://github.com/nickbabcock/OhmGraphite/actions/runs/8214992430/artifacts/1311727196

It adds trace logging of the sensors that will get reported downstream. You'll need to swap the level of the file logger to trace:

    <logger name="*" minlevel="Trace" writeTo="file" />

Then tail OhmGraphite.log for the logged values that you are interested in. Let me know what you observe.

sgade commented 8 months ago

Will do!

Just for reference, this is a jump (red line is the coolant) I just observed after the second boot. So it stays across multiple reboots but at some point jumps to more closely align with the actual value (now reporting 22 instead of 23 C).

image

sgade commented 8 months ago

I looked at the logs for this sensor "Temperatuer 1" on my Quadro, and while it started out quite right, it slowly gets out of sync with the reporting from my other tools.

I am currently sitting at 26.8° C water, while OhmGraphite has 25.6, now 25.62 in its logs and reporting:

2024-03-09 15:42:52.6059|TRACE|OhmGraphite.SensorCollector|Value: ID "/\\?\hid#vid_0c70&pid_f00d&mi_01#a&114aadff&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}/temperature/0", sensor: "Temperature 1", value: 25.62, sensor type: Temperature, hardware: "QUADRO", hardware type: Cooler, sensor index: 0, hardware index: "\\?\hid#vid_0c70&pid_f00d&mi_01#a&114aadff&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}"

Two minutes later, at 27 degrees it is reporting 25.67.

I am running LibreHardwareMonitor.exe on the side which is reporting the same 27 degrees, as is HWInfo64.


Another data point: 3 tools (FanControl, LibreHardwareMonitor, HWInfo) reporting 27.4, OhmGraphite reading 25.8.

nickbabcock commented 7 months ago

Hmm I wonder if somehow previous reported values are influencing current ones. I pushed commit that makes it a bit more explicit. I don't /think/ it'll change anything but perhaps worth a shot: https://github.com/nickbabcock/OhmGraphite/actions/runs/8215332382/artifacts/1311777388

sgade commented 7 months ago

Sorry for the late test. I just tested it. Stable version reported 28°C, the one you linked lasted jumped down to 21.8° but now that water is at 22.7 it still says 22.0, lagging behind in reporting. So nothing changed, unfortunately.

I can see the same values being reported in the log. Can I look somewhere else for more details? Its only for the sensors connected to the Quadro, so maybe there is something special with that?

nickbabcock commented 7 months ago

Thank you for your patience. I am equally stumped. But a couple of things to try:

One is to set the update interval in either LibreHardwareMonitor or OhmGraphite to match each other. By default, LibreHardwareMonitoris every second while OhmGraphite is once every 5 seconds. Maybe try changing the OhmGraphite update interval to 5 second (or if you are using prometheus, update the interval there).

Another test is to run OhmGraphite interactively with OhmGraphite.exe run this way we can see if running as a windows service is somehow getting in the way.

sgade commented 7 months ago

I was actually running the test versions interactively but had already played around with other settings for the service you suggested for simular problems, like "allow service to interact with desktop". Nothing helped.

Now, upon setting the interval to "1" (down from "5"), Grafana started reporting correct results! There are minimal differences that are probably due to the fact that my dashboard is only reloading every 5 seconds anyway.

While the data is up to date, I notice that the update interval keeps on increasing. It starts with saying "Sent 197 metrics in 70ms". This time slowly increases with each reporting, and after around 20 minutes it was at around 9770ms. Might this be the/a bug? Could also be a problem on my slow server but hadn't seen such behaviour before with the 5 seconds reporting interval. Update: After 25 minutes I'm at 14500ms and increasing.

Now, I would like to not have that much data and keep my reporting interval at 5 seconds. Since it is only affecting the Quadro, is this a fix you can do in OhmGraphite? Thinking about it, maybe the Quadro is sending data for every seconds and if its not picked up, it might just keep piling up?

nickbabcock commented 7 months ago

Now, upon setting the interval to "1" (down from "5"), Grafana started reporting correct results!

This is excellent news! We have found the culprit. Nicely done.

Since it is only affecting the Quadro, is this a fix you can do in OhmGraphite?

Polling hardware at different intervals is not currently supported. It is technically possible to solve it at the OhmGraphite level, but I'm wary of the time effort needed to workaround this bugs.

Thinking about it, maybe the Quadro is sending data for every seconds and if its not picked up, it might just keep piling up?

I'm not sure how the Quadro works under the hood, so your guess is as good as mine.

What I would do is report this issue upstream in LibreHardwareMonitor (I'm assuming you can reproduce the issue if you change the LibreHardwareMonitor refresh rate to something like 5 seconds). And tag the people who wrote the feature initially to see if they have any insights on how this problem can be resolved. https://github.com/LibreHardwareMonitor/LibreHardwareMonitor/pull/938

Update: After 25 minutes I'm at 14500ms and increasing.

This is concerning. What backend are you using? Hopefully the machine it's running on is something underpowered 😄

Most of the metrics sinks are well optimized to send metrics in batches, but if something still seems fishy, let me know.

sgade commented 7 months ago

I'm assuming you can reproduce the issue if you change the LibreHardwareMonitor refresh rate to something like 5 seconds

Even better. I set it to 5 seconds in the LibreHardwareMonitor UI, then increased and decreased the actual temperature value (as reported by FanControl) from 29 up to 32, and back to 29. No changes in the UI. You can, however, set the refresh interval to 250ms which works like a "fast-forward" in this case and I was able to observe the increase and decrease after the fact in a sort of timelapse! 😆

Hopefully the machine it's running on is something underpowered

Its a VM with an SSD and enough free capacity in space and CPU. My reporting didn't say anything about any spikes and when looking at the time for 5 second report intervals, it says at around 30-40 ms constantly. Might keep this in mind but my first prio is to report the error to upstream.

I had already looked at the PR before but couldn't find any documentation on the reverse engineering of the protocol, so maybe they know more.