tobyweston / temperature-machine

Data logger for multiple DS18B20 temperature sensors on one or more machines
Apache License 2.0
67 stars 22 forks source link

Client progressively dropping #58

Closed Quaxo76 closed 6 years ago

Quaxo76 commented 6 years ago

I don't know if this is a real issue or something related to the Pi environment. It has happened to me twice. Basically, while the server sensors keep working fine, the client ones start missing more and more readings, see attached graph. One reboot didn't fix it (I rebooted where there's a vertical line), while another reboot fixed it (both the server and client were rebooted). I should mention that while the problem was present, I didn't even get the numeric temperature reading. But the wifi connection to the client was good: ssh sessions were rock solid, and "wavemon" showed a perfectly usable signal level... And reading the sensor data via command line through ssh worked fine.

temperature-1-days_prob

tobyweston commented 6 years ago

I see a lot on my setup and had assumed it was flakey network connections from the Pi (I mostly have wireless adapters), so it's interesting to hear your connections seem solid. I always see a numeric connection though. Did you see an error on the page (for the temperatures?).

Did you see a lot of errors in the log like this:

ERROR UnexpectedError(Failed to PUT temperature data to http://10.0.1.10:11900/temperature, response was 502 Bad Gateway: Right(Error in RRD Bad sample time: 1518039320. Last update time was 1518039320, at least one second step is required))

It happens when more than one sample comes in (although in theory, this shouldn't really happen) and RRD's resolution is a minimum of 1 second and thinks they clash. I'm going to change the whole model to prevent this at some point (move to a more staged event driven model).

Short answer is I don't know why it does this. If you do a reboot, I'd reboot them all (or experiment with just the server first) and always have a look in the logs.

Quaxo76 commented 6 years ago

Yes, me too, initially thought it was the wifi. Until I realized that I was still doing all my maintenance (and sensor sanity checks) via wifi... :) I didn't see an error on the temperatures page, just only the server's temps were listed. The little program "wavemon" is great for checking, live, the wifi status. I remember seeing the error you mention, in the past. But sadly, I forgot to save the logs before rebooting, and after rebooting, they were wiped. I remember that the server's log didn't have many errors logged, and not at the times where the dropout is. And I totally forgot to check the client's logs...

Quaxo76 commented 6 years ago

I have an update... I just had the issue reappear again, and I noticed that every single time that it happened, was when the temperature of a sensor of the client had gone below 1 degree. Is it possible that the system can't cope with temperatures starting with zero (or negative)?

Quaxo76 commented 6 years ago

Is it possible that I'm experiencing what I feared could happen, here https://github.com/tobyweston/temperature-machine/issues/9#issuecomment-363933288 ? If the temp reading outside is, say, 0.5 degrees, and on the next reading it's 0.4, that's 20% less, and it would trigger the spike-rejecting code (I have it set to 10%). Could it be this? The client keeps being "healthy", and the Wi-Fi connection is solid. If this is the issue, could the spike fix be changed so it reacts not to a percentage difference, but to an absolute difference of X degrees?

Then again: when the spike fix checks a temperature, I suppose it checks it against the last known good one? If so, it's obvious that once this issue is triggered, and the outside temp starts going up again, then every single reading will be discarded, because the reading will be to high... Does this make sense?

temperature-1-days

tobyweston commented 6 years ago

It could be. There may be a bug when it reaches -ve numbers but I suspect (hope) that's not the case.

You could test with some iced water though. When the spike blocking code kicks in, you get an entry in the log saying as much. So, if it were to kick in on a -ve temperature, you'd see something matching in the log. You should also see it recover when it comes back. At least in theory.

You can also try disabling the avoid.spikes feature.

I like the idea of abs degrees though rather than %. I prob should have done that in the first place. What's a sensible number to default, 5 degrees?

Quaxo76 commented 6 years ago

Hi Toby, I don't know if you saw the edit to my comment above. I don't think it could recover by itself... Imagine this scenario: the temp goes down, reaches a valid reading of 0.1, then the temp starts rising again. Say the next reading is 0.3, which is discarded because the jump is too high. Then the temp keeps increasing, but every new value is always discarded because they're all much higher than 0.1. So, either the system changes to abs temperatures (yes, 5 degrees could be a sensible threshold, but I'd like it to be configurable) or there should be a sort of "watchdog" system: I.e. If more than 10 readings are discarded in a row, it can't be 10 spikes, so the reference value is updated with the current reading... Actually, the best of both worlds world be to switch to configurable abs temperature differences AND add a watchdog... :)

tobyweston commented 6 years ago

Hmmm, just wrote a test and I think you're right.

Assuming you have 20% as you're limit, #9 will kick in:

  "Negative spikes values (single sensor), example from production" >> {
    val delegate = new StubWriter
    val writer = new ErrorOnTemperatureSpike(delegate, percentageSpike = 20)
    writer.write(Measurement(Host("example", None), Seconds(1), List(SensorReading("A", Temperature(0.5)))))
    writer.write(Measurement(Host("example", None), Seconds(2), List(SensorReading("A", Temperature(0.5)))))
    writer.write(Measurement(Host("example", None), Seconds(3), List(SensorReading("A", Temperature(0.4))))) must be_-\/.like {
      case e @ _ => println(e.message); ok
    }
    writer.write(Measurement(Host("example", None), Seconds(4), List(SensorReading("A", Temperature(0.4)))))
    delegate.temperatures must_== List(
      Measurement(Host("example", None), Seconds(1), List(SensorReading("A", Temperature(0.5)))),
      Measurement(Host("example", None), Seconds(2), List(SensorReading("A", Temperature(0.5))))
    )
  }

In the log, you should see something like this

An unexpected spike was encountered on:
 sensor(s)             : A
 previous temperatures : 0.5 °C
 spiked temperatures   : 0.4 °C (-20.0)
tobyweston commented 6 years ago

Yep, it's the last known good value that's used to check for % differences.

If you can decipher it, examples are in the following test.

https://github.com/tobyweston/temperature-machine/blob/master/src/test/scala/bad/robot/temperature/ErrorOnTemperatureSpikeTest.scala

My thinking was that if a % is breached, when the value comes within the normal range again, it'll carry on; when temperatures get back to normal, it should carry on. I'll have a look at switching to abs numbers and track against #9

tobyweston commented 6 years ago

I've done some commits for #9 to switch to absolute values rather than %. I haven't deployed it to my setup yet but if you want to test the bleeding edge, go ahead! I'll close this ticket pending your feedback.

Quaxo76 commented 6 years ago

Thank you Toby... Compiling now. I'll let you know how it behaves.

Cristian

Quaxo76 commented 6 years ago

Toby, I've let it run the whole night. There are a few issues. First, at every reboot, I get a huge spike (hundreds of degrees) so the graphs are unreadable. The spike removal fix is installed, as reported by the log, and set to 10 degrees. On the charts, I can only see the temperatures from the server and not the client. Even on the live graph, only the server's sensors are listed. But the numeric readings are present and correct for both server and client sensors.

The server's log:

Thu 15-Feb-201802:27:18.142[main]INFOStarting temperature-machine (server mode)...
Thu 15-Feb-201802:27:18.307[main]INFORRD initialising for 'PiZero', 'PiOld' (with up to 5 sensors each)...
Thu 15-Feb-201802:27:18.383[main]INFOOk
Thu 15-Feb-201802:27:18.406[main]INFOStarting Discovery Server, listening for 'PiZero', 'PiOld'...
Thu 15-Feb-201802:27:18.544[temperature-machine-discovery-server-1]INFOListening for broadcast messages...
Thu 15-Feb-201802:27:21.225[temperature-machine-discovery-server-1]INFOListening for broadcast messages...
Thu 15-Feb-201802:27:21.232[temperature-machine-discovery-server-1]INFOListening for broadcast messages...
Thu 15-Feb-201802:27:42.204[main]INFOMonitoring sensor file(s) on 'PiZero' 
    /sys/bus/w1/devices/28-041500a0f0ff/w1_slave
    /sys/bus/w1/devices/28-031500e9feff/w1_slave

Thu 15-Feb-201802:27:44.015[main]INFOTemperature spikes greater than +/-10 °C will not be recorded
Thu 15-Feb-201802:27:50.302[main]INFOHTTP Server started on http://127.0.1.1:11900
Thu 15-Feb-201804:07:14.384[temperature-reading-thread-1]ERRORUnexpectedError(Failed to PUT temperature data to http://127.0.1.1:11900/temperature, response was 502 Bad Gateway: Right(Error in RRD Bad sample time: 1518664034. Last update time was 1518664034, at least one second step is required))
Thu 15-Feb-201804:07:14.384[temperature-reading-thread-1]ERRORUnexpectedError(Failed to PUT temperature data to http://127.0.1.1:11900/temperature, response was 502 Bad Gateway: Right(Error in RRD Bad sample time: 1518664034. Last update time was 1518664034, at least one second step is required))

The client's log:

Thu 15-Feb-201802:27:31.251[main]INFOInitialising client 'PiOld' (with 2 of a maximum of 5 sensors)...
Thu 15-Feb-201802:27:32.254[main]INFOSending ping request to /255.255.255.255
Thu 15-Feb-201802:27:32.264[main]INFOSending ping request to /192.168.0.255
Thu 15-Feb-201802:27:32.268[main]INFOAwaiting discovery server...
Thu 15-Feb-201802:27:32.867[main]INFOServer discovered on 192.168.0.22, monitoring temperatures...
Thu 15-Feb-201802:27:48.699[main]INFOMonitoring sensor file(s) on 'PiOld' 
    /sys/bus/w1/devices/28-051690939aff/w1_slave
    /sys/bus/w1/devices/28-0315009723ff/w1_slave

Thu 15-Feb-201802:27:52.498[main]INFOHTTP Server started to serve logs on http://127.0.1.1:11900
Thu 15-Feb-201802:28:15.654[temperature-reading-thread-1]ERRORUnexpectedError(Failed attempting to connect to /192.168.0.22 to send Measurement(Host(PiOld,Some(+01:00)),1518658078 seconds,List(SensorReading(28-051690939aff,Temperature(1.5625)), SensorReading(28-0315009723ff,Temperature(25.1875))))

Error was: java.util.concurrent.TimeoutException: Client response header timeout after 8609 milliseconds.)
Thu 15-Feb-201802:28:15.654[temperature-reading-thread-1]ERRORUnexpectedError(Failed attempting to connect to /192.168.0.22 to send Measurement(Host(PiOld,Some(+01:00)),1518658078 seconds,List(SensorReading(28-051690939aff,Temperature(1.5625)), SensorReading(28-0315009723ff,Temperature(25.1875))))

Error was: java.util.concurrent.TimeoutException: Client response header timeout after 8609 milliseconds.)
Quaxo76 commented 6 years ago

It's weird, something is plotted from 4 sensors, but only two are shown in the list on the bottom: screenshot_20180215-103322

screenshot_20180215-103527

Quaxo76 commented 6 years ago

Toby, something very strange is happening with the latest build. I have a server and a client, with two sensors each. Basically, as shown on the screenshots above, only two sensors are plotted and listed at the bottom; they're listed as "PiZero" sensors, which is the server. But - here's the weird thing - their plotted values are NOT their actual value. What is plotted, is actually an average between one of the server's sensors, and one of the client's sensors. So, one of the two graph lines shows "(Client_A + Server_A)/2" and the other shows "(Client_B + Server_B)/2". The lines are also much more "noisy" and with many spikes, which are not caught by the fix code.

The numeric readings of the sensors are fine (I can see all four of the sensors, and the values are correct). I'm wondering: does the last fix break the RRD compatibility? Should I have erased it? Starman See the inconsistencies on the top part of the screenshot attached. Also see that on the graphs on the bottom, there's a "discontinuity" where I updated the software; and even the old readings from the other two sensors, have disappeared from the weekly and monthly charts... temperature5

tobyweston commented 6 years ago

Bit of an odd one but I think adding timezone to the Host class broke the RRD writing. Will post a fix asap.

tobyweston commented 6 years ago

This should be fixed now. If you can confirm, I'll close the issue.

Quaxo76 commented 6 years ago

Ok, the graph is back with all four sensors now. But the spikes are back (even though the fix is activated, according to the log). There is no mention of removed spikes in the log, and the temperature change during the spike is greater than the set threshold. Concerning the original issue, I couldn't test it yet, because the temperatures haven't gone so low yet, and the sensor itself is semi-permanently mounted in a non-easily-reachable place (bad idea, I know, but that's the best I could do). If the outside temperature stays high, I'll try experimenting with ice water with one of the indoor sensors... :)

Quaxo76 commented 6 years ago

The outside temperature just got (much) lower than zero again, and it appears that - as expected - there are no more dropouts. So, while the spike fix doesn't actually catch (all the) spikes, at least it doesn't put a whole client offline near 0C anymore... Thank you!

tobyweston commented 6 years ago

K, gonna close for now 👍

NB. The avoid spikes fix has been temp disabled (but we didn't think it worked well anyway).