tobyweston / temperature-machine

Data logger for multiple DS18B20 temperature sensors on one or more machines
Apache License 2.0

Avoid spikes in temperature data #9

Closed tobyweston closed 6 years ago

tobyweston commented 7 years ago

I'm not sure why it started happening but I'm seeing extreme spikes in temperature data.

screen shot 2017-04-12 at 17 02 30

Potentially ignore volatile readings or check the CRC when reading (as suggested in the Arduino forum).

tobyweston commented 7 years ago

CRC is already checked when parsing the file. A failed CRC check should result in a -\/ which will end up as an error in the log.

class ParserTest extends Specification {
  "Fails to extract temperature with failed CRC check" >> {
    val output =
      """|72 01 4b 46 7f ff 0e 10 57 : crc=57 NO
         |72 01 4b 46 7f ff 0e 10 57 t=23125
      """.stripMargin
    Parser.parse(output) must be_-\/.like {
      case error: CrcFailure => ok
    }
  }
}

case class RecordTemperature(host: Host, input: TemperatureReader, output: TemperatureWriter, error: PrintStream = System.err) extends Runnable {
  def run(): Unit = {
    input.read.fold(error.println, temperatures => {
      output.write(Measurement(host, now(), temperatures)).leftMap(error.println)
    })
  }
}

case class CrcFailure() extends Error("CRC failure, this could be caused by a physical interruption of signal due to shorts, a newly arriving 1-Wire device issuing a 'presence pulse' or gremlins.")

tobyweston commented 7 years ago

Not quite as high as the above, but the datasheet says 85 is the power-on reset value of the temperature register. Probably worth filtering out readings of exactly 85 degrees.

screen shot 2017-06-12 at 18 41 53
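
The idea of filtering out 85 degree readings could be sketched roughly as follows. This is only an illustration: `PowerOnResetFilter` and `withoutPowerOnReset` are hypothetical names, not part of temperature-machine, and the sketch assumes readings are already parsed into degrees Celsius.

```scala
// Hypothetical helper, not part of temperature-machine.
object PowerOnResetFilter {
  // The DS18B20 temperature register holds +85 degrees C after power-on,
  // before any conversion has completed.
  val PowerOnReset: Double = 85.0

  // Keep only the readings that are not exactly the power-on reset value.
  def withoutPowerOnReset(readings: List[Double]): List[Double] =
    readings.filterNot(_ == PowerOnReset)
}
```

For example, `PowerOnResetFilter.withoutPowerOnReset(List(21.5, 85.0, 21.6))` keeps only 21.5 and 21.6. The trade-off is that a genuine reading of exactly 85 degrees would be dropped as well.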

tobyweston commented 6 years ago

If enabled, ignore temperatures with a +/-25% fluctuation. Enable and configure by adding -Davoid.spikes=30 to your server startup command.

Quaxo76 commented 6 years ago

I'm adding, as requested, an image of a spike with the corresponding csv file. The spike was at 23:09, but due to the time code difference (I suppose) on the csv file it shows at 22:09. By the way, how can I fix this time code difference?

temperature2

temperatures.csv.zip

Quaxo76 commented 6 years ago

One more, and this is weirder.

I have two machines, a server named PiZero, and a client named PiOld. Until now, only PiOld had been subject to the spikes, so I installed the "temporary fix" as described (both are running the latest version). The server had never shown any spike, so I did not do the fix there.

Now I got this: a spike on BOTH machines at exactly the same time. But on the server (which doesn't have the fix) the reading went down to zero. On the client, at EXACTLY the same frame, it went down only about 15% (so it's not wrong that the fix didn't catch it).

Now I'll apply the fix to the server too, but how can it be that 4 different sensors all had the problem on two separate machines at exactly the same moment? It must be something server-side... I'm attaching the csv and a screenshot.

temperature3

temperatures.csv.zip

Cristian

tobyweston commented 6 years ago

Sorry, I should have made it clearer: the (possible) fix only runs on the server.

The server stores the values, so the fix works by comparing each received value with the previous one to work out the % difference and discarding it if the difference is too big. I didn't consider having the client skip sending a value when the same is true there.
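
A minimal sketch of that kind of check, assuming the percentage is taken relative to the previous value. The names `SpikeCheck`, `percentageDifference` and `isSpike` are made up for illustration, and the real implementation may differ.

```scala
// Hypothetical sketch; names and exact formula are assumptions.
object SpikeCheck {
  // Absolute change from `previous` to `current`, as a percentage of `previous`.
  // If `previous` is 0 the result is Infinity (or NaN when both are 0).
  def percentageDifference(previous: Double, current: Double): Double =
    math.abs(current - previous) / math.abs(previous) * 100

  // A reading counts as a spike when it moves more than `thresholdPercent`
  // away from the previously stored value.
  def isSpike(previous: Double, current: Double, thresholdPercent: Double): Boolean =
    percentageDifference(previous, current) > thresholdPercent
}
```

With a 15% threshold, `SpikeCheck.isSpike(18.0, 40.0, 15.0)` is true (a change of about 122%), while `SpikeCheck.isSpike(20.0, 21.0, 15.0)` is false (about 5%).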

Not sure if it's meaningful, but I see similar spikes across a spread of machines at about the same time. I have like 5 running and anywhere between 1 and all 5 may spike at the same time. That's why I kind of started thinking about environmental problems (like maybe my microwave freaking the wifi signal out).

Quaxo76 commented 6 years ago

Ah ok, I thought the value would just be discarded without being sent. I'll install the fix on the server, though of course this won't catch minor spikes. Since the spike also happened on the server, how can wifi disruptions influence the reading? It should be local... And anyway, with this morning's spike, no one was using a microwave or other powerful appliances here, and I also live in a pretty isolated area.

tobyweston commented 6 years ago

You can always lower the threshold to say 15% to try and catch the minor spikes. You might need to experiment a little.

Good points. I'm struggling to debug it atm so reaching a little :wink:

tobyweston commented 6 years ago

I'm assuming that in your most recent example, there's nothing in the log at around 19/01/2018 07:56 ?

Quaxo76 commented 6 years ago

Just thinking... To debug it, maybe you could set a low threshold (like 5%) and when a spike occurs, the machine could also log somewhere the actual scratchpad from the sensor, and any intermediate processing that was done on that data? I don't believe the sensors actually read a spike, so something must happen along the path that the raw data follows, so logging all intermediate points might show where the problem is...

tobyweston commented 6 years ago

:+1:

Quaxo76 commented 6 years ago

Nothing on the log around that time. I have many errors at other times, mainly "CRC error" and "error in RequestLoop()"... But I've been trying several times to recompile this morning, so that may have slowed down the systems to the point of unresponsiveness...

By the way, I just had another spike. Do you want me to keep sending screenshots and CSVs?

tobyweston commented 6 years ago

Curious about the CRC error, can you share?

The RequestLoop error is some oddity with the underlying HTTP library I use. I've raised a request with the library and am experimenting with an upgrade. Unfortunately, that's taking time because it's a big, bumpy upgrade path to get to their latest library.

Maybe hold on the CSVs for a bit.

Quaxo76 commented 6 years ago

Here's the current log from the server. There have been spikes at 7:55, 10:03 and 11:41 (though the recorded times might be off by an hour due to the time zone difference).

ServerLogs.txt

Quaxo76 commented 6 years ago

Not sure if this is meaningful, but spikes here tend to happen when I'm doing something else with the pi (mainly compiling). I've left it alone for about 10 hours, and had no spike whatsoever; after coming home I updated the source and recompiled, and got like 10 spikes in an hour...

Cristian

Quaxo76 commented 6 years ago

OK, with the help of the new document, I installed the "temporary fix". Installation was successful, as indicated by the log:

Sun 21-Jan-2018 11:35:26.080 [main] INFO Starting temperature-machine (server mode)...
Sun 21-Jan-2018 11:35:26.270 [main] INFO RRD initialising for 'PiZero', 'PiOld' (with up to 5 sensors each)...
Sun 21-Jan-2018 11:35:27.404 [main] INFO create "/home/pi/.temperature/temperature.rrd" --version 2 --start 1516530926 --step 30 DS:PiZero-sensor-1:GAUGE:35:U:U DS:PiZero-sensor-2:GAUGE:35:U:U DS:PiZero-sensor-3:GAUGE:35:U:U DS:PiZero-sensor-4:GAUGE:35:U:U DS:PiZero-sensor-5:GAUGE:35:U:U DS:PiOld-sensor-1:GAUGE:35:U:U DS:PiOld-sensor-2:GAUGE:35:U:U DS:PiOld-sensor-3:GAUGE:35:U:U DS:PiOld-sensor-4:GAUGE:35:U:U DS:PiOld-sensor-5:GAUGE:35:U:U RRA:AVERAGE:0.5:1:2880 RRA:AVERAGE:0.5:120:168 RRA:AVERAGE:0.5:240:360
Sun 21-Jan-2018 11:35:32.664 [main] INFO Starting Discovery Server, listening for 'PiZero', 'PiOld'...
Sun 21-Jan-2018 11:35:32.838 [temperature-machine-discovery-server-1] INFO Listening for broadcast messages...
Sun 21-Jan-2018 11:35:33.214 [main] INFO Monitoring sensor file(s) on 'PiZero'
    /sys/bus/w1/devices/28-041500a0f0ff/w1_slave
    /sys/bus/w1/devices/28-031500e9feff/w1_slave

Sun 21-Jan-2018 11:35:35.373 [main] INFO Temperature spikes greater than +/-15% will not be recorded
Sun 21-Jan-2018 11:35:48.482 [main] INFO HTTP Server started on http://127.0.1.1:11900

But I still see spikes. Even major ones, up to above 400 degrees. Attaching screenshot, csv and log in case it helps. This one is unusual in that only one sensor got it wrong, the readings from the other 3 seem ok.

temperature4

ServerLogs_2.txt

temperatures (1).csv.zip

Quaxo76 commented 6 years ago

I just realized that every spike I've ever seen only affects one single frame, i.e. one reading is wrong and is surrounded on both sides by good values. So, since the "temporary fix" does not appear to be working, at least for me, have you ever thought of checking the readings and discarding any reading that is different from the surrounding ones, if the ones before and after show the same value (or almost the same value)?
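
The neighbour comparison suggested here might look something like the sketch below. It is not how temperature-machine works; `NeighbourFilter`, `dropIsolatedSpikes` and the `tolerance` parameter (in degrees) are hypothetical, and the sketch assumes the readings for one sensor are available as an ordered sequence.

```scala
// Hypothetical sketch of the suggestion; not part of temperature-machine.
object NeighbourFilter {
  // Drops a reading when the readings before and after it agree with each
  // other (within `tolerance` degrees) but the reading itself differs from
  // both by more than `tolerance`. The first and last readings have only
  // one neighbour, so they are always kept.
  def dropIsolatedSpikes(readings: Vector[Double], tolerance: Double): Vector[Double] =
    if (readings.length < 3) readings
    else {
      val middle = readings.sliding(3).collect {
        case Vector(before, current, after)
          if !(math.abs(before - after) <= tolerance &&
               math.abs(current - before) > tolerance &&
               math.abs(current - after) > tolerance) => current
      }.toVector
      readings.head +: middle :+ readings.last
    }
}
```

For instance, `NeighbourFilter.dropIsolatedSpikes(Vector(18.0, 18.1, 400.0, 18.2, 18.1), 1.0)` drops only the 400.0 reading. One consequence of this design is that a reading can only be judged once the next one has arrived, so recording would lag by one sample.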

Quaxo76 commented 6 years ago

I was experimenting with the fix (since I still get spikes occasionally) and I thought of something. If a change is classified as a spike when it's more than X% change, is that a change in degrees Celsius or Kelvin? If it's Celsius, what happens when the temperature is around 0C? I.e. if the temp goes from 0.1 to 0.2 C, it's obviously not a spike, but it would be read as a 100% increase, and then tagged as a spike, no? And by the way, without the fix installed, I get very large spikes (up to hundreds of degrees); with the fix, I still get spikes but of a lower value (but still over the set threshold, i.e. if the temp is 18C I get a spike to 40C...)
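
The concern about percentages near 0C can be made concrete with a small calculation. This is an illustrative sketch only: `RelativeChange` and its methods are hypothetical names, and it assumes the percentage is computed relative to the previous reading.

```scala
// Hypothetical illustration of the Celsius vs Kelvin question.
object RelativeChange {
  val KelvinOffset: Double = 273.15

  // Percentage change relative to the previous reading, values in Celsius.
  def percentCelsius(previous: Double, current: Double): Double =
    math.abs(current - previous) / math.abs(previous) * 100

  // The same change, computed after converting both readings to Kelvin.
  def percentKelvin(previous: Double, current: Double): Double =
    percentCelsius(previous + KelvinOffset, current + KelvinOffset)
}
```

Going from 0.1 to 0.2 C, `RelativeChange.percentCelsius(0.1, 0.2)` comes out at about 100%, whereas `RelativeChange.percentKelvin(0.1, 0.2)` is roughly 0.04%. A percentage threshold on Celsius values therefore becomes much stricter as readings approach zero.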

tobyweston commented 6 years ago

Thanks for the thoughts.

It's very difficult to gauge cause and effect here. I'm planning to tidy a few other things up just in case they are causing problems (like making sure no exceptions could be thrown causing weird behaviour), then I think I'm going to log every measurement and cross reference that to what's in the RRD database. That way I can see if the cause is with RRD. I'm also thinking of changing the way it's logged to RRD as there's a slim chance more clients could increase the odds of spikes... not sure yet.

To answer your question though, everything is in centigrade and the comparison is between the last value and current. You can see it in the CSV and the % difference.

tobyweston commented 6 years ago

Closing for now as recent work seems to have fixed this... at least, I haven't seen it for a while. If I see it again or it's reported, I'll reopen.