rszimm / sprinklers_pi

Sprinkling System Control Program for the Raspberry Pi
GNU General Public License v2.0

Graceful failure uncertain #19

Closed sweeney4 closed 10 years ago

sweeney4 commented 10 years ago

I have been using sprinklers_pi for a few weeks now. It's great. However, on April 18 at about 5:30am, for some unknown reason, the sprinklers_pi program on the Raspberry Pi hung while the sprinklers were on. I can estimate the time because that's the time of the last log entry. The sprinkler station remained on for two hours before I luckily noticed the problem. Yikes... that could be expensive. I could still log into the Pi via SSH over Wi-Fi (that's how I viewed the log using sqlite3 and entered the reboot command), but the sprinklers_pi web interface was down until after the reboot (Apache remained up). Everything worked normally after the "sudo reboot".

One idea is to modify the sprinklers_pi code to toggle an unused Pi pin every minute (or so) and have an autonomous logic circuit watch for that "heart-beat" signal. If the heart-beat stops, the sprinkler power is cut.
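A minimal Python sketch of the software half of this heart-beat idea follows. It assumes the toggle is driven from the main control loop, so a hung loop stops the signal. The `set_pin` callback is a hypothetical stand-in for whatever GPIO call would drive the spare pin; it is not part of sprinklers_pi, and the external circuit that cuts sprinkler power is not shown.

```python
import time


class Heartbeat:
    """Toggle an output pin at most once per interval.

    set_pin is a hypothetical callback (an assumption, not sprinklers_pi
    code) that drives a spare GPIO pin high (True) or low (False).
    """

    def __init__(self, set_pin, interval_s=60.0, clock=time.monotonic):
        self._set_pin = set_pin
        self._interval_s = interval_s
        self._clock = clock
        self._level = False
        self._last_toggle = clock()

    def beat(self):
        """Call on every pass of the main loop.

        Toggles the pin once the interval has elapsed and returns True
        when a toggle happened. If the main loop hangs, beat() is never
        called again and the pin stops changing.
        """
        now = self._clock()
        if now - self._last_toggle < self._interval_s:
            return False
        self._level = not self._level
        self._set_pin(self._level)
        self._last_toggle = now
        return True
```

Calling `beat()` from the loop that schedules the zones, rather than from a separate timer thread, is deliberate: a separate thread could keep toggling even after the scheduling loop has stalled.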

BTW, if you use the manual control to turn on a station and then somehow lose communication with the Pi server, the sprinkler station will remain on indefinitely, EVEN THOUGH the web page toggle shows the station is off. (I think that's because jQuery on the client controls the toggle appearance.)
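One way to bound this failure mode would be a server-side cap on how long a manually started station may stay on, so it shuts off even if the client goes away. The Python sketch below is only an illustration: the `turn_off` callback, the class name, and the 30-minute default are hypothetical and not taken from sprinklers_pi.

```python
import time


class ManualRunGuard:
    """Turn off manually started stations after a maximum on-time."""

    def __init__(self, turn_off, max_on_s=30 * 60, clock=time.monotonic):
        self._turn_off = turn_off    # hypothetical callback: turn_off(station)
        self._max_on_s = max_on_s    # assumed limit, not a sprinklers_pi setting
        self._clock = clock
        self._started = {}           # station -> time it was switched on

    def station_on(self, station):
        self._started[station] = self._clock()

    def station_off(self, station):
        self._started.pop(station, None)

    def check(self):
        """Call periodically on the server; returns the stations it shut off."""
        now = self._clock()
        expired = [s for s, t0 in self._started.items()
                   if now - t0 >= self._max_on_s]
        for station in expired:
            self._turn_off(station)
            del self._started[station]
        return expired
```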

rszimm commented 10 years ago

Hmmm, that's really bad. Is it possible for you to send me the log file at /var/log/sprinklers_pi and let me know where in the log the freeze happened?

I don't suppose you can recreate the issue? Would you happen to know if the process hung or just terminated?

So there's a couple of possibilities here. There are service control processes that I could run that would monitor the sprinklers_pi service and keep it running should it terminate unexpectedly. There is also a hardware watchdog on the Raspberry Pi board that requires a ping every 16 seconds or else it reboots the board. That's also a possibility (although it ends up being very hardware specific). I'll give it some thought, but this is a high priority right now.
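For illustration, here is a rough Python sketch of the hardware watchdog option, using the standard Linux watchdog device interface (writing to the device resets its timer). It assumes the watchdog driver is loaded and exposed at `/dev/watchdog`; the `step` callback and the function names are hypothetical and not part of sprinklers_pi.

```python
import time

WATCHDOG_DEVICE = "/dev/watchdog"  # conventional Linux path; assumed to exist


def pet_watchdog(dev):
    """Reset the watchdog timer by writing a byte to the open device."""
    dev.write(b"\0")
    dev.flush()


def run_with_watchdog(step, dev, iterations=None, pause_s=1.0,
                      sleep=time.sleep):
    """Run step() repeatedly, petting the watchdog after each pass.

    step is a hypothetical callback standing in for one pass of the
    control loop. If step() hangs, the petting stops and the hardware
    resets the board once its timeout expires.
    """
    count = 0
    while iterations is None or count < iterations:
        step()
        pet_watchdog(dev)
        sleep(pause_s)
        count += 1
```

On a real board this would be used roughly as `with open(WATCHDOG_DEVICE, "wb", buffering=0) as dev: run_with_watchdog(one_pass, dev)`, run as root, with `pause_s` kept well below the watchdog timeout. A reboot only helps if the program turns all zones off at startup, which the log in this thread suggests v1.0.7 does.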

sweeney4 commented 10 years ago

Attached is the requested log file. I also posted a response on github.

Following is from attached log starting at about line 3486:

2014/04/18 05:00:00 Adjusting H(-28)T(-20)R(0):52

2014/04/18 05:00:00 Turning on Zone 1 <- correct start of schedule

2014/04/18 05:16:00 Turning on Zone 2

2014/04/18 05:26:00 Turning on Zone 3 <- zone did not turn off until I rebooted at 7:45. Expected run time about 10 minutes.

2014/04/18 05:26:20 Got a client

2014/04/18 05:26:21 ERROR!

2014/04/18 07:45:47 Starting v1.0.7.. <- my reboot

2014/04/18 07:45:47 Turning Off All Zones

2014/04/18 07:45:47 Listening on Port 8080

2014/04/18 07:45:47 Turning Off All Zones

2014/04/18 11:49:51 Got a client
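To locate this kind of freeze in a longer log, a small Python sketch like the one below could report how long the log was silent before each restart. It assumes every entry begins with a `YYYY/MM/DD HH:MM:SS` timestamp and that a restart is marked by a line containing `Starting v`, as in the excerpt above; the function name is made up for this example.

```python
from datetime import datetime, timedelta

TIMESTAMP_FORMAT = "%Y/%m/%d %H:%M:%S"  # format seen in the excerpt above


def silence_before_restarts(lines):
    """Return (last_entry, restart, gap) for each restart line.

    A restart is assumed to be any line containing "Starting v".
    Blank lines and lines without a leading timestamp are skipped.
    """
    previous = None
    results = []
    for line in lines:
        try:
            stamp = datetime.strptime(line[:19], TIMESTAMP_FORMAT)
        except ValueError:
            continue
        if "Starting v" in line and previous is not None:
            results.append((previous, stamp, stamp - previous))
        previous = stamp
    return results
```

For the excerpt above, the only restart is at 07:45:47 and the last entry before it is the 05:26:21 ERROR! line, a silence of 2 hours 19 minutes 26 seconds.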


sweeney4 commented 10 years ago

OK, I've sent the log separately. I can spend time at this end to determine if it is something I've done causing the problem. Send me any suggestions; also, see my thoughts on this below. Even if it's my own fault, I need backup to be sure the water cannot remain running indefinitely.

I don't know if the process hung or just terminated, but it left a station running until I rebooted. After the reboot I confirmed that all stations could be toggled on and off manually using sprinklers_pi.

Three things that I've done (or let happen) that may be at fault:

  1. While I'm using a 16GB SD card I let the root partition / nearly fill up. I created a third partition where I try to put new files, etc. but I haven't made more room on /. This is not good but hasn't caused a problem elsewhere. I'll fix this but I don't want to make multiple changes until we isolate this issue.
  2. I added a commercial water meter with a reed switch to measure outdoor water flow. The switch closes once for each gallon of water used. I'm using Pin 25 of the RPi with the internal RPi pull-up resistor enabled. I isolated the RPi from the meter reed switch circuit with an optical isolator. A falling edge on Pin 25 triggers an interrupt handled by a background Python script. Results are written to an SQLite3 database. The code and database are independent of the sprinklers_pi program.
  3. For completeness, on the morning of the problem I discovered that my office Windows computer had rebooted itself. There was an automatic install of a Microsoft update for IE-11 at 3:00AM. I may have left my office Chrome browser with a connection open to the sprinklers_pi server.

sweeney4 commented 10 years ago

I strongly suspect I caused the unfortunate result of the sprinklers remaining on for several hours. I was sloppy in letting the SD card root partition essentially fill up. This likely caused unexpected behavior. Some things worked and some didn't. Sprinklers_pi worked fine for several days and then one morning left a station on for several hours. After that, I could start sprinklers_pi and a few hours later the process would just stop.

To address this, I started with a new 16GB card, flashed Debian, and expanded the root partition properly (now at 17% used). Things worked great this morning. (I've retained the former card in its final state if someone wants to try to reproduce the error.)
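Since a nearly full root partition appears to have been the trigger, a periodic check along these lines could warn before it happens again. This is only a sketch using Python's standard `shutil.disk_usage`; the function names and the 90% limit are arbitrary assumptions, and the percentage may differ slightly from what `df` reports.

```python
import shutil


def percent_used(path="/"):
    """Return the percentage of the filesystem holding path that is used."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def is_nearly_full(path="/", limit_percent=90.0):
    """True when usage is at or above limit_percent (an assumed threshold)."""
    return percent_used(path) >= limit_percent
```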

Nonetheless, I'd feel more comfortable if the sprinkler system had software and/or hardware to ensure a graceful failure - even if I cause the failure.