kizniche / Mycodo

An environmental monitoring and regulation system
http://kylegabriel.com/projects/
GNU General Public License v3.0

Mycodo 4.1.16 daemon crashes after a few days and does not automatically restart. #198

Closed: lucasvickers closed this issue 7 years ago

lucasvickers commented 7 years ago

Mycodo Issue Report:

Problem Description

Daemon shuts down / crashes without log entries.

Errors

Daemon Log: Below is all I could find. The rest of the log is repeating mysql queries that I assume are the result of a persistent web-client / graphing.

Feb 27 20:45:55 nibbler systemd[1]: mycodo.service: main process exited, code=killed, status=6/ABRT
Feb 27 20:45:55 nibbler mycodo_client.py[20340]: 2017-02-27 20:45:55,935 Connection refused. Is the daemon running?

Steps to Reproduce the issue:

How can this issue be reproduced?

  1. Run setup for multiple days.

Additional Notes

I've been able to do this twice now, so I believe it's consistent. Please let me know what other types of log entries I should look for or if I can enable some increased logging, etc.

lucasvickers commented 7 years ago

Also note that the /var/lock/mycodo.pid file gets left behind and I need to remove it before restarting the daemon.

kizniche commented 7 years ago

Run the daemon manually in debug mode and see what error it spits back at you.

sudo python ~/Mycodo/mycodo/mycodo_daemon.py -d

lucasvickers commented 7 years ago

Thanks, will do. I think it runs for a solid 4-5 days before crashing, so I'll update this ticket when I can.

kizniche commented 7 years ago

Sounds good. Periodically check the size of the log, making sure it doesn't use all the free space on your SD card. Debug mode produces a lot of log lines!

kizniche commented 7 years ago

Also, there wasn't an error in /var/log/mycodo/mycodo.log when not running in debug mode?

lucasvickers commented 7 years ago

32GB card so should be good, but thanks for the tip.

I couldn't find any specific error, though the log is busy. I also checked the system log and saw the same info. Are your logging levels standardized, so I could just grep the log for "ERROR", etc., to figure out whether one was thrown earlier on?

kizniche commented 7 years ago

Yes, it should be logged as ERROR (but not always). Can you attach the log here for me to look at? There shouldn't be any sensitive info logged.

lucasvickers commented 7 years ago

I'm running in debug now.

Logs here. The restart on 2017-02-26 10:34:02 was on purpose and errors before that were due to experimenting with sensors.

lucasvickers commented 7 years ago

The below occurred when the last measurement was taken. Any insight? I've also attached both logs. mycodo-2.zip

Mar  4 04:28:16 nibbler systemd[1]: Stopping User Manager for UID 1001...
Mar  4 04:28:16 nibbler systemd[22374]: Stopping Default.
Mar  4 04:28:16 nibbler systemd[22374]: Stopped target Default.
Mar  4 04:28:16 nibbler systemd[22374]: Stopping Basic System.
Mar  4 04:28:16 nibbler systemd[22374]: Stopped target Basic System.
Mar  4 04:28:16 nibbler systemd[22374]: Stopping Paths.
Mar  4 04:28:16 nibbler systemd[22374]: Stopped target Paths.
Mar  4 04:28:16 nibbler systemd[22374]: Stopping Timers.
Mar  4 04:28:16 nibbler systemd[22374]: Stopped target Timers.
Mar  4 04:28:16 nibbler systemd[22374]: Stopping Sockets.
Mar  4 04:28:16 nibbler systemd[22374]: Stopped target Sockets.
Mar  4 04:28:16 nibbler systemd[22374]: Starting Shutdown.
Mar  4 04:28:16 nibbler systemd[22374]: Reached target Shutdown.
Mar  4 04:28:16 nibbler systemd[22374]: Starting Exit the Session...
Mar  4 04:28:16 nibbler systemd[22374]: Received SIGRTMIN+24 from PID 13300 (kill).
Mar  4 04:28:16 nibbler systemd[1]: Stopped User Manager for UID 1001.
Mar  4 04:28:16 nibbler systemd[1]: Stopping user-1001.slice.
Mar  4 04:28:16 nibbler systemd[1]: Removed slice user-1001.slice.

kizniche commented 7 years ago

Both logs you sent stop March 1st, but you said your error occurred March 4th. Were there no log lines at the time of the issue?

lucasvickers commented 7 years ago

I just double checked and both logs in the mycodo-2.zip zip file end March 4.

Can you double check?

kizniche commented 7 years ago

Looking over the log, it doesn't indicate an issue. Can you enable components one by one to determine what part is causing the issue, either sensor measurements, LCD output, or other parts of the system you have active?

lucasvickers commented 7 years ago

Sure. I currently have the following:

I'll start one by one.

kizniche commented 7 years ago

Graphs shouldn't be an issue, but the others could.

lucasvickers commented 7 years ago

If you can think of any places where additional logging would help, now's the time :) I'll add more elements after 5 days of successful running, but troubleshooting may wind up taking a while.

You should think about a tool to monitor your daemon, similar to how you monitor sensors. It could be useful for notifying people that it crashed, etc.

kizniche commented 7 years ago

Without knowing exactly which controller(s) are causing the issue, it's difficult to know where to add more logging lines. The daemon-status check is a good idea. I'll put it on the todo list for the 5.0 release.

lucasvickers commented 7 years ago

In case it's helpful at all, whenever I write similarly complex systems I always add a verbose level of logging. At the very least, I log when I begin a major operation (reading a specific sensor, outputting to an LCD screen, etc.). It creates a lot of log output, but it gives a definitive idea of where an issue occurred.

The type of black box troubleshooting we're doing right now doesn't scale very well.

Either way, I'm happy to do this debugging, just providing some input.

kizniche commented 7 years ago

That should be what happens in debug mode (sudo python ~/Mycodo/mycodo/mycodo_daemon.py -d). That sets the logging level to debug, for which there are a lot of logging instances. It may just be that you came across a bug in an area of the code that doesn't have good exception and/or logging coverage, or something else.
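Conceptually, the -d flag just lowers the logging threshold; a simplified sketch of that idea (not the daemon's actual argument handling) would be:

import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument('-d', '--debug', action='store_true', help='enable debug logging')
args = parser.parse_args()

# Debug mode lowers the threshold so every logger.debug() call gets written out
logging.basicConfig(level=logging.DEBUG if args.debug else logging.INFO)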

kizniche commented 7 years ago

I think this is the first or one of the very few times (I can't recall any other) that the daemon just stops without anything appearing in the log. You're the only person that I know of that's experiencing this issue. So, it could possibly be your install or hardware.

kizniche commented 7 years ago

Just curious, before concluding the daemon crashed, did you confirm with ps aux | grep mycodo_daemon.py that the daemon did indeed stop and wasn't just unresponsive?

lucasvickers commented 7 years ago

I sure hope it's not the hardware; that would be challenging to debug. The next time I get it to hang I'll check ps.

Just curious, what were these messages about:

Mar  4 04:28:16 nibbler systemd[22374]: Reached target Shutdown.
Mar  4 04:28:16 nibbler systemd[22374]: Starting Exit the Session...
Mar  4 04:28:16 nibbler systemd[22374]: Received SIGRTMIN+24 from PID 13300 (kill).

I think it's pretty suspicious that they coincided with the last LCD update. I was using the LCD screen with time output to see if it was responsive or not, and when the issue occurred the LCD was frozen at roughly 4:28.

Thanks for helping me work through all this.

kizniche commented 7 years ago

I'm not sure what those messages mean.

I just looked over the LCD controller and it's fairly well covered with exception-level logging lines, so an error there should be caught if that's the issue. It's a mystery.

lucasvickers commented 7 years ago

http://serverfault.com/questions/774491/what-is-sigrtmin24-in-syslog

Weird. Perhaps this is some type of system issue. I'll keep my eyes on it and see what I can find.

lucasvickers commented 7 years ago

I was able to confirm that the daemon does crash. It does not hang. I will continue with removing individual components of my system until it runs stable.

I reviewed the logs and your code a little more and I want to bring back up the option of additional logging. You did a great job in defensive programming and finding ways to capture exceptions, but clearly there is something going on that we can't find.

Even in debug mode, all I see are the following 4 lines repeated:

2017-03-10 06:14:44,967 - mycodo.lcd_jK2jnVDR - DEBUG - Latest temperature: 17.5 @ 2017-03-10 06:14:42
2017-03-10 06:14:44,998 - mycodo.lcd_jK2jnVDR - DEBUG - Latest humidity: 10.8 @ 2017-03-10 06:14:42
2017-03-10 06:14:45,026 - mycodo.lcd_jK2jnVDR - DEBUG - Latest co2: 413 @ 2017-03-10 06:14:34
2017-03-10 06:14:45,067 - mycodo.lcd_jK2jnVDR - DEBUG - Latest time: 17.5 @ 2017-03-10 06:14:42

This doesn't give me any insight into what else is happening. Presumably there is a function that will handle grabbing sensor values, and a function that will kick off camera captures, etc, correct?

I would recommend adding logging such as the following:

def update_lcd():
  Logger.verbose("Starting LCD Update.")
  ... your logic
  Logger.verbose("Finished LCD Update.")

or, if inside a loop, something like this:

def run():
    try:
        logging.verbose("Starting LCD Loop.")
        self.running = True
        ...
        while self.running:
            if time.time() > self.timer:
                logging.verbose("Starting LCD Logic.")
                self.get_lcd_strings()
                ...
            logging.verbose("LCD Sleeping.")
            time.sleep(1)
            ...
    finally:
        logging.verbose("LCD Loop has Ended.")

While this will produce a large amount of log output, it would let us actually know what was going on at the time of a crash. Whatever the issue may be, be it a hardware problem, an uncaught exception, etc., we would quickly be able to tell the last function that was called before things went wrong. This would really help narrow down troubleshooting.

The reason I suggest doing this at a VERBOSE level is that you don't need it in most situations, so having both DEBUG and VERBOSE levels helps resolve issues like the one I'm having without flooding the logs of users who only need DEBUG.

If you are interested in this I'm happy to help work through your code and add this type of logging.

kizniche commented 7 years ago

More logging would be nice. Thanks for the offer to help. A few questions and concerns:

How would one add VERBOSE as a logging level beyond DEBUG?

I'm getting ready to release version 5.0, so adding code to 4.1.16 may not be the best idea because it will not be maintained and merging with 5.0, when it's released, could become more difficult. Perhaps you should give the dev-5.0 branch a try and see if you still experience the same daemon crash issues.

lucasvickers commented 7 years ago

Ahh sorry, I forgot VERBOSE is not a built-in logging level in Python. There are ways to add custom levels; I can help with that.

I'll set my test back up in dev-5.0 and see if I have the issue. We can take it from there, and if needed I'll help with the logging in 5.0.

Thanks!

kizniche commented 7 years ago

This seems like the proper way to add another logging level:

import logging

DEBUG_LEVELV_NUM = 9
logging.addLevelName(DEBUG_LEVELV_NUM, "DEBUGV")

def debugv(self, message, *args, **kws):
    # Only emit the message if the DEBUGV level (9) is enabled for this logger
    if self.isEnabledFor(DEBUG_LEVELV_NUM):
        self._log(DEBUG_LEVELV_NUM, message, args, **kws)

logging.Logger.debugv = debugv
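Continuing from the snippet above, usage would look something like this (just a sketch; the logger name here is arbitrary):

logging.basicConfig(level=DEBUG_LEVELV_NUM)  # enable the new level globally
logger = logging.getLogger("mycodo.lcd_example")  # arbitrary example logger name
logger.debugv("Starting LCD Update.")
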
lucasvickers commented 7 years ago

I'll keep 5.0-dev running for the week and see if it is stable.

kizniche commented 7 years ago

I just got 5.0 freshly installed. Thanks for the debugging.

lucasvickers commented 7 years ago

Sad to report the daemon went dead after 4-5 days uptime (same as usual).

I'm going to go back to disabling individual components and see what that yields, but I'd love to talk about getting that extra logging in place.

kizniche commented 7 years ago

That's unfortunate. What is your exact configuration? Can you upload your mycodo.db? If posting it here publicly, you should delete the user table just to be safe (even though passwords are hashed).

As for logging, that sounds like a good approach, now that 5.0 has been released. I'll see about getting a verbose mode added in addition to the debug (-d -v parameters).

lucasvickers commented 7 years ago

mycodo.db.zip

It's a randomly generated password that I can change, so I don't really care. I currently have the LCD disabled to test that, but the LCD was active when the issues occurred.

kizniche commented 7 years ago

Thanks. I'd bet it's the LCD. I've never tested the 4-line LCD (only the 2-line). Everything else I've had running on my systems for months straight without issue.

lucasvickers commented 7 years ago

Current test has no LCD, so we'll see. I won't know until next weekend. Strange how it's so consistently 4-5 days.

etiology commented 7 years ago

I had some thoughts on this that I wanted to put out there.

It seems to me that, for the moment, a quick patch would be to have a script running in a tight loop, maybe in a subprocess, that checks the lock file's process ID to see if the daemon is running. The moment it isn't, the script cleans it up and fires it off again.

The next thing is that I noticed #208, and this behavior does sound like a memory leak or a resource not being closed. Maybe the debug log levels should include current system resources in their statements.
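For example, a minimal sketch of what I mean (the logger name is just a placeholder), using the standard resource module to tack the process's memory usage onto a debug line:

import logging
import resource

logger = logging.getLogger("mycodo.example")  # placeholder logger name

def debug_with_mem(message):
    # ru_maxrss is the peak resident set size; on Linux it is reported in kilobytes
    mem_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    logger.debug("%s (peak RSS: %s kB)", message, mem_kb)

debug_with_mem("Latest temperature: 17.5")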

etiology commented 7 years ago

while true; do maintain_active_daemon.py; sleep 1; done

kizniche commented 7 years ago

Here's a quick script to check if the daemon is running:

#!/usr/bin/python
import os

DAEMON_PID_FILE = '/var/lock/mycodo.pid'

def daemon_running():
    if os.path.exists(DAEMON_PID_FILE):
        with open(DAEMON_PID_FILE, 'r') as pid_file:
            # Strip the trailing newline so the /proc path is built correctly
            if os.path.exists("/proc/{pid}".format(pid=pid_file.read().strip())):
                return "Mycodo Daemon is running"
    return "Mycodo Daemon is not running"

if __name__ == '__main__':
    print(daemon_running())

kizniche commented 7 years ago

Would the easiest action be to use subprocess to execute the following?

rm /var/lock/mycodo.pid && service mycodo restart
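In subprocess form, that could be as simple as the following sketch (whether to shell out like this or remove the file with os.remove is an open choice):

import subprocess

# Remove the stale PID file, then restart the daemon through its service script
subprocess.call("rm /var/lock/mycodo.pid && service mycodo restart", shell=True)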

lucasvickers commented 7 years ago

The idea sounds right. The daemon works perfectly until it dies, so having the ability to reboot the daemon is a good feature.

As for your script, the only problem is that you would lose whatever arguments you passed the daemon initially (i.e. -d or -v)

kizniche commented 7 years ago

I'll make a cup of coffee and mull it over.

etiology commented 7 years ago

What about pulling the arguments from the environment? Or, what if the daemon arguments were handed to the sentry process so that it would handle passing them to the script?

Also, I like the subprocess module because, from what I've read, letting the system manage the command allows the OS to decide which core the process runs on.

lucasvickers commented 7 years ago

So a sentry process that would just use Python's subprocess module to call up the actual daemon?

etiology commented 7 years ago

Yeah, that's what I was thinking. The subprocess is just a command-line call to a program that keeps the daemon running. Since it steers the daemon, it could pass arguments to it when it starts it.

etiology commented 7 years ago

Maybe it would be worth thinking of this as a systemd service instead.
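systemd can already handle the restart-on-crash part by itself; a rough sketch of such a unit (the paths and names here are illustrative, not Mycodo's actual service file) would be:

[Unit]
Description=Mycodo daemon (illustrative example unit)
After=network.target

[Service]
Type=simple
ExecStart=/usr/bin/python /home/pi/Mycodo/mycodo/mycodo_daemon.py
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target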

kizniche commented 7 years ago

So this would be the general system setup?

lucasvickers commented 7 years ago

In theory that sounds right; my only questions are:

kizniche commented 7 years ago

It does make it tricky with systemd involved. Perhaps the run parameters could be stored in a file that is erased if the daemon is stopped manually; otherwise, the daemon starts up with those parameters if they exist. That would allow systemd to always control mycodo_daemon.py.
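A rough sketch of that idea (the file path and helper names here are just placeholders):

import os

RUN_ARGS_FILE = '/var/lock/mycodo_run_args'  # hypothetical location for saved run parameters

def save_run_args(args):
    # Record the arguments (e.g. "-d") so a restart by systemd reuses them
    with open(RUN_ARGS_FILE, 'w') as run_file:
        run_file.write(' '.join(args))

def load_run_args():
    # On startup, reuse any previously saved arguments; otherwise run with none
    if os.path.exists(RUN_ARGS_FILE):
        with open(RUN_ARGS_FILE) as run_file:
            return run_file.read().split()
    return []

def clear_run_args():
    # Called when the daemon is stopped manually, so the next start is a clean one
    if os.path.exists(RUN_ARGS_FILE):
        os.remove(RUN_ARGS_FILE)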

lucasvickers commented 7 years ago

Yeah that seems less intrusive

kizniche commented 7 years ago

That would allow maintain_active_daemon.py to only have to worry about executing service mycodo restart and not have to handle any parameters.
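Something along these lines, reusing the PID check from above (again only a sketch; the sleep interval and cleanup step are placeholders):

#!/usr/bin/python
# Hypothetical sketch of maintain_active_daemon.py: restart the service if the daemon is gone
import os
import subprocess
import time

DAEMON_PID_FILE = '/var/lock/mycodo.pid'

def daemon_running():
    if os.path.exists(DAEMON_PID_FILE):
        with open(DAEMON_PID_FILE) as pid_file:
            return os.path.exists("/proc/{pid}".format(pid=pid_file.read().strip()))
    return False

if __name__ == '__main__':
    while True:
        if not daemon_running():
            # Clean up the stale PID file, then let the service script bring the daemon back
            if os.path.exists(DAEMON_PID_FILE):
                os.remove(DAEMON_PID_FILE)
            subprocess.call(['service', 'mycodo', 'restart'])
        time.sleep(10)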

lucasvickers commented 7 years ago

Sad to report it crashed again without the LCD output enabled (device still hooked up). Removing the two DHT22 sensors now; we'll see what happens.