Closed: lucasvickers closed this issue 7 years ago.
Also note that the /var/lock/mycodo.pid file gets left behind and I need to remove it before restarting the daemon.
Run the daemon manually in debug mode and see what error it spits back at you.
sudo python ~/Mycodo/mycodo/mycodo_daemon.py -d
Thanks will do. I think it runs for a solid 4-5 days before crashing so I'll update this ticket when I can.
Sounds good. Periodically check the size of the log, making sure it doesn't use all the free space on your SD card. Debug mode produces a lot of log lines!
Also, there wasn't an error in /var/log/mycodo/mycodo.log when not running in debug mode?
32GB card so should be good, but thanks for the tip.
I couldn't find any specific error, though the log is busy. I also checked the system log and saw the same info. Are your logging levels standardized so I could just grep the log for "ERROR", etc., to figure out whether one was thrown much earlier on?
Yes, it should be entered as error (but not always). Can you attach the log here for me to look at? There shouldn't be any sensitive info logged.
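For example, grep -i error /var/log/mycodo/mycodo.log should pull out anything logged at the error level (adjust the path if your logs are stored elsewhere).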
I'm running in debug now.
Logs here. The restart on 2017-02-26 10:34:02
was on purpose and errors before that were due to experimenting with sensors.
The below occurred when the last measurement was taken. Any insight? I've also attached both logs. mycodo-2.zip
Mar 4 04:28:16 nibbler systemd[1]: Stopping User Manager for UID 1001...
Mar 4 04:28:16 nibbler systemd[22374]: Stopping Default.
Mar 4 04:28:16 nibbler systemd[22374]: Stopped target Default.
Mar 4 04:28:16 nibbler systemd[22374]: Stopping Basic System.
Mar 4 04:28:16 nibbler systemd[22374]: Stopped target Basic System.
Mar 4 04:28:16 nibbler systemd[22374]: Stopping Paths.
Mar 4 04:28:16 nibbler systemd[22374]: Stopped target Paths.
Mar 4 04:28:16 nibbler systemd[22374]: Stopping Timers.
Mar 4 04:28:16 nibbler systemd[22374]: Stopped target Timers.
Mar 4 04:28:16 nibbler systemd[22374]: Stopping Sockets.
Mar 4 04:28:16 nibbler systemd[22374]: Stopped target Sockets.
Mar 4 04:28:16 nibbler systemd[22374]: Starting Shutdown.
Mar 4 04:28:16 nibbler systemd[22374]: Reached target Shutdown.
Mar 4 04:28:16 nibbler systemd[22374]: Starting Exit the Session...
Mar 4 04:28:16 nibbler systemd[22374]: Received SIGRTMIN+24 from PID 13300 (kill).
Mar 4 04:28:16 nibbler systemd[1]: Stopped User Manager for UID 1001.
Mar 4 04:28:16 nibbler systemd[1]: Stopping user-1001.slice.
Mar 4 04:28:16 nibbler systemd[1]: Removed slice user-1001.slice.
Both logs you sent stop March 1st, but you said your error occurred March 4th. Were there no log lines at the time of the issue?
I just double checked and both logs in the mycodo-2.zip file end March 4.
Can you double check?
Looking over the log, it doesn't indicate an issue. Can you enable components one by one to determine what part is causing the issue, either sensor measurements, LCD output, or other parts of the system you have active?
Sure. I currently have the following:
I'll start one by one.
Graphs shouldn't be an issue, but the others could.
If you can think of any places where additional logging would help, now's the time :) I'll add more elements after 5 days of successful running, but troubleshooting may wind up taking a while.
You should think about a tool to monitor your daemon similar to how you monitor sensors. Could be useful to notify people that it crashed, etc.
Without knowing exactly what controller(s) is causing the issue, it's difficult to know where to add more logging lines. The daemon-status check is a good idea. I'll put it on the todo list for the 5.0 release.
In case it's helpful at all, whenever I write similar complex systems I always add a verbose level of logging. At the very least I log when I begin a major operation (reading a specific sensor, outputting to an LCD screen, etc.). It creates a lot of log output, but it gives a definitive idea of where an issue occurred.
The type of black box troubleshooting we're doing right now doesn't scale very well.
Either way, I'm happy to do this debugging, just providing some input.
That should be what happens in debug mode (sudo python ~/Mycodo/mycodo/mycodo_daemon.py -d). That sets the logging level to debug, for which there are a lot of logging instances. It might just happen that you came across a bug in an area of the code that doesn't have good exception and/or logging coverage, or something else.
I think this is the first or one of the very few times (I can't recall any other) that the daemon just stops without anything appearing in the log. You're the only person that I know of that's experiencing this issue. So, it could possibly be your install or hardware.
Just curious, before concluding the daemon crashed, did you confirm with ps aux | grep mycodo_daemon.py
that the daemon did indeed stop and wasn't just unresponsive?
I sure hope it's not the hardware, that would be challenging to debug. The next time I get it to hang I'll check ps.
Just curious, what were these messages about:
Mar 4 04:28:16 nibbler systemd[22374]: Reached target Shutdown.
Mar 4 04:28:16 nibbler systemd[22374]: Starting Exit the Session...
Mar 4 04:28:16 nibbler systemd[22374]: Received SIGRTMIN+24 from PID 13300 (kill).
I think it's pretty suspicious that they coincided with the last LCD update. I was using the LCD screen with time output to see if it was responsive or not, and when the issue occurred the LCD was frozen at roughly 4:28.
Thanks for helping me work through all this.
I'm not sure what those messages mean.
I just looked over the LCD controller and it's fairly covered with exception-level logging lines, so it should be caught if that's the issue. It's a mystery.
http://serverfault.com/questions/774491/what-is-sigrtmin24-in-syslog
Weird. Perhaps this is some type of system issue. I'll keep my eyes on it and see what I can find.
I was able to confirm that the daemon does crash. It does not hang. I will continue with removing individual components of my system until it runs stable.
I reviewed the logs and your code a little more and I want to bring back up the option of additional logging. You did a great job in defensive programming and finding ways to capture exceptions, but clearly there is something going on that we can't find.
Even in debug mode, all I see are the following 4 lines repeated:
2017-03-10 06:14:44,967 - mycodo.lcd_jK2jnVDR - DEBUG - Latest temperature: 17.5 @ 2017-03-10 06:14:42
2017-03-10 06:14:44,998 - mycodo.lcd_jK2jnVDR - DEBUG - Latest humidity: 10.8 @ 2017-03-10 06:14:42
2017-03-10 06:14:45,026 - mycodo.lcd_jK2jnVDR - DEBUG - Latest co2: 413 @ 2017-03-10 06:14:34
2017-03-10 06:14:45,067 - mycodo.lcd_jK2jnVDR - DEBUG - Latest time: 17.5 @ 2017-03-10 06:14:42
This doesn't give me any insight into what else is happening. Presumably there is a function that will handle grabbing sensor values, and a function that will kick off camera captures, etc, correct?
I would recommend adding logging such as the following:
def update_lcd():
    Logger.verbose("Starting LCD Update.")
    # ... your logic ...
    Logger.verbose("Finished LCD Update.")
or, if inside a loop, something like this:
def run():
    try:
        logging.verbose("Starting LCD Loop.")
        self.running = True
        # ...
        while self.running:
            if time.time() > self.timer:
                logging.verbose("Starting LCD Logic.")
                self.get_lcd_strings()
                # ...
            logging.verbose("LCD Sleeping.")
            time.sleep(1)
        # ...
        logging.verbose("LCD Loop has Ended.")
    except Exception:
        logging.exception("LCD Loop crashed.")
While this will generate a large amount of log output, it would let us actually know what was going on at the time of a crash. Whatever the issue may be, be it a hardware problem, an uncaught exception, etc., we would quickly be able to tell the last function that was called before things went wrong. This would really help narrow down troubleshooting.
The reason I suggest it's done in a VERBOSE mode is that you don't need this in most situations, so allowing a DEBUG and VERBOSE log helps resolve issues like the one I am having without flooding logs for the users who only need DEBUG.
If you are interested in this I'm happy to help work through your code and add this type of logging.
More logging would be nice. Thanks for the offer to help. A few questions and concerns:
How would one add VERBOSE as a logging level beyond DEBUG?
I'm getting ready to release version 5.0, so adding code to 4.1.16 may not be the best idea because it will not be maintained and merging with 5.0, when it's released, could become more difficult. Perhaps you should give the dev-5.0 branch a try and see if you still experience the same daemon crash issues.
Ahh sorry, I forgot VERBOSE isn't a built-in logging level in Python. There are ways to add custom levels; I can help with that.
I'll set my test back up in dev-5.0 and see if I have the issue. We can take it from there, and if needed I'll help with the logging in 5.0.
Thanks!
This seems like the proper way to add another logging level.
import logging

DEBUG_LEVELV_NUM = 9
logging.addLevelName(DEBUG_LEVELV_NUM, "DEBUGV")

def debugv(self, message, *args, **kws):
    # Only log if the custom DEBUGV level is enabled for this logger
    if self.isEnabledFor(DEBUG_LEVELV_NUM):
        self._log(DEBUG_LEVELV_NUM, message, args, **kws)

logging.Logger.debugv = debugv
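A quick usage sketch once that's registered (the logger name below is just illustrative):
logging.basicConfig(level=DEBUG_LEVELV_NUM)
logger = logging.getLogger("mycodo.lcd_example")  # illustrative name
logger.debugv("Starting LCD Update.")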
I'll keep 5.0-dev running for the week and see if it is stable.
I just got 5.0 freshly installed. Thanks for the debugging.
Sad to report the daemon went dead after 4-5 days uptime (same as usual).
I'm going to go back to disabling individual components and see what that yields, but I'd love to talk about getting that extra logging in place.
That's unfortunate. What is your exact configuration? Can you upload your mycodo.db? You should delete the user table just to be safe (even though passwords are hashed), if posting here publicly.
As for logging, that sounds like a good approach, now that 5.0 has been released. I'll see about getting a verbose mode added in addition to the debug (-d -v parameters).
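A rough sketch of how the flags could select the level (this isn't the actual Mycodo argument handling, just the general idea; DEBUG_LEVELV_NUM is the custom level registered in the earlier snippet):

import argparse
import logging

DEBUG_LEVELV_NUM = 9  # custom "DEBUGV" level, registered as in the earlier snippet

parser = argparse.ArgumentParser(description='Mycodo daemon (sketch)')
parser.add_argument('-d', '--debug', action='store_true', help='set logging level to DEBUG')
parser.add_argument('-v', '--verbose', action='store_true', help='set logging level to DEBUGV (more detail than DEBUG)')
args = parser.parse_args()

if args.verbose:
    log_level = DEBUG_LEVELV_NUM  # DEBUGV sits below DEBUG, so DEBUG messages are included too
elif args.debug:
    log_level = logging.DEBUG
else:
    log_level = logging.INFO

logging.basicConfig(level=log_level)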
It's a randomly generated password that I can change, so I don't really care. I currently have the LCD disabled to test that, but the LCD was active when the issues occurred.
Thanks. I'd bet it's the LCD. I've never tested the 4-line LCD (only the 2-line). Everything else I've had running on my systems for months straight without issue.
Current test has no LCD, so we'll see. I won't know until next weekend. Strange how it's so consistently 4-5 days.
I had some thoughts on this that I wanted to put out there.
It seems to me that for the moment a quick patch would be to have a script running in a tight loop, maybe in a subprocess, that checks the lock file's process id to see if it's running. The moment it isn't the script cleans it up and fires it off again.
The next thing is that I noticed #208, and this behavior does sound like a memory leak or a resource not being closed. Maybe the debug log levels should include current system resources in their statements (rough sketch below).
while true; do maintain_active_daemon.py; sleep 1; done
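On the resource-logging idea, here's roughly the kind of thing I mean (assuming the psutil package is installed; it isn't something Mycodo relies on, as far as I know):

import logging
import psutil

logger = logging.getLogger("mycodo.resources")  # illustrative logger name

def log_resource_usage():
    # Record process memory plus system-wide memory/CPU so a slow leak shows up in the log over time
    rss_mb = psutil.Process().memory_info().rss / (1024.0 * 1024.0)
    logger.debug("Process RSS: %.1f MB, system memory used: %.1f%%, CPU: %.1f%%",
                 rss_mb, psutil.virtual_memory().percent, psutil.cpu_percent())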
Here's a quick script to check if the daemon is running:
#!/usr/bin/python
import os

DAEMON_PID_FILE = '/var/lock/mycodo.pid'

def daemon_running():
    if os.path.exists(DAEMON_PID_FILE):
        with open(DAEMON_PID_FILE, 'r') as pid_file:
            pid = pid_file.read().strip()
            if os.path.exists("/proc/{pid}".format(pid=pid)):
                return "Mycodo Daemon is running"
    return "Mycodo Daemon is not running"

if __name__ == '__main__':
    print(daemon_running())
Would the easiest action be to use subprocess to execute the following?
rm /var/lock/mycodo.pid && service mycodo restart
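For example, something along these lines (restart_daemon is just a hypothetical helper name):

import subprocess

def restart_daemon():
    # Remove the stale PID file and restart the service; check_call raises if either step fails
    subprocess.check_call("rm -f /var/lock/mycodo.pid && service mycodo restart", shell=True)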
The idea sounds right. The daemon works perfectly until it dies, so having the ability to reboot the daemon is a good feature.
As for your script, the only problem is that you would lose whatever arguments you passed the daemon initially (i.e. -d or -v)
I'll make a cup of coffee and mull it over.
What about pulling the arguments from the environment? Or, what if the daemon arguments were handed to the sentry process so that it would handle passing them to the script?
Also I like the subprocess module because from what I've read, letting the system manage the command allows the OS to decide which core the process runs on.
So a sentry process that would just use Python's subprocess module to call up the actual daemon?
Yeah, that's what I was thinking. The subprocess is just a command-line call to a program that keeps the daemon running. Since it steers the daemon, it could pass arguments to it when it starts it.
Maybe it would be worth thinking of this as a systemd service instead
So this would be the general system setup?
- maintain_active_daemon.py starts mycodo_daemon.py
- maintain_active_daemon.py will restart the daemon if it stops, with whatever parameters it was started with last

In theory sounds right, my only questions are: how does mycodo_client.py work now? Seems like the shutdown command would need to be sent to maintain_active_daemon.py while the other commands go to the daemon itself.

It does make it tricky with systemd involved. Perhaps the run parameters could be stored in a file that could be erased if the daemon is stopped manually; otherwise it will start up with those parameters if they exist. That would allow systemd to always control mycodo_daemon.py.
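Roughly like this, for example (the file path and helper names here are made up just to illustrate the idea):

import os

PARAM_FILE = '/var/lock/mycodo.args'  # hypothetical location for the saved run parameters

def save_run_parameters(args_list):
    # Record the arguments the daemon was started with, e.g. ['-d', '-v']
    with open(PARAM_FILE, 'w') as f:
        f.write(' '.join(args_list))

def load_run_parameters():
    # Return the saved arguments if they exist, otherwise an empty list
    if os.path.exists(PARAM_FILE):
        with open(PARAM_FILE) as f:
            return f.read().split()
    return []

def clear_run_parameters():
    # Called when the daemon is stopped on purpose, so it isn't restarted with stale flags
    if os.path.exists(PARAM_FILE):
        os.remove(PARAM_FILE)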
Yeah that seems less intrusive
That would allow maintain_active_daemon.py to only have to worry about executing service mycodo restart and not have to handle any parameters.
Sad to report it crashed again w/o the LCD output enabled (device still hooked up). Removing the (2) DHT22 sensors now, will see what happens.
Mycodo Issue Report:
Problem Description
Daemon shuts down / crashes without log entries.
Errors
Daemon Log: Below is all I could find. The rest of the log is repeating mysql queries that I assume are the result of a persistent web-client / graphing.
Steps to Reproduce the issue:
How can this issue be reproduced?
Additional Notes
I've been able to do this twice now, so I believe it's consistent. Please let me know what other types of log entries I should look for or if I can enable some increased logging, etc.