jamesrwhite / minicron

🕰️ Monitor your cron jobs
GNU General Public License v3.0
2.34k stars 154 forks

Frequent issues with "job failed to execute at its expected time" #303

Open PerRommetveit opened 6 years ago

PerRommetveit commented 6 years ago

Version: minicron 0.9.7 OS: Ubuntu 16.04.4 LTS

It's a recurring issue that jobs fail to execute at their expected time. I've got 10 cron jobs set up with minicron, and every day I get messages like this in my Slack channel:

Job #8 failed to execute at its expected time - 2018-04-23 09:29:30 +0000
Job #11 failed to execute at its expected time - 2018-04-23 09:29:30 +0000
Job #16 failed to execute at its expected time - 2018-04-23 09:29:30 +0000

When looking at the Executions history for a particular job, the run that was reported as "failed to execute" has no entry for that point in time. Executions before and after are fine, though, and most executions of a job run without problems. I just checked one particular job, and on a random day it failed 2 times; at roughly 480 runs a day, that's a failure rate of about 0.4%. Minicron simply does not run the job and does not log why, not even when the verbose and debug flags are enabled on the minicron server.

Any idea what the issue can be here? Let me know if you need more information.

PerRommetveit commented 6 years ago

I think I found a solution to the issue above. The ever-growing executions table, which holds data about minicron job executions, slows minicron down significantly.

The following command makes it snappy again. Adjust your paths if these are not the same as mine.

/usr/bin/sqlite3 /opt/minicron/lib/vendor/ruby/2.2.0/gems/minicron-0.9.7.1480251919/db/minicron.sqlite3 "DELETE FROM executions;" ".exit"

It might be an idea to make this a configuration parameter, e.g. preserve_executionhistory=7days or similar.
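A retention setting like the one suggested above could be approximated with a small pruning script instead of wiping the whole table. This is only a sketch, not minicron's own code: it assumes the SQLite `executions` table has a `created_at` column stored as a `YYYY-MM-DD HH:MM:SS` UTC string, and `prune_executions` and `keep_days` are hypothetical names.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def prune_executions(conn, keep_days=7, now=None):
    """Delete execution rows older than keep_days; return the number removed."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=keep_days)).strftime("%Y-%m-%d %H:%M:%S")
    # Assumption: created_at is a 'YYYY-MM-DD HH:MM:SS' string, so comparing
    # strings gives the same result as comparing timestamps.
    cur = conn.execute("DELETE FROM executions WHERE created_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    # Throwaway in-memory table standing in for the real minicron database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE executions (id INTEGER PRIMARY KEY, created_at TEXT)")
    conn.execute("INSERT INTO executions (created_at) VALUES ('2018-04-10 09:00:00')")
    conn.execute("INSERT INTO executions (created_at) VALUES ('2018-04-22 09:00:00')")
    removed = prune_executions(conn, keep_days=7, now=datetime(2018, 4, 23, 9, 29, 30))
    print(f"removed {removed} old execution(s)")
```

To use it against a real install, point `sqlite3.connect` at your own minicron.sqlite3 path and check the column names first.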

jamesrwhite commented 6 years ago

How many executions did you have, out of interest?

I'm currently working in general on improving performance for v1.0. I'm hopeful there is a way other than deleting data to improve this :)

Thanks for the info btw!

PerRommetveit commented 6 years ago

You are most welcome James, and thanks for creating good software, looking forward to v1.0!

I am not sure exactly how many entries there were in the executions table. But I had about 15 jobs, and some of them ran every 3 minutes, so naturally that's 480 entries a day for just one job. Multiply that up, and it becomes a significant number of rows.
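To put a rough number on that growth, the arithmetic can be sketched as follows. The 3-minute interval is from the description above; the job mix in the example (5 jobs every 3 minutes, 10 hourly) and the function names are made up purely for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def runs_per_day(interval_minutes):
    """Number of executions per day for a job that runs every interval_minutes."""
    return MINUTES_PER_DAY // interval_minutes


def rows_after(days, intervals):
    """Rows added to the executions table after `days` days, one row per run."""
    return days * sum(runs_per_day(i) for i in intervals)


if __name__ == "__main__":
    print(runs_per_day(3))  # 480, matching the figure above
    # Hypothetical mix: 5 jobs every 3 minutes plus 10 hourly jobs, over 30 days.
    print(rows_after(30, [3] * 5 + [60] * 10))  # 79200
```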

The symptoms were that jobs would occasionally not execute, and I received error messages in my Slack channel as described in the OP. The web UI also became much slower over time. May I suggest that when viewing a job, not all executions be loaded at once, but perhaps only the first 50, with the user then able to select the next 50 and so on. Loading everything at once is quite slow when a lot of rows are involved.
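The 50-at-a-time idea maps onto a plain LIMIT/OFFSET query. A minimal sketch, assuming an `executions` table with `id`, `job_id` and `created_at` columns (the real minicron schema may differ), with `fetch_executions_page` as a hypothetical helper:

```python
import sqlite3

PAGE_SIZE = 50


def fetch_executions_page(conn, job_id, page, page_size=PAGE_SIZE):
    """Return one page of a job's executions, newest first (page 0 is the latest)."""
    offset = page * page_size
    return conn.execute(
        "SELECT id, created_at FROM executions "
        "WHERE job_id = ? ORDER BY id DESC LIMIT ? OFFSET ?",
        (job_id, page_size, offset),
    ).fetchall()


if __name__ == "__main__":
    # In-memory stand-in for the real database, filled with 120 fake rows.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE executions (id INTEGER PRIMARY KEY, job_id INTEGER, created_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO executions (job_id, created_at) VALUES (?, ?)",
        [(8, "2018-04-23 09:29:30")] * 120,
    )
    print(len(fetch_executions_page(conn, job_id=8, page=0)))  # 50
    print(len(fetch_executions_page(conn, job_id=8, page=2)))  # 20
```

OFFSET still scans the skipped rows, so on very large tables paging by "id less than the last one seen" would likely scale better; an index on `job_id` helps either way.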

I'm also not sure why the 'failed to execute at its expected time' errors happened, but I suspect that the part of the code that tries to run a job has a timeout on some DB read, and when it can't get the data it needs, it skips the execution. That's not the desired behaviour, so if it can be fixed, that would be great.

To reproduce the issue, you could set up a test instance, install minicron with an SQLite database, have a number of jobs (e.g. 20) run every minute, and set up notifications to a Slack channel. After a while you will see jobs starting to fail to execute and the web UI becoming less responsive. I would think increasing the number of jobs significantly would trigger these issues sooner.

jdforsythe commented 6 years ago

@PerRommetveit we have the same issue - I scheduled a job in minicron to clean up the executions and alerts tables in the minicron database:

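#!/bin/bash
# Prune minicron execution and alert rows older than 3 days.

# Connection settings; adjust for your own setup.
USER="root"
PASSWORD=""
DB="minicron"

# Variables are quoted so values containing spaces or shell metacharacters are passed intact.
mysql -u"$USER" -p"$PASSWORD" "$DB" -e "DELETE FROM executions WHERE created_at < DATE_ADD(CURDATE(), INTERVAL -3 DAY);"
mysql -u"$USER" -p"$PASSWORD" "$DB" -e "DELETE FROM alerts WHERE sent_at < DATE_ADD(CURDATE(), INTERVAL -3 DAY);"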