humanmade / Cavalcade-Runner

Daemon for Cavalcade, a scalable WordPress jobs system.
https://engineering.hmn.md/projects/cavalcade/
54 stars 26 forks source link

Cavalcade Runner crashes if DB becomes unreachable. #66

Closed owaincuvelier closed 4 years ago

owaincuvelier commented 4 years ago

If your DB endpoint becomes unreachable, cavalcade can effectively crash, and fail to resume running jobs once the DB becomes available again.

rmccue commented 4 years ago

@owaincuvelier do you have any further details on "effectively crash"? I'm not sure this is an issue we can fix at the Cavalcade Runner level.

owaincuvelier commented 4 years ago

Yeah, sorry! I couldn't get debug info till recently. I have some syslog entries here:

Feb 10 20:16:08 ip-172-32-10-74 cavalcade[6718]: Error: SQLSTATE[08S01]: Communication link failure: 1053 Server shutdown in progress
Feb 10 20:16:08 ip-172-32-10-74 cavalcade[6718]: #0 /etc/cavalcade/inc/class-runner.php(246): PDO->prepare('SELECT * FROM w...')
Feb 10 20:16:08 ip-172-32-10-74 cavalcade[6718]: #1 /etc/cavalcade/inc/class-runner.php(126): HM\Cavalcade\Runner\Runner->get_next_job()
Feb 10 20:16:08 ip-172-32-10-74 cavalcade[6718]: #2 /etc/cavalcade/bin/cavalcade(33): HM\Cavalcade\Runner\Runner->run()
Feb 10 20:16:08 ip-172-32-10-74 cavalcade[6718]: #3 {main}
Feb 10 20:16:08 ip-172-32-10-74 systemd[1]: cavalcade.service: Main process exited, code=exited, status=1/FAILURE
Feb 10 20:16:08 ip-172-32-10-74 systemd[1]: cavalcade.service: Unit entered failed state.
Feb 10 20:16:08 ip-172-32-10-74 systemd[1]: cavalcade.service: Failed with result 'exit-code'.
Feb 10 20:16:08 ip-172-32-10-74 systemd[1]: cavalcade.service: Service hold-off time over, scheduling restart.
Feb 10 20:16:08 ip-172-32-10-74 systemd[1]: Stopped CavalcadeRunner.
Feb 10 20:16:08 ip-172-32-10-74 systemd[1]: Started CavalcadeRunner.
Feb 10 20:16:09 ip-172-32-10-74 cavalcade[10446]: PHP Notice:  Undefined index: HTTP_USER_AGENT in /srv/www/release-xxxxxxxxx/.config/constants.php on line 109
Feb 10 20:16:09 ip-172-32-10-74 cavalcade[10446]: Error: SQLSTATE[HY000] [2002] Connection refused
Feb 10 20:16:09 ip-172-32-10-74 cavalcade[10446]: #0 /etc/cavalcade/inc/class-runner.php(211): PDO->__construct('mysql:host=db.s...', 'wordpress', 'wordpress', Array)
Feb 10 20:16:09 ip-172-32-10-74 cavalcade[10446]: #1 /etc/cavalcade/inc/class-runner.php(88): HM\Cavalcade\Runner\Runner->connect_to_db()
Feb 10 20:16:09 ip-172-32-10-74 cavalcade[10446]: #2 /etc/cavalcade/bin/cavalcade(32): HM\Cavalcade\Runner\Runner->bootstrap('.')
Feb 10 20:16:09 ip-172-32-10-74 cavalcade[10446]: #3 {main}
Feb 10 20:16:09 ip-172-32-10-74 systemd[1]: cavalcade.service: Main process exited, code=exited, status=1/FAILURE
Feb 10 20:16:09 ip-172-32-10-74 systemd[1]: cavalcade.service: Unit entered failed state.
Feb 10 20:16:09 ip-172-32-10-74 systemd[1]: cavalcade.service: Failed with result 'exit-code'.
Feb 10 20:16:09 ip-172-32-10-74 systemd[1]: cavalcade.service: Service hold-off time over, scheduling restart.
Feb 10 20:16:09 ip-172-32-10-74 systemd[1]: Stopped CavalcadeRunner.
Feb 10 20:16:09 ip-172-32-10-74 systemd[1]: Started CavalcadeRunner.
Feb 10 20:16:09 ip-172-32-10-74 cavalcade[10449]: PHP Notice:  Undefined index: HTTP_USER_AGENT in /srv/www/release-xxxxxxxxx/.config/constants.php on line 109
Feb 10 20:16:09 ip-172-32-10-74 cavalcade[10449]: Error: SQLSTATE[HY000] [2002] Connection refused
Feb 10 20:16:09 ip-172-32-10-74 cavalcade[10449]: #0 /etc/cavalcade/inc/class-runner.php(211): PDO->__construct('mysql:host=db.s...', 'wordpress', 'wordpress', Array)
Feb 10 20:16:09 ip-172-32-10-74 cavalcade[10449]: #1 /etc/cavalcade/inc/class-runner.php(88): HM\Cavalcade\Runner\Runner->connect_to_db()
Feb 10 20:16:09 ip-172-32-10-74 cavalcade[10449]: #2 /etc/cavalcade/bin/cavalcade(32): HM\Cavalcade\Runner\Runner->bootstrap('.')
Feb 10 20:16:09 ip-172-32-10-74 cavalcade[10449]: #3 {main}
Feb 10 20:16:09 ip-172-32-10-74 systemd[1]: cavalcade.service: Main process exited, code=exited, status=1/FAILURE
Feb 10 20:16:09 ip-172-32-10-74 systemd[1]: cavalcade.service: Unit entered failed state.
Feb 10 20:16:09 ip-172-32-10-74 systemd[1]: cavalcade.service: Failed with result 'exit-code'.
Feb 10 20:16:09 ip-172-32-10-74 systemd[1]: cavalcade.service: Service hold-off time over, scheduling restart.
Feb 10 20:16:09 ip-172-32-10-74 systemd[1]: Stopped CavalcadeRunner.
Feb 10 20:16:09 ip-172-32-10-74 systemd[1]: Started CavalcadeRunner.
owaincuvelier commented 4 years ago

At this point the DB instance connection was restored, using the same end point I believe. Yet cavalcade wasn't able to restart itself

rmccue commented 4 years ago

Going to close this, as it's not a bug in Cavalcade.

Essentially, when the DB dies, the Runner loses connection to the database (understandably). It then exits, as it can't do anything.

The system daemon running the Cavalcade Runner is responsible for monitoring and restarting the Runner. In this setup, it appears that it times out on the restarts before the database is back up again.

The solution to this is to adjust the systemd config to restart more times, and potentially wait longer between each attempt, in order to ensure it restarts after the DB is back up. This is out of Cavalcade's control, although might be worth updating our example systemd unit file for it.

owaincuvelier commented 4 years ago

Thanks for the feedback @rmccue!