Closed scontapay closed 9 years ago
Thanks for reporting this!
Fixed for now. Investigating the root cause. This didn't impact the API server which remained available through the website's downtime. A bug in the monitoring code also prevented early detection -- it was only monitoring the API server but is now monitoring both API + Web servers
A more complete (though still incomplete) analysis showed that the webserver went down around 3:54AM PST -- it came back up 9:26AM PST after the web process was manually restarted. At the time, a number of different search engines were indexing the site as well as minimal normal traffic. Something (still unknown) caused the web process to crash without logging a reason. At that point 502 started to be returned by NGINX as the up upstream server was down. Upstart auto-restarted the web process, but something continued to cause the process to crash. The Upstart config was to eventually give up on restarting the process which is what led to the longer term downtime.
Investigating the syslog, server log, and upstart log did not give any additional information other than the process was being started after it exited. None of the first requests that returned 502 or the requests immediately before the crash resulted in a reproducible crash. I have added additional logging to the web process to try to track any unexpected exits. Hopefully we'll catch this faster next time with the new logging and be able to identify a root cause.
For the record, or if anyone else has additional thoughts here's the relevant logs:
Snippet from /var/log/bloomweb-upstart.log
[Thu Jan 8 16:49:48 EST 2015] BloomWeb Starting
[Thu Feb 12 06:54:38 EST 2015] BloomWeb Starting
[Thu Feb 12 06:54:39 EST 2015] BloomWeb Starting
[Thu Feb 12 06:54:40 EST 2015] BloomWeb Starting
[Thu Feb 12 06:54:40 EST 2015] BloomWeb Starting
[Thu Feb 12 06:54:41 EST 2015] BloomWeb Starting
[Thu Feb 12 12:26:03 EST 2015] BloomWeb Starting
Snippet from /var/log/syslog
Feb 12 06:54:38 web-01 kernel: [7136918.929504] init: bloomweb main process ended, respawning
Feb 12 06:54:39 web-01 kernel: [7136919.938390] init: bloomweb main process ended, respawning
Feb 12 06:54:40 web-01 kernel: [7136920.597809] init: bloomweb main process ended, respawning
Feb 12 06:54:40 web-01 kernel: [7136921.246691] init: bloomweb main process ended, respawning
Feb 12 06:54:41 web-01 kernel: [7136921.844517] init: bloomweb main process ended, respawning
Feb 12 06:54:41 web-01 kernel: [7136922.455762] init: bloomweb respawning too fast, stopped
From /srv/bloomweb/logs/server.log
2015-02-12T11:54:38.996Z - info: starting bloomapi
2015-02-12T11:54:38.997Z - info: Master launching 1 workers
2015-02-12T11:54:39.357Z - info: Worker 1 listening on 3000
2015-02-12T11:54:39.669Z - info: starting bloomapi
2015-02-12T11:54:39.670Z - info: Master launching 1 workers
2015-02-12T11:54:40.010Z - info: Worker 1 listening on 3000
2015-02-12T11:54:40.351Z - info: starting bloomapi
2015-02-12T11:54:40.352Z - info: Master launching 1 workers
2015-02-12T11:54:40.666Z - info: Worker 1 listening on 3000
2015-02-12T11:54:40.957Z - info: starting bloomapi
2015-02-12T11:54:40.958Z - info: Master launching 1 workers
2015-02-12T11:54:41.264Z - info: Worker 1 listening on 3000
2015-02-12T11:54:41.555Z - info: starting bloomapi
2015-02-12T11:54:41.556Z - info: Master launching 1 workers
2015-02-12T11:54:41.876Z - info: Worker 1 listening on 3000
2015-02-12T17:26:03.902Z - info: starting bloomapi
http://www.bloomapi.com/
8:42 AM Pacific Time