goodformandspectacle / v_and_a


Add Elastic Beanstalk healthcheck #20

Closed: infovore closed this 9 years ago

infovore commented 9 years ago

Set up a healthcheck URL and have it email Tom / glo on failure.

This is to let us know when the site 'shits itself'. However: I'm not sure how useful it'll be, because the most common cause of the site shitting itself is the RDS database server timing out (because it's at 100% CPU or something) and I don't think that technically raises an 'error' anywhere. Still, better than nothing.
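To make the worry above concrete: a check only notices a slow database if it gives up after a deadline instead of waiting. Here is a minimal Python sketch of that idea (not this project's code). The names `run_with_deadline` and `healthcheck_response` are made up for illustration, and `check` stands in for whatever cheap database query the app would run.

```python
import concurrent.futures


def run_with_deadline(check, timeout_s):
    """Run check() in a worker thread and report 'ok', 'error' or 'timeout'.

    `check` is a hypothetical zero-argument callable, e.g. a trivial
    database query. A hung call is reported as 'timeout' rather than
    blocking the caller.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(check)
    try:
        future.result(timeout=timeout_s)
        return "ok"
    except concurrent.futures.TimeoutError:
        # The check is still running; we stop waiting for it.
        return "timeout"
    except Exception:
        # The check itself raised (connection refused, bad query, ...).
        return "error"
    finally:
        # Do not block on the worker thread if it is still busy.
        executor.shutdown(wait=False)


def healthcheck_response(check, timeout_s=5.0):
    """Map the check outcome to an (HTTP status, body) pair."""
    outcome = run_with_deadline(check, timeout_s)
    return (200, "ok") if outcome == "ok" else (503, outcome)
```

The point of the design is that "the query never came back" becomes an explicit outcome of its own, rather than something that only shows up as a blank page.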

george08 commented 9 years ago

You've already seen the RDS DB timing out?

(Replying by email to Tom Armitage's comment of 4 Jan 2015, 21:58.)

infovore commented 9 years ago

When we had those "white screens" in December, with no error message, it looked like RDS was failing to complete a query within 60s while its CPU was pegged at 100%. (Hence the program never got as far as throwing a 500 error, or indeed any error.)

I tried bumping that timeout limit, with little success; instead, I focused on caching results (per-deploy) and speeding up queries, so that we never hit the 60s timeout (which is a reasonable limit for a database query).

However, I wanted to note that being notified of 500s from the server wouldn't necessarily tell us about all the errors we've seen in this project.
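As a rough illustration of that last point, here is a Python sketch of an outside probe that treats "no answer in time" and "blank page" as failures in their own right, separate from a 500. It is only a sketch under assumptions: `probe`, its outcome labels, and the idea that a white screen arrives as an empty body are all guesses for illustration, not something confirmed in this thread.

```python
import socket
import urllib.error
import urllib.request


def probe(url, timeout_s=10.0, opener=urllib.request.urlopen):
    """Fetch url and classify the result.

    Returns one of: 'ok', 'blank_page', 'server_error', 'client_error',
    'timeout', 'unreachable'. `opener` is injectable so the function can
    be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout_s) as response:
            body = response.read()
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx/5xx responses.
        return "server_error" if exc.code >= 500 else "client_error"
    except urllib.error.URLError as exc:
        # Connection-phase timeouts arrive wrapped in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "unreachable"
    except (socket.timeout, TimeoutError):
        # Read-phase timeouts are raised directly.
        return "timeout"
    if not body.strip():
        # Assumption: a "white screen" shows up as an empty response body.
        return "blank_page"
    return "ok"
```

A monitor built on this would alert on anything other than `'ok'`, so a stalled database query surfaces as `'timeout'` even though the server never produced a 500.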

george08 commented 9 years ago

Copy that.

(Replying by email to Tom Armitage's comment of 4 Jan 2015, 23:34.)

infovore commented 9 years ago

Tom: healthchecks don't work like you think. The healthcheck has to pass, or else the instance gets marked as failing, and then it disappears.

So I'm going to look into other ways to stay on top of things breaking.