gruntwork-io / health-checker

A simple HTTP server that will return 200 OK if all configured health checks pass.
MIT License
103 stars 37 forks source link

Support "independent" health checks #3

Open josh-padnick opened 6 years ago

josh-padnick commented 6 years ago

Right now, health-checker only runs health checks in response to an inbound HTTP request. This has the benefit of keeping the code simple while allowing the user to decide how often to run the underlying health check.

But it may be desirable to work around the limitations of, say, Amazon Load Balancer Health Checks by updating health-checker to run its own background health checks.

For example, Amazon Load Balancer health checks can only be run every 10s. But health-checker could support health checks that run every 1s or 500ms. It could then maintain a "state" of the health check depending on the configuration of a health check definition (see #2). In particular, health-checker could check an endpoint every 500ms, so that by the time Amazon's health check comes in, a single result would in fact reflect 20 separate health checks.

brikis98 commented 6 years ago

In particular, health-checker could check an endpoint every 500ms, so that by the time Amazon's health check comes in, a single result would in fact reflect 20 separate health checks.

What is the advantage of this?

josh-padnick commented 6 years ago

It gives you more control over how often your health checks run. For example, you could set the health-check from the ELB to just every 5 seconds, and mark the instance healthy after a single passing health check, but in fact that health-check could represent 10 health-checker checks.

To be clear, I wouldn't consider this a high priority. I was just calling out a design decision of the app.

brikis98 commented 6 years ago

I must be missing something, but what is the advantage of being able to do health checks more often? If you did 10 health checks internally for each 1 health check of the ELB, and one of those 10 failed, do you return a failure to the ELB? Is this handling some case where failures may be faster than an ELB can pick up?

josh-padnick commented 6 years ago

Is this handling some case where failures may be faster than an ELB can pick up?

Exactly. An ELB can check at most once per 5 seconds, but that means you can't do more than one check every 5 seconds.

BTW, I've never needed such a feature in practice, so perhaps this GitHub issue was premature. One use case might be if you've got a service acting as a gateway that's extremely intolerant to any failures, you could have health-checker do 1 check per 500ms, and return a HTTP 500 for at least 5 seconds if there's even one failure. That gives you 10 actual checks for a single ELB check.

But, to be clear, I don't see this as high priority and wouldn't recommend implementing it right now. It's mostly just a neat "opportunity" enabled by having a health check independent of the ELB.

brikis98 commented 6 years ago

Understood, thx!