encryptio opened this issue 8 years ago
I read about something like this on GitLab:
http://docs.gitlab.com/ce/monitoring/health_check.html
This would be pretty cool. Do you think we need to use an access token like they do @encryptio?
Looks like GitLab uses the `health_check` gem, which does extensive and expensive checks for the entire Rails stack, including services that are not the process you're querying, or even Rails processes at all (like memcached, MySQL, and the SMTP mail server). GitLab's kind of check would be very well suited to point an alerting system at (where you want a single configuration to send you downtime texts), but NOT as a load balancer readiness check.
Load balancer readiness checks can easily climb to 10 health check requests per second per target node on a large cluster. Google's CDN health checks and Fastly's health checks run multiple times per second per target node (because each load balancer edge does them independently). (In the past I've also worked with systems that do this internally, totaling a thousand health checks per second cluster-wide for a single service.)
I was thinking of a much cheaper, non-recursive check that does no external IO. That keeps it scaling well, at the cost of not catching as many failures as you could. It's probably fine to start with almost nothing (no really, just `return 200;` - even that verifies the config options let the process start up and stay up, and that the listening socket is alive, which is 90% of what you want to know about in Kubernetes). Later you can add little process-local flags that all need to be true, as you find things that are good reasons to say "this horizon process cannot serve useful traffic right now." Things like: "do I have an open changefeed on my collection config", etc.
Also, because it'd be a cheap check, no authorization would be needed.
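A minimal sketch of what that could look like, assuming a bare Node HTTP listener for illustration (the flag name, port, and route wiring here are made up, not Horizon's actual internals):

```typescript
import * as http from "http";

// Process-local readiness flags, flipped by the rest of the server as it
// learns whether it can serve useful traffic. Checking them does no external IO.
const readiness = {
  // Hypothetical flag: set once the changefeed on the collection config is open.
  collectionConfigChangefeedOpen: false,
};

const healthServer = http.createServer((req, res) => {
  if (req.url === "/horizon/healthy") {
    // Start with just "return 200"; each flag added later narrows what counts
    // as "able to serve useful traffic right now".
    const healthy = readiness.collectionConfigChangefeedOpen;
    res.writeHead(healthy ? 200 : 503, { "Content-Type": "text/plain" });
    res.end(healthy ? "ok\n" : "not ready\n");
  } else {
    res.writeHead(404);
    res.end();
  }
});

// A bare 200 from this socket already proves the process started with a valid
// config, stayed up, and its listening socket is alive.
healthServer.listen(8181);
```

Each flag is flipped by ordinary in-process code paths, so answering the probe never touches RethinkDB or any other service.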
I think the `health_check` gem's full-stack check approach is interesting and very, very useful if you're not doing your own alerting, but it's not what I intended with this issue.
I think the plugins branch has grown a proto-plugin for this.
This endpoint (say, `/horizon/healthy`) would return a 200 if the server is up, connected to RethinkDB, and ready to serve user connections, and a 4XX/5XX response if it is not. This would make automated deployment and updates easier, both on bare metal (where HAProxy or a hardware load balancer would check each node for health) and with infrastructure management tools like Kubernetes (where a readinessProbe/livenessProbe would check each pod, during updates and during continuous operation).
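For reference, a Kubernetes readinessProbe against such an endpoint could look roughly like this (container name, image, port, and timings are placeholders, not anything Horizon ships):

```yaml
# Fragment of a hypothetical pod spec for a Horizon container.
containers:
  - name: horizon
    image: example/horizon:latest   # placeholder image
    ports:
      - containerPort: 8181
    readinessProbe:
      httpGet:
        path: /horizon/healthy
        port: 8181
      periodSeconds: 10      # probe each pod every 10s
      failureThreshold: 3    # mark unready after 3 consecutive failures
```

A livenessProbe could point at the same endpoint; the difference is only what Kubernetes does when it fails (stop routing traffic vs. restart the container).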