int08h / roughenough

A Roughtime secure time sync client and server written in Rust
https://int08h.com/post/roughenough-a-rust-roughtime-server/
Apache License 2.0
123 stars 21 forks source link

Add `/heartbeat` style HTTP endpoint for load-balancer health checks #8

Closed grempe closed 6 years ago

grempe commented 6 years ago

Many public cloud load balancers expect services to provide an HTTP(s) accessible endpoint to verify its healthy operation. This will allow the load balancer to add or remove servers it is aware of from the load-balancer rotation automatically.

As roughenough is a UDP only server it is unable to respond to these health checks and therefore a bad server cannot be removed from rotation.

It would be very helpful if roughenough could (optionally?) also expose an endpoint on HTTP port 80 (e.g. /heartbeat) which would return a 200 response if the server is operating normally. This endpoint might even expose a tiny JSON response with additional info such as the output of a current roughtime timestamp in JSON form. e.g.

{"midpoint": 123, "radius": 1000000000}

Thanks.

int08h commented 6 years ago

I have been toying with this idea for a while now. It's genuinely useful. Thanks for brining it up.

Two approaches come to mind:

  1. Use a Rust HTTP server crate and run the healthcheck outside the main event loop, or
  2. Have a listening TCP socket in the mio event loop; parse HTTP requests and respond in the event loop

I'm leaning towards approach 2:

Thoughts? I think this satisfies what you described.

int08h commented 6 years ago

Few more thoughts:

  1. To keep attack surface minimal, I would prefer to not include an HTTP parser at all and instead have the HTTP health check port unconditionally squirt back an HTTP 200 response to all that connect.

  2. Health check functionality will be an optional build-time feature not compiled by default. Keeping with the idea that the health check endpoint is only to be exposed to non-internet, non-hostile networks, the health check feature will be opt-in, requiring build and runtime steps to enable. The intent is to reduce the chance of unintentionally deploying roughenough with the health check exposed to the internet.

grempe commented 6 years ago

I think that returning an HTTP 200 for all TCP requests on that port would do the trick for our situation. And I think that it does make sense to handle that in the main event loop so the response is in some way tied to the healthy operation of the server.

https://cloud.google.com/load-balancing/docs/health-checks#legacy-health-checks

Having this functionality as a runtime option definitely makes sense to me. I feel less strongly about it being a build time option (not compiled in by default). I don't see much harm in including it in a standard build as long as it requires positive action to enable it in the config. But for our use case we could work with either approach (as we build a docker image of roughenough from source to host from).

I look forward to trying it out!

grempe commented 6 years ago

v1.0.7? ;-)

int08h commented 6 years ago

@grempe look at what just landed, your health check has arrived. ;)

Some polish still required, but give it a go. I'm planning on releasing 1.1 this weekend if time permits.

grempe commented 6 years ago

Great @int08h !

I'll try to look today, but if not, I'll pick up 1.1 and let you know if I have any issues with the LB integration.

Cheers.

grempe commented 6 years ago

So far, so good. I see:

❯ curl -i http://127.0.0.1:8000
HTTP/1.1 200 OK
Content-Length: 0
Connection: close

I don't really have a way to test the negative case I think (other than shutting down roughenough). When the 1.1 release version comes out I can test it with the real load balancer health checks.

Thanks for implementing.

👍

grempe commented 6 years ago

I did note that the logging output is different for health check logs vs previously existing logs (_ vs ::).

chainpoint-roughtime    | 2018-10-27 01:34:09 INFO  [roughenough_server] Max response batch size : 64
chainpoint-roughtime    | 2018-10-27 01:34:09 INFO  [roughenough_server] Status updates every    : 600 seconds
chainpoint-roughtime    | 2018-10-27 01:34:09 INFO  [roughenough_server] Server listening on     : 0.0.0.0:2002
chainpoint-roughtime    | 2018-10-27 01:34:09 INFO  [roughenough_server] TCP health check        : 0.0.0.0:8000
chainpoint-roughtime    | 2018-10-27 01:34:13 INFO  [roughenough::server] health check from 172.21.0.1:59608
chainpoint-roughtime    | 2018-10-27 01:34:13 INFO  [roughenough::server] health check from 172.21.0.1:59610
chainpoint-roughtime    | 2018-10-27 01:34:22 INFO  [roughenough::server] health check from 172.21.0.1:59612
chainpoint-roughtime    | 2018-10-27 01:34:47 INFO  [roughenough::server] health check from 172.21.0.1:59614
chainpoint-roughtime    | 2018-10-27 01:34:48 INFO  [roughenough::server] health check from 172.21.0.1:59616
chainpoint-roughtime    | 2018-10-27 01:34:50 INFO  [roughenough::server] health check from 172.21.0.1:59618
chainpoint-roughtime    | 2018-10-27 01:34:52 INFO  [roughenough::server] health check from 172.21.0.1:59620
chainpoint-roughtime    | 2018-10-27 01:34:53 INFO  [roughenough::server] health check from 172.21.0.1:59622
chainpoint-roughtime    | 2018-10-27 01:34:55 INFO  [roughenough::server] health check from 172.21.0.1:59624
chainpoint-roughtime    | 2018-10-27 01:35:02 INFO  [roughenough::server] health check from 172.21.0.1:59626
int08h commented 6 years ago

Committed in 1.1.0

int08h commented 6 years ago

I did note that the logging output is different for health check logs vs previously existing logs (_ vs ::).

I noticed that too. Artifact of the logging library being used. It violates my aesthetic sensibilities too, but I haven't had time to look more deeply into it.