openfaas / nats-queue-worker

Queue-worker for OpenFaaS with NATS Streaming
https://docs.openfaas.com/reference/async/
MIT License

Feature: add healthcheck #61

Open alexellis opened 5 years ago

alexellis commented 5 years ago

Expected Behaviour

Healthcheck over HTTP or an exec probe which can be used by Kubernetes to check readiness and health

Current Behaviour

N/A

Possible Solution

Please suggest one of the options above, or see how other projects are doing this and report back.
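
For the HTTP option, here is a rough sketch in Go of what a healthcheck endpoint could look like. The handler, port, and the connected check are placeholders for illustration, not part of the current codebase:

package main

import (
	"net/http"
)

// healthHandler returns 200 when the worker considers itself healthy.
// connected() is a placeholder for whatever signal we settle on, e.g.
// the state of the NATS Streaming connection.
func healthHandler(connected func() bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if !connected() {
			http.Error(w, "not connected to NATS", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("OK"))
	}
}

func main() {
	// Report healthy unconditionally in this sketch; port 8081 is an
	// arbitrary choice.
	http.HandleFunc("/healthz", healthHandler(func() bool { return true }))
	http.ListenAndServe(":8081", nil)
}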

Context

A health-check lets Kubernetes detect a wedged or failed worker and restart it, making deployments more robust.

alexellis commented 5 years ago

See also: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/

https://github.com/openfaas/faas-netes/blob/master/chart/openfaas/templates/gateway-dep.yaml#L45

alexellis commented 4 years ago

@matthiashanel what are your thoughts on this?

matthiashanel commented 4 years ago

@alexellis, I can see how adding an HTTP endpoint makes sense if the service itself serves HTTP. There you'd get real feedback, e.g. if your service slows down, so would the health check endpoint. In the queue worker this would be largely unrelated, so I don't quite see the benefit justifying the added complexity. As for readiness, if connect fails the program will exit, causing a restart. When this happens, messages will continue to be stored in NATS Streaming.

Did you run into a concrete problem where this could help?

alexellis commented 4 years ago

Most Kubernetes services should have a way to express health and readiness via an exec, TCP, or HTTP probe. This can be used for a number of things including decisions about scaling or recovery.

If we're fairly sure that this is not required when interacting with NATS then I'll close it out.

I wonder if there is any value in exploring metrics instrumentation of the queue-worker itself, or if the metrics in the gateway and NATS itself are enough to get a good picture of things?
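
If we did instrument the queue-worker directly, a sketch with the Prometheus Go client could look like the following. The metric name is illustrative only, not an existing queue-worker metric:

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical counter for messages the worker has handled.
var messagesProcessed = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "queue_worker_messages_processed_total",
	Help: "Total messages the queue-worker has handled.",
})

func main() {
	prometheus.MustRegister(messagesProcessed)

	// In the real worker, Inc() would be called from the message handler.
	messagesProcessed.Inc()

	// Expose /metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8081", nil)
}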

matthiashanel commented 4 years ago

Health probe: The best value I can imagine the queue worker producing is how many messages it is currently processing. A value of 5 says little about whether scaling is needed or not. Scaling is needed if there are too many messages the service has not yet seen.

Readiness probe: The queue worker does not open a port or serve HTTP, which makes a readiness probe a tough nut to crack. Ready for the queue worker essentially means the NATS connection got established. If that does not work, the queue worker exits. I can imagine conditions where the streaming client does not return from connect. Starting a web server to protect against this by indicating readiness seems even more complex. Do I make sense here?
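
A lighter-weight guard against a connect that never returns is a connect timeout, which stan.go already supports via ConnectWait. A sketch, with the cluster ID, client ID, and URL as placeholders:

package main

import (
	"log"
	"time"

	stan "github.com/nats-io/stan.go"
)

func main() {
	// ConnectWait bounds how long Connect may block; if the server never
	// answers we fail fast, exit, and let Kubernetes restart the pod.
	sc, err := stan.Connect(
		"test-cluster",   // placeholder cluster ID
		"queue-worker-1", // placeholder client ID
		stan.NatsURL("nats://nats:4222"),
		stan.ConnectWait(10*time.Second),
	)
	if err != nil {
		log.Fatalf("failed to connect to NATS Streaming: %v", err)
	}
	defer sc.Close()
}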

We will get a lot more mileage out of the metrics NATS already has. In nats-streaming-server, all that would have to happen is opening the monitoring port with -m <port>.

This example shows how to discover channels and inspect them via curl:

nats-streaming-server -m 8080
curl http://127.0.0.1:8080/streaming/channelsz
{
  "cluster_id": "test-cluster",
  "server_id": "ZAs0tFNCNAd5CZuEm0I0xA",
  "now": "2020-05-13T14:49:04.556034-04:00",
  "offset": 0,
  "limit": 1024,
  "count": 2,
  "total": 2,
  "names": [
    "queue",
    "foo"
  ]
}
curl http://127.0.0.1:8080/streaming/channelsz\?channel\=queue
{
  "name": "queue",
  "msgs": 1,
  "bytes": 22,
  "first_seq": 1,
  "last_seq": 1
}
# this one also returns information about subscribers; quote the URL so
# the shell does not treat & as a background operator
curl "http://127.0.0.1:8080/streaming/channelsz?channel=foo&subs=1"

https://docs.nats.io/nats-streaming-concepts/monitoring#monitoring-a-nats-streaming-channel-with-grafana-and-prometheus
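
To illustrate the scaling signal mentioned above: the gap between a channel's last_seq and a subscriber's last_sent is roughly the backlog of messages the service has not yet seen. A sketch in Go that reads it from the monitoring endpoint, assuming the monitoring port from the example above and OpenFaaS's default faas-request channel:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// Minimal subset of the /streaming/channelsz?subs=1 response.
type channelz struct {
	Name          string `json:"name"`
	LastSeq       uint64 `json:"last_seq"`
	Subscriptions []struct {
		QueueName string `json:"queue_name"`
		LastSent  uint64 `json:"last_sent"`
	} `json:"subscriptions"`
}

func main() {
	resp, err := http.Get("http://127.0.0.1:8080/streaming/channelsz?channel=faas-request&subs=1")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var ch channelz
	if err := json.NewDecoder(resp.Body).Decode(&ch); err != nil {
		log.Fatal(err)
	}

	// Messages published but not yet sent to each subscriber: a rough
	// backlog figure that a scaler could act on.
	for _, sub := range ch.Subscriptions {
		fmt.Printf("%s backlog: %d\n", sub.QueueName, ch.LastSeq-sub.LastSent)
	}
}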