temporalio / temporal

Add health check handler for worker service #2582

Open tsurdilo opened 2 years ago

tsurdilo commented 2 years ago

It looks like the worker service (https://github.com/temporalio/temporal/tree/master/service/worker) currently does not expose a gRPC handler or a health check method the way the matching and history services do.

Please add this if possible.

yiminc commented 2 years ago

The worker does not have a gRPC service endpoint. It would be good for the SDKs to implement the health check so whoever is running a worker would have it.

paymog commented 11 months ago

While we wait for this to get implemented, does anyone have a workaround for those of us running workers in Kubernetes? We recently deployed a broken Docker image for our worker and didn't realize it for a few days.

tredmon commented 3 months ago

While we wait for this to get implemented, does anyone have a workaround for those of us running workers in kubernetes?

I've just started playing with Temporal in Kubernetes, but a simple liveness probe could look like this (assuming your image has sh and pidof installed, and that your worker runs a binary whose name is unique within the pod):

livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - pidof your-worker-binary >/dev/null

For the readiness probe I'm considering creating a file in an emptyDir volume when the worker is ready, and creating another file when the worker receives a termination signal. Then the readiness probe could look for the existence of said files.
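A sketch of that readiness probe (the /health/ready and /health/terminating paths are just placeholders; the worker process itself would be responsible for creating them in the shared emptyDir volume):

readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # the ready file must exist and the terminating file must not
      - test -f /health/ready && ! test -f /health/terminating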

robholland commented 1 week ago

If the worker process died, the container would restart and you'd see a crash backoff loop. A pidof health check is never going to report unhealthy.

robholland commented 1 week ago

@paymog do you mean your application worker, or Temporal's (internal) worker, which is what this issue relates to? In what way was your worker failing that wasn't resulting in the container exiting?

paymog commented 1 week ago

I'm referring to the application worker here, and I'm specifically interested in having a health check to make staged rollouts easier in Kubernetes.

robholland commented 1 week ago

This issue was referring to the internal system worker. We can't provide a health check for your application workers.

paymog commented 1 week ago

ah, my bad then! I think other folks who 👍'd this comment are similarly confused. Why can't you provide a health check, though? There's a default way to expose Prometheus metrics on a provided port, so why can't the SDK spin up an HTTP server that responds with a 200 once the worker is ready to process workflows/activities?

robholland commented 1 week ago

Healthy is context-specific. Your application worker may be running but failing to process workflow tasks because it lacks some resource that we can't know about. The worker being able to respond 200 OK to an incoming request doesn't mean it was successfully able to connect to Temporal for all the Workers within the process (there can be many Temporal SDK Workers per process). There is just no single, clear meaning of "healthy".

paymog commented 1 week ago

I see. Could Temporal instead provide a sample doc showing how to spin up a health check server that each user could implement and customize?

robholland commented 1 week ago

Maybe, but it's "just" spinning up an HTTP listener with one endpoint, which in most languages is very easy. The rest would all be custom, so I'm not sure how much value there is in that. There would be nothing Temporal-specific about it.
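For instance, a minimal sketch in TypeScript/Node.js (the port, path, and readiness flag are all application-specific placeholders, not anything Temporal provides):

import http from 'node:http';

let ready = false; // your worker startup code would flip this

http
  .createServer((req, res) => {
    if (req.url === '/healthz' && ready) {
      res.writeHead(200).end('ok');
    } else {
      res.writeHead(503).end('not ready');
    }
  })
  .listen(8080);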

robholland commented 1 week ago

Put another way, it would be no different from how you'd add health check endpoints to any other part of your application.

paymog commented 1 week ago

Is there a canonical way to hook into the worker SDK so this server only spins up after the worker has connected to Temporal?

Ninja edit: well, this is embarrassing; it turns out I already implemented this in our worker by copy-pasting https://github.com/temporalio/sdk-typescript/blob/c1e7fffcd807493298d66d0063c4618eaf851206/packages/test/src/load/worker.ts#L191

EDIT: I suspect lots of users would be happy if there were a flag we could set during worker initialization that would handle this for us. Agreed that healthy is context-specific, but something like this would likely satisfy the vast majority of users.
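For reference, a sketch of that pattern with the TypeScript SDK: start the health listener only after Worker.create() has resolved, i.e. after the worker has connected to Temporal. The address, task queue, port, and workflows path here are placeholders, and this is a hand-rolled pattern rather than an official SDK feature:

import http from 'node:http';
import { NativeConnection, Worker } from '@temporalio/worker';

async function main() {
  // NativeConnection.connect() establishes the connection to the
  // server and Worker.create() validates the worker options, so
  // reaching this point is a reasonable readiness signal.
  const connection = await NativeConnection.connect({ address: 'localhost:7233' });
  const worker = await Worker.create({
    connection,
    taskQueue: 'example',
    workflowsPath: require.resolve('./workflows'),
  });

  const server = http
    .createServer((_req, res) => res.writeHead(200).end('ok'))
    .listen(8080);

  try {
    await worker.run(); // resolves when the worker shuts down
  } finally {
    server.close();
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});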