Open tsurdilo opened 2 years ago
The worker does not have a gRPC service endpoint. It would be good to have the SDK implement the health check so whoever runs a worker would have it.
While we wait for this to get implemented, does anyone have a workaround for those of us running workers in kubernetes? We recently deployed a broken docker image for our worker and didn't realize for a few days.
I've just started playing with Temporal in Kubernetes, but a simple liveness probe could look like this (assuming your image has `sh` and `pidof` installed, and that your worker has a binary whose name is unique on your pod runtime):
```yaml
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - pidof your-worker-binary >/dev/null
```
For the readiness probe I'm considering creating a file in an `emptyDir` volume when the worker is ready, and creating another file when the worker receives a termination signal. Then the readiness probe could look for the existence of said files.
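The file-based approach could be sketched like this (the mount path and file names are assumptions, not anything the SDK creates for you; your worker process would have to write and remove these files itself):

```yaml
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # /tmp/health is an assumed emptyDir mount; the worker writes
      # "ready" once it's up and "terminating" on SIGTERM
      - test -f /tmp/health/ready && ! test -f /tmp/health/terminating
```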
If the worker process died, the container would restart and you'd see a crash backoff loop. A `pidof` health check is never going to report unhealthy.
@paymog do you mean your application worker, or Temporal's (internal) worker, which is what this issue relates to? In what way was your worker failing that wasn't resulting in the container exiting?
I'm referencing the application worker here and specifically interested in having a health check to make staged rollouts easier in kubernetes.
This issue was referring to the internal system worker. We can't provide a health check for your application workers.
ah, my bad then! I think the other folks who 👍'd this comment are similarly confused. Why can't you provide a health check? There's a default way to expose Prometheus metrics on a provided port; why can't the SDK spin up an HTTP server that responds with a 200 once the worker is ready to process workflows/activities?
Healthy is context-specific. Your application worker may be running but failing to process workflow tasks because it lacks some resource that we can't know about. The worker being able to respond 200 OK to an incoming request doesn't mean it was successfully able to connect to Temporal for all the Workers within the process (there can be many Temporal SDK Workers per process). There is just no clear meaning to what healthy means.
I see. Could temporal instead provide a sample doc of how to spin up a health check server that each user could implement and customize?
Maybe, but it's "just" spinning up an HTTP listener with one endpoint, which in most languages is very easy. The rest would all be custom, so I'm not sure how much value there is there. There would be nothing Temporal-specific about it.
Put another way, it would be no different to how you'd add healthcheck endpoints to any other part of your application.
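For what it's worth, here's a minimal sketch of such an endpoint in TypeScript using only Node's built-in `http` module. The `workerReady` flag, the `/health` path, and port 8080 are all assumptions; you'd set the flag yourself once your Worker has been created:

```typescript
import * as http from "http";

// Hypothetical readiness flag: flip this to true once your Worker has
// been created (i.e. has connected to Temporal). Nothing in the SDK
// sets this for you.
let workerReady = false;

const server = http.createServer((req, res) => {
  if (req.url === "/health" && workerReady) {
    res.writeHead(200);
    res.end("ok");
  } else {
    // Not ready (or unknown path): Kubernetes treats non-2xx as failing
    res.writeHead(503);
    res.end("not ready");
  }
});

server.listen(8080);
```

A readiness probe would then just be an `httpGet` against `/health` on port 8080.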
Is there a canonical way to hook into the worker SDK to only spin up this server after the worker has connected to Temporal?
Ninja edit: well this is embarrassing, turns out I already implemented this in our worker by copy-pasting https://github.com/temporalio/sdk-typescript/blob/c1e7fffcd807493298d66d0063c4618eaf851206/packages/test/src/load/worker.ts#L191
EDIT: I suspect lots of users would be happy if there was a flag we could set during worker initialization that would handle this for us. Agreed that healthy is context-specific, but something like this would likely satisfy the vast majority of users.
It looks like the worker service (https://github.com/temporalio/temporal/tree/master/service/worker) currently does not expose a gRPC handler or the health check method the way the matching and history services do.
Please add this if possible.