que-rb / que

A Ruby job queue that uses PostgreSQL's advisory locks for speed and reliability.

Worker health checks #260

Open hlascelles opened 4 years ago

hlascelles commented 4 years ago

We are deploying que workers in our Kubernetes environments, and would like to add Liveness and Readiness checks.

Our immediate problem / use case is that CPU levels spike during startup (worker init), so the autoscaler brings in more pods in a cascade. Adding an initialDelaySeconds can help with that, but the bigger issue of worker health still needs to be addressed.

We've had a few ideas, including something as simple as the worker writing a file to disk when it is "ready" which can then be checked for, or an embedded Sinatra app that can be curled from inside the container (to avoid opening up ports), like so:

```yaml
readinessProbe:
  exec:
    command: ["curl", "--silent", "--show-error", "--fail", "http://localhost:8080/health"]
  initialDelaySeconds: 10
  periodSeconds: 5
```
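
For the embedded-server idea, a minimal sketch of what that could look like (this assumes the sinatra gem plus a Rack server are in the worker's bundle; the `ready` flag and class names are illustrative, not anything que provides):

```ruby
require "sinatra/base"

# Tiny health endpoint run on a background thread inside the worker
# process, so nothing needs to be exposed outside the container.
class HealthApp < Sinatra::Base
  class << self
    # Flipped to true by the worker's boot code once Que workers are running.
    attr_accessor :ready
  end

  get "/health" do
    self.class.ready ? "ok" : halt(503, "booting")
  end
end

Thread.new do
  HealthApp.run!(bind: "127.0.0.1", port: 8080)
end

# ... start Que workers, then mark the process as ready:
HealthApp.ready = true
```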

Beyond that it gets more complicated: how to tell the difference between a hung worker and one that is processing a very long task, etc.

How are others handling this issue?

siegy22 commented 4 years ago

Would it be helpful if we introduced some kind of command to check the status of que (workers)?

Something like bin/que healthcheck, which would return 0 when all the workers have booted?
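
As a rough sketch of what such a check could do from outside que itself (assuming the que 1.x schema, where each running locker registers itself in the `que_lockers` table; the script and its name are hypothetical, not an existing `bin/que` subcommand):

```ruby
#!/usr/bin/env ruby
# Hypothetical healthcheck: exit 0 if at least one que locker running on
# this host has registered a row in que_lockers (que 1.x does this on boot).
require "socket"
require "pg"

conn = PG.connect(ENV.fetch("DATABASE_URL"))
count = conn.exec_params(
  "SELECT count(*) FROM que_lockers WHERE ruby_hostname = $1",
  [Socket.gethostname]
).getvalue(0, 0).to_i

exit(count.positive? ? 0 : 1)
```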

hlascelles commented 4 years ago

Yes, something like that would be good (though which one do you connect to if you are running multiple worker "instances"?). Or have it write a file to disk and take a lock on it that can be detected? Open to ideas!

allquantor commented 4 years ago

Is there an idiomatic way to extract state for healthcheck / metric purposes? Like

hlascelles commented 4 years ago

@allquantor excellent point, metrics would be perfect for horizontal scaling (HPA). It will need to request queue depth by queue name and distinguish between "overdue" and "future scheduled".
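
For the queue-depth piece, a sketch of the kind of query that could back it (this assumes a que 1.x `que_jobs` schema with `finished_at` / `expired_at` columns; `Que.execute` is que's own query helper):

```ruby
# Per-queue depth, split into "overdue" (run_at in the past) and
# "future scheduled" (run_at in the future), ignoring finished/expired jobs.
rows = Que.execute(<<~SQL)
  SELECT queue,
         count(*) FILTER (WHERE run_at <= now()) AS overdue,
         count(*) FILTER (WHERE run_at >  now()) AS future_scheduled
  FROM que_jobs
  WHERE finished_at IS NULL
    AND expired_at IS NULL
  GROUP BY queue
SQL
# => one row per queue with overdue / future_scheduled counts
```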

This is starting to look like a full json payload as a response, not just an exit code.

In fact, we may as well do the work in one go to provide a Prometheus-compatible endpoint, and we'll get the health check almost incidentally.

If that sounds like the right approach, I think an internal Sinatra server thread would be the route. Thoughts?

hlascelles commented 4 years ago

Something like these (except they run as separate processes).

https://github.com/kaorimatz/resque_exporter
https://github.com/moznion/resque_exporter

coffenbacher commented 4 years ago

> How are others handling this issue?

As a liveness check, I ended up wrapping Que::Job in a class that writes a file to /tmp/liveness any time it finishes a job. Then I configured k8s to check that the file has been updated more recently than the expected runtime of our longest job. Seems to be working great :+1: In practice this solves https://github.com/que-rb/que/issues/241 for us as well, although it wouldn't work for non-k8s users.
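
A rough sketch of that wrapper, with made-up class names rather than the actual code: a base job class that touches the liveness file whenever a job finishes, whether it succeeded or raised.

```ruby
require "fileutils"

# Base class: every job inheriting from this refreshes /tmp/liveness when it
# finishes, so the k8s exec probe can check the file's mtime.
class LivenessTrackedJob < Que::Job
  LIVENESS_FILE = "/tmp/liveness"

  def run(*args)
    do_work(*args)
  ensure
    FileUtils.touch(LIVENESS_FILE)
  end
end

# Concrete jobs implement do_work instead of run:
class ImportJob < LivenessTrackedJob
  def do_work(record_id)
    # ... actual work ...
  end
end
```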

> How to tell between a hung worker and one that is processing a very long task etc...

Our workload doesn't really have any incredibly long tasks at the moment, but one option might be to put a fixed or calculable timeout attribute in the task, and write it to a file before the job starts. Then adapt the k8s liveness probe script to use that instead of a fixed time. If your worker dies in the middle of the task it would still be dead, but at least only for the duration of ~1 job.
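
A sketch of that variation (again with made-up names), building on the wrapper above: write the job's expected timeout to a second file before the work starts, so the probe script can compare the liveness file's age against a per-job deadline instead of a fixed one.

```ruby
# Sketch only: record a per-job timeout hint for the probe script to read.
class TimeoutHintedJob < LivenessTrackedJob
  TIMEOUT_HINT_FILE = "/tmp/liveness_timeout"

  # Jobs override this with a fixed or calculated expected runtime (seconds).
  def expected_timeout
    300
  end

  def run(*args)
    File.write(TIMEOUT_HINT_FILE, expected_timeout.to_s)
    super
  end
end
```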

MyklClason commented 4 years ago

Just chiming in here: coffenbacher seems to have a good solution, but the hung vs. long-running task detection could perhaps be improved.

One option might be to keep a log file for each worker (which seems to be roughly what coffenbacher is saying), then check that. Set up the long-running task to log progress to that file. How often to log progress is hard to say, but anywhere from about every second to every minute would probably be practical. I'd say roughly sqrt(N) seconds (rounded up) between log entries: if a task is expected to take 5 minutes, that's 300 seconds, so an entry roughly every 17 seconds (which also means it will log about 17 times). This should strike a decent balance between the number of log entries written and the time between them. The start of the log should record the expected interval. If more than, say, 2.5x the expected interval has passed since the last entry, the process has hung (one could be more or less conservative depending on how accurate the log timing is expected to be).
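
In code, the interval calculation described above is just the following (a sketch; `heartbeat_interval` is not an existing helper):

```ruby
# sqrt(N) heartbeat interval: expected_seconds is the task's expected
# runtime, unit is the granularity in seconds (1.0, 0.1, ...).
def heartbeat_interval(expected_seconds, unit: 1.0)
  Math.sqrt(expected_seconds / unit).ceil * unit
end

heartbeat_interval(300)            # => 18.0 (an entry every ~18s, ~17 entries)
heartbeat_interval(60)             # => 8.0
heartbeat_interval(60, unit: 0.1)  # => 2.5  (about 24 entries)
```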

Technically this log could be stored in the database as well (with a Rails task checking it), but it might be tricky to get a value back via a CLI command. As a benefit of using sqrt(N) seconds as the interval, even if the task runs for a week it would produce fewer than 800 log entries (with about 800 seconds between entries), which is small whether you're reading a file or checking via the CLI. You can also use a 0.1 (or even 0.01) second unit instead if you are looking at minutes rather than hours or days: for a 60-second task, a 1-second unit gives about 8 seconds between entries, while sqrt(60 / 0.1) * 0.1 gives about 2.5 seconds, which means logging about 24 times instead of about 8. That's not much of a difference, and even 60 times wouldn't be too bad.

Of course, the downside here is the need to adjust long-running tasks so they can log to a file (or the database) every X seconds, which is going to be a small performance hit and require passing additional data to the task (one way or another). Though really you could just log the last method the worker was in every X seconds or so (see https://ruby-doc.org/core-2.6.5/TracePoint.html for the possibilities, though older Ruby versions may need a wrapping method instead). In any case, it means that rather than taking N seconds to determine that a process has hung (a simple timeout), it only takes about 2.5 * sqrt(N) seconds; with the tracing approach above, 1.1 * sqrt(N) would theoretically be enough.

airhorns commented 4 years ago

FWIW, I think there are que deployments that could have 0 jobs processed for many minutes at a time where all the workers are still "healthy" by my definition.

I have been using que for several hour-long import/export jobs from remote APIs. While super long-running jobs are an antipattern from a resiliency standpoint, when it's someone else's API that's throttling you, you don't have much control over how long the work takes, so I feel like it would be hard for me to ever estimate the actual upper bound of a task's runtime. Even if I did, it'd be so high that if the liveness check only started failing after about that long, it wouldn't be useful.

I think that it might be a safer option for que itself to export a healthiness indicator of some sort that just describes if the Worker or Poller is able to check for jobs to dequeue, and treat job execution as an opaque, unpredictable process that has no impact on healthiness. I think this would capture a lot of the intention of a health check (is the connection to PG working, is the ruby process getting CPU time, etc etc) without requiring configuration from the user or coupling to the nature of their jobs / queues / concurrency / etc.

There's already the master thread in the Locker/Listener doing a fixed-timeout work loop. I would suggest having it log / update the metrics / update the web interface / whatever every time it completes one of those loops. That way you get a reliable, consistent ping from a healthy worker to whatever that interface is, regardless of whether long jobs are happening or no jobs are happening, and you don't require any config other than the work loop timeout, which already has a default and is known to operators.
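
If que (or application code) did emit such a per-loop heartbeat, the probe side could stay as simple as checking the file's freshness against a few multiples of the loop timeout. A sketch, where the file path, the heartbeat itself, and the 5-second interval are all assumptions rather than current que behaviour:

```ruby
#!/usr/bin/env ruby
# Hypothetical liveness probe: fail if the heartbeat file (assumed to be
# touched once per work loop) hasn't been updated within a few intervals.
HEARTBEAT_FILE = "/tmp/que_heartbeat" # assumed path, not written by que today
LOOP_INTERVAL  = 5                    # assumed work-loop timeout, in seconds

begin
  age = Time.now - File.mtime(HEARTBEAT_FILE)
  exit(age < LOOP_INTERVAL * 3 ? 0 : 1)
rescue Errno::ENOENT
  exit 1 # no heartbeat written yet
end
```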

I think if individual applications care deeply about whether specific jobs are being processed, or whether job throughput is around the right level, que can export metrics for those applications to send to an operations dashboard or alerting system. How to alert on those metrics is gonna vary wildly from application to application, especially if you have little to no job throughput or super long running jobs (like me), so I think it wise to keep that problem out of core and just expose as much information as possible, with some nice docs that help operators keep it runnin' good.

aldent95 commented 2 years ago

Has anyone made any progress on this?

I'm looking for a job worker to move my Rails app onto, and Que seems like my only choice if I want cron jobs and a job system backed by Postgres. But given all my systems run on a K8s cluster, it seems it's not going to be possible to deploy Que workers until this issue is resolved.

coffenbacher commented 2 years ago

YMMV but the liveness check approach I described above has worked for my team with no issues for 2 years :+1:

aldent95 commented 2 years ago

> YMMV but the liveness check approach I described above has worked for my team with no issues for 2 years 👍

How does that work when you have multiple worker nodes? I would have thought that would only really work with a single node.

coffenbacher commented 2 years ago

> How does that work when you have multiple worker nodes? I would have thought that would only really work with a single node.

K8s liveness probes are per-container, so if any of our worker containers hasn't updated its liveness check file beyond the timeout, k8s restarts it. As far as I know there's no relation to node count, but I might not be totally following the question as well.

aldent95 commented 2 years ago

> K8s liveness probes are per-container, so if any of our worker containers hasn't updated its liveness check file beyond the timeout, k8s restarts it. As far as I know there's no relation to node count, but I might not be totally following the question as well.

So the problem I see using that method in my situation is that I'm going to be running 2 workers due to potential spikes in workload (I've got spare CPU I already pay for and can't be bothered to autoscale), but they might not have any jobs to run for a while. Specifically, I have one job that runs every 10 minutes to check for updates to a bunch of data. If there is data to update, it likely kicks off about 5,500 jobs to update each different data point and push updates out to clients. Those jobs need to be completed fairly quickly as well, hence the multiple workers.

If I'm understanding your method, then with writing to the file only on job completion there is a chance of one of the two workers not actually getting a job for 40-50 minutes at a time, depending on which one keeps picking up the single job. That would likely result in the file time expiring and the worker being killed off and re-created.

coffenbacher commented 2 years ago

That's right. Two ideas come to mind, if either works for your use case:

1) Just let the containers be killed / re-created periodically when idle; this isn't that weird in k8s (e.g. ephemeral cron containers).

2) Schedule a no-op liveness check job every minute with que-scheduler (sketched below), just to make sure your workers are ready for when you need them: https://github.com/hlascelles/que-scheduler
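
For option 2, the job itself can be trivial (illustrative sketch; the every-minute schedule entry would live in que-scheduler's config):

```ruby
require "fileutils"

# No-op liveness job: scheduled every minute (e.g. via que-scheduler), it just
# refreshes the same liveness file the k8s probe watches.
class LivenessPingJob < Que::Job
  def run
    FileUtils.touch("/tmp/liveness")
  end
end
```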

ZimbiX commented 2 years ago

I just came across a fork I'd missed - perhaps a similar approach to this would suffice: https://github.com/gocardless/que/pull/65

nathanhamilton commented 2 years ago

This is the solution that we implemented: