opensafely-core / job-runner

A client for running jobs in an OpenSAFELY secure environment, requested via job-server (q.v.)

Fix missing MAX_WORKERS enforcement in executor api #459

Open bloodearnest opened 2 years ago

bloodearnest commented 2 years ago

In the move to the executor API, we made the decision that the executor would be responsible for applying back pressure when there are no free resources. Different implementations will have different resource levels and ways to query current usage.

However, we never actually implemented this for the local docker executor. So since we switched to the new executor API in prod (Nov 2021), there have been no concurrent job limits.

This actually explains the repeated resource issues we had in TPP leading up to the VM switch in April. We haven't seen much resource exhaustion since then, which is likely due primarily to the increased resources. Additionally, the action dependencies in many projects' project.yaml files enforce a maximum concurrency anyway, except for highly parameterised jobs.

It also explains the few vague reports we've seen from users that local runs have slowed down recently, that they've been forced to run jobs one by one, and that they're hitting memory issues. I originally thought this was to do with the addition of a per-job 4GB memory limit, but it's more likely because the tool is trying to run every runnable job at once. Note that executor API usage for the opensafely tool was switched on by default in early May, quite a bit later than prod.

A few such highly parallel projects ran together on TPP this week, bringing the total concurrent job count to 47 and exhausting memory. Due to the cgroup limiting, the VM was still running, but docker itself is included in that limit, so it was struggling to respond to API queries. We killed 6 jobs, and resource usage dropped enough to unstick things.

As an emergency measure, I manually patched the old concurrency management into the TPP job-runner and restarted it. This worked: MAX_WORKERS=24 is currently being respected, and anecdotally we're running at around 60% mem/cpu when full.

I think the first response is to formally add the old management back in, and release a new version of the opensafely tool. This uses the sqlite database to track the number of currently running jobs.
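For reference, here is a minimal sketch of what that scheduler-side check looks like, assuming a sqlite `jobs` table with a `state` column; the real job-runner schema, state names, and default limit may differ:

```python
import os
import sqlite3

# Hypothetical default value; the real default may differ.
MAX_WORKERS = int(os.environ.get("MAX_WORKERS", "10"))


def running_job_count(conn: sqlite3.Connection) -> int:
    # Count the jobs the scheduler believes are currently executing.
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM jobs WHERE state = ?", ("RUNNING",)
    ).fetchone()
    return count


def can_start_new_job(conn: sqlite3.Connection) -> bool:
    # Back pressure: only start another job while under the limit.
    return running_job_count(conn) < MAX_WORKERS
```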

However, do we want to apply a concurrency limit in the job-runner scheduler, or is our original goal of the backend providing back pressure still the correct approach?

We could add the current mechanism into the local docker executor, but it relies on accessing the sqlite db, which is a layering violation. We'd probably want to query docker for the number of running jobs to get the current level, rather than relying on the db, but there may be a performance/load issue there (given that the docker API is prone to timeouts).
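Counting via docker could look something like this sketch, which shells out to `docker ps` with a label filter. The label here is hypothetical: job-runner's containers may be identified by a different label or by a name prefix, and the call needs a timeout precisely because of the flakiness mentioned above.

```python
import subprocess


def running_container_count(timeout: int = 10) -> int:
    # Bulk query: how many containers carrying the (hypothetical) job label are running?
    result = subprocess.run(
        ["docker", "ps", "--filter", "label=jobrunner=job", "--quiet"],
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout,  # guard against the docker API hanging
    )
    return len([line for line in result.stdout.splitlines() if line.strip()])
```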

bloodearnest commented 2 years ago

The hotfix I've applied to prod in TPP is working, but I'm reluctant to adopt it as the fix, as it would by default limit concurrency on graphnet and others to the number of CPUs job-runner is running on, which is in no way related to their actual capacity. But it's an easy fix to tell graphnet to set MAX_WORKERS to override it, I guess.
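To be concrete, the default being described is roughly the following sketch; the actual hotfix's fallback may be computed differently:

```python
import os

# Fall back to the CPU count of the machine job-runner runs on,
# unless MAX_WORKERS is set explicitly in the environment.
MAX_WORKERS = int(os.environ.get("MAX_WORKERS") or os.cpu_count() or 1)
```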

I'm planning to go with that for now, so I can hopefully get a fixed opensafely-cli out ASAP.

evansd commented 2 years ago

Well, well, well ... at least that explains the problems we've been having, and means that they're solvable, so that's a good thing! Good work on spotting this and getting a hotfix out.

I guess my days of having strong, detailed opinions on job-runner things are coming to an end; but, for what they're worth, here are my thoughts.

I think ultimately our original goal of having the executor be responsible for resource management and backpressure has got to be the right one. But I think the reason it's difficult to do at the moment is that the way we've factored the split between executor/scheduler/job-server isn't quite right. I think a better model would be the one originally proposed by BenBC (and which I think we've discussed a bit already) where the scheduler goes away and its responsibilities get split between job-server and the executor. Under that model, rather than the executor being invoked independently for each job we want to run, it instead gets supplied a big list of everything job-server would like it to run (together with metadata about priority, likely resource consumption etc) and it then decides what it's actually going to run. That way it has the full context for making these decisions and we don't need to poke around in the database to discover it.
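A purely illustrative sketch of that API shape, with entirely hypothetical names (nothing below exists in job-runner today): the executor receives the full set of desired jobs plus priority/resource hints and owns the decision about what actually runs.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class DesiredJob:
    job_id: str
    priority: int
    expected_memory_mb: int  # hypothetical resource hint supplied by job-server


class BulkExecutor(Protocol):
    def reconcile(self, desired: list[DesiredJob]) -> None:
        """Given everything job-server wants running, start or hold jobs as
        resources allow -- the executor owns back pressure under this model."""
```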

(I know there are good reasons for the current split being the way it is as part of a process of incremental improvement so I'm not saying we made wrong decisions in the past, just that it's worth keeping the ultimate end goal in mind.)

Given all that, it seems reasonable to me to add the concurrency management back to the scheduler. It's not where we want to end up, but there's a lot of other work to do before we get there and I feel like this would be the simplest and most robust mechanism in the meantime.

bloodearnest commented 2 years ago

I agree that the goal is for the scheduler parts of job-runner to live in job-server. I actually think our API fault line between the two parts is correctly aligned to support this as is (imagine that the job_executor.py API was a REST API), modulo the question of examining output files (#425).

But I don't think that ultimately affects the decisions about where we apply throttling logic.

As it is now, we can apply it in the scheduler, and/or in the executor. If the scheduler part moved to job-server, then we could still choose to apply throttling there (i.e. not send the job to the executor, as the current implementation does). And we could still apply it separately in the executor.

I'm a little cautious about throttling at 2 different places, and in the long term, I think I'd prefer to throttle just in the backend, but I think it's not the worst problem we'd have.

evansd commented 2 years ago

I actually think our API fault line between the two parts is correctly aligned to support this as is (imagine that job_executor.py API was a rest API)

That API takes jobs one at a time, and what I was suggesting is that we'd tell the executor everything we want to be running all at once and then let it decide what to do with that information. But I suppose that's a question of the API shape rather than where we draw the system fault line. I think I just expressed myself incorrectly.

benbc commented 2 years ago

Querying the Docker API to count running jobs sounds reasonable on the face of it. Is that API really so flaky that we don't want to call it?

bloodearnest commented 2 years ago

It's just that we'd be calling it for every PENDING job, on every loop. And we are still seeing random docker API timeouts, which makes me cautious. But it should be fine.

benbc commented 2 years ago

We could cache, I suppose. Both to reduce the number of calls and give us a reasonably fresh estimate in case of failures.
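Something like this sketch (building on the hypothetical `running_container_count` above) would cap the call rate and ride out transient failures by serving the last known value:

```python
import time

CACHE_TTL = 30  # hypothetical refresh interval in seconds

_last_count = 0
_last_fetched = float("-inf")


def cached_running_container_count() -> int:
    global _last_count, _last_fetched
    now = time.monotonic()
    if now - _last_fetched >= CACHE_TTL:
        _last_fetched = now  # don't hammer docker during an outage
        try:
            _last_count = running_container_count()
        except Exception:
            # Docker API flaked out; keep serving the stale estimate.
            pass
    return _last_count
```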

evansd commented 2 years ago

I think this is another instance where the one-job-at-a-time API is hurting us. If we could do a single call to the Docker API to say "what's your state?" it would be much more efficient than asking it about each individual job, especially if we include all the PENDING ones.

bloodearnest commented 2 years ago

This query would be a bulk one: basically "how many containers are currently running with this label or name pattern?"

But we'd have to call it in each individual handle_job(job) call, unless we cache or track it internally within one pass of the loop (e.g. grab the running job count at the start of the loop, and +1 it if we start one).

We don't have to be super accurate, as it will correct itself next iteration of the loop.
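In sketch form (function names hypothetical), one pass of the loop would grab the count once and track it locally:

```python
def handle_pending_jobs(pending_jobs, max_workers):
    # One bulk docker query per pass of the main loop.
    running = cached_running_container_count()
    for job in pending_jobs:
        if running >= max_workers:
            break  # back pressure: leave the rest PENDING until the next loop
        start_job(job)  # hypothetical helper that launches the container
        running += 1  # slight drift is fine; it self-corrects next iteration
```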