iaindillingham opened 7 months ago
Interesting, looks like the state machine hit something unexpected:
```
Traceback (most recent call last):
  File "/home/opensafely/jobrunner/code/jobrunner/run.py", line 153, in handle_single_job
    synchronous_transition = trace_handle_job(job, api, mode, paused)
  File "/home/opensafely/jobrunner/code/jobrunner/run.py", line 190, in trace_handle_job
    synchronous_transition = handle_job(job, api, mode, paused)
  File "/home/opensafely/jobrunner/code/jobrunner/run.py", line 232, in handle_job
    api.terminate(job_definition)
  File "/home/opensafely/jobrunner/code/jobrunner/executors/logging.py", line 41, in wrapper
    status = method(job_definition)
  File "/home/opensafely/jobrunner/code/jobrunner/executors/local.py", line 208, in terminate
    assert current_status.state in [
AssertionError
```
From Honeycomb logs:
I note that we entered maintenance mode immediately afterwards, so my assumption would be that it's somehow related. We've had race-condition-like issues around this before.
I dug into it in the Slack thread, but here's my conclusion:

So, db mode killed it. It looks like the job was in the EXECUTING state, but apparently wasn't. Hypothesis: we know that it takes a while for job-runner to detect db mode. It's possible both of those jobs actually errored due to missing db tables and were not running, but the loop hadn't yet got round to detecting that and updating the state to EXECUTED, so they tripped over this assert. That would also explain why neither container is still running (they'd already finished). I think there are a couple of actions here:

1. Remove/rework this assertion. Maybe we log unexpected states? But it shouldn't cause an internal error, I think.
2. Follow up on TPP adding a delay before removing tables: https://bennettoxford.slack.com/archives/C010SJ89SA3/p1701786382187729
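Action 1 could look something like this sketch, which logs the unexpected state instead of asserting. Note that `ExecutorState`, `JobStatus`, and the body of `terminate()` here are simplified stand-ins based on the traceback, not the real job-runner code:

```python
import enum
import logging
from dataclasses import dataclass

log = logging.getLogger("local_executor")


class ExecutorState(enum.Enum):
    PREPARED = "prepared"
    EXECUTING = "executing"
    EXECUTED = "executed"


@dataclass
class JobStatus:
    state: ExecutorState


def terminate(current_status: JobStatus) -> JobStatus:
    # Previously: `assert current_status.state in [...]`, which raised
    # AssertionError (surfacing as INTERNAL_ERROR) when the container
    # had already exited. Instead, log the surprise and return the
    # current status unchanged.
    if current_status.state not in (ExecutorState.PREPARED, ExecutorState.EXECUTING):
        log.warning("terminate() called in unexpected state %s", current_status.state)
        return current_status
    # ... actually stop the container here ...
    return JobStatus(ExecutorState.EXECUTED)
```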
Just having a read around this - `terminate()` was called from the db maintenance mode block, which checks `job.state == State.RUNNING`. I think that state comes from the database, not from the actual state of the world; most of the rest of `handle_job()` bases its decisions on `api.get_status()`.

I think it's a reasonable assumption for maintenance mode to attempt to stop all jobs that could possibly be running rather than all jobs that are running, so maybe we should just update `terminate()` to handle this gracefully. When I did some work on job-runner before, I looked at cancellation but I did not look at db maintenance mode, so it's not on my state transition diagram and I didn't consider this possibility when I added that `assert`.
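To illustrate the race: the db row can still say RUNNING after the container has already exited, so the maintenance-mode block could consult the executor's view before terminating. This is only a sketch; `Job`, `State`, and the fake executor API below are invented stand-ins, not job-runner's actual types:

```python
import enum
from dataclasses import dataclass


class State(enum.Enum):
    PENDING = "pending"
    RUNNING = "running"


class ExecutorState(enum.Enum):
    EXECUTING = "executing"
    EXECUTED = "executed"


@dataclass
class Job:
    id: str
    state: State  # the database's view, which can lag behind reality


class FakeAPI:
    """Stand-in executor whose container has already exited."""

    def get_status(self, job):
        return ExecutorState.EXECUTED

    def terminate(self, job):
        pass  # only reached when the executor says the job is running


def maintenance_mode_stop(jobs, api):
    # Stop jobs on entering db maintenance mode, but check the
    # executor's state (not just job.state) before calling terminate().
    stopped = []
    for job in jobs:
        if job.state != State.RUNNING:
            continue
        if api.get_status(job) == ExecutorState.EXECUTING:
            api.terminate(job)
            stopped.append(job.id)
    return stopped
```

With this shape, a job whose db state lags behind its exited container is skipped rather than tripping an assert.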
Here's a demo of the behaviour: https://github.com/opensafely-core/job-runner/pull/710
I think this has fallen through the cracks because we have unit tests of `LocalDockerAPI` in `test_local_executor.py` and unit tests of `run.py::manage_jobs` in `test_run.py`, but I don't think we have integration tests that run things via `manage_jobs` with the local executor. In this PR I've simulated the behaviour of `run.py` in `test_local_executor`, but obviously it would be better to use the code itself in this case. Maybe it's safest to fix this issue in the executor anyway, and then we can use this failing test as-is. (Or we fix it in `run.py`, delete this test, and it remains untested.)
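As a rough shape for such a test, the key assertion is that terminating a job whose container has already exited does not raise. Everything here is a stub (the real test would drive `manage_jobs` against `LocalDockerAPI`); the graceful-terminate behaviour is assumed, not yet implemented:

```python
import enum


class ExecutorState(enum.Enum):
    EXECUTING = "executing"
    EXECUTED = "executed"


class StubExecutor:
    """Stub executor whose container has already exited (EXECUTED)."""

    def __init__(self):
        self.terminated = []

    def get_status(self, job):
        return ExecutorState.EXECUTED

    def terminate(self, job):
        # A graceful terminate() tolerates non-EXECUTING states instead
        # of asserting.
        status = self.get_status(job)
        if status != ExecutorState.EXECUTING:
            return status
        self.terminated.append(job)
        return ExecutorState.EXECUTED


def test_terminate_after_container_exit_does_not_raise():
    api = StubExecutor()
    # Simulate the maintenance-mode path: the db thought the job was
    # RUNNING, but the container has already finished.
    result = api.terminate("job-1")
    assert result == ExecutorState.EXECUTED
    assert api.terminated == []  # nothing was actually stopped


test_terminate_after_container_exit_does_not_raise()
```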
Ah, great, thanks for digging into this properly.
> Maybe safest to fix this issue in the executor anyway & then we can use this failing test as-is

> I think it's a reasonable assumption for maintenance mode to attempt to stop all jobs that could possibly be running rather than all jobs that are running, so maybe we should just update `terminate()` to handle this gracefully
Pragmatically, both of these feel like the right immediate next steps to me.
Note: removing the tech-support label, as this is scheduled to be fixed during deck-scrubbing.
Honeycomb reports that a TPP Job raised an INTERNAL_ERROR.^1 The affected jobs in Job Request 22288 are:
I've scanned the logs on L3 (both are 53,466 lines long 🤔), but can't determine why the jobs failed.