opensafely-core / job-runner

A client for running jobs in an OpenSAFELY secure environment, requested via job-server (q.v.)

TPP Job raised INTERNAL_ERROR #709

Open · iaindillingham opened this issue 7 months ago

iaindillingham commented 7 months ago

Honeycomb reports that a TPP Job raised an INTERNAL_ERROR.^1 The affected jobs in Job Request 22288 are:

I've scanned the logs on L3 (both are 53,466 lines long 🤔), but can't determine why the jobs failed.

evansd commented 7 months ago

Interesting, looks like the state machine hit something unexpected:

Traceback (most recent call last):
  File "/home/opensafely/jobrunner/code/jobrunner/run.py", line 153, in handle_single_job
    synchronous_transition = trace_handle_job(job, api, mode, paused)
  File "/home/opensafely/jobrunner/code/jobrunner/run.py", line 190, in trace_handle_job
    synchronous_transition = handle_job(job, api, mode, paused)
  File "/home/opensafely/jobrunner/code/jobrunner/run.py", line 232, in handle_job
    api.terminate(job_definition)
  File "/home/opensafely/jobrunner/code/jobrunner/executors/logging.py", line 41, in wrapper
    status = method(job_definition)
  File "/home/opensafely/jobrunner/code/jobrunner/executors/local.py", line 208, in terminate
    assert current_status.state in [
AssertionError

From Honeycomb logs:

I note that we entered maintenance mode immediately afterwards, so my assumption would be that it's somehow related. We've had race-condition-like issues around this before.
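
For readers without the code to hand, the assert at local.py line 208 is truncated in the traceback above. It presumably has roughly the following shape; ExecutorState, JobStatus and the accepted-state list here are illustrative stand-ins, not the repository's actual definitions.

from dataclasses import dataclass
from enum import Enum


class ExecutorState(Enum):
    # Illustrative subset of executor states; the real enum has more members.
    EXECUTING = "executing"
    EXECUTED = "executed"
    ERROR = "error"


@dataclass
class JobStatus:
    state: ExecutorState


def terminate(job_definition, get_status):
    # terminate() assumes it is only ever asked to stop a job whose
    # container is still running; anything else trips the assert.
    current_status = get_status(job_definition)
    assert current_status.state in [
        ExecutorState.EXECUTING,
    ]
    # ... docker kill and cleanup would follow in the real implementation ...
    return current_status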

bloodearnest commented 7 months ago

I dug into it in the slack thread but here's my conclusion:

So, db mode killed it. It looks like the job was in the EXECUTING state, but apparently wasn't. Hypothesis: we know that it takes a while for job-runner to detect db mode. It's possible that both of those jobs had actually errored due to missing db tables and were no longer running, but the loop hadn't got round to detecting that and modifying their state to EXECUTED, so they tripped over this assert. That would also explain why neither container is still running (they'd already finished).

I think there are a couple of actions here:

1) Remove or rework this assertion (roughly as sketched below). Maybe we log unexpected states? Either way, it shouldn't cause an INTERNAL_ERROR, I think?
2) Follow up on TPP adding a delay before removing tables: https://bennettoxford.slack.com/archives/C010SJ89SA3/p1701786382187729
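
A minimal sketch of what that rework could look like, reusing the illustrative ExecutorState name from the earlier sketch (the real return value would follow whatever the executor API contract requires):

import logging

log = logging.getLogger(__name__)

EXPECTED_STATES_FOR_TERMINATE = {ExecutorState.EXECUTING}


def terminate(job_definition, get_status):
    current_status = get_status(job_definition)
    if current_status.state not in EXPECTED_STATES_FOR_TERMINATE:
        # The container has already stopped (or never started), so there is
        # nothing to kill: record the surprise and carry on rather than
        # raising and surfacing an INTERNAL_ERROR to the user.
        log.warning(
            "terminate() called for job %s in unexpected state %s",
            job_definition,
            current_status.state,
        )
        return current_status
    # ... normal path: docker kill and cleanup ...
    return current_status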

madwort commented 7 months ago

Just having a read around this: terminate() was called from the db maintenance mode block, which checks job.state == State.RUNNING. That state, I think, comes from the database rather than from the actual state of the world? Most of the rest of handle_job() bases its decisions on api.get_status().

I think it's a reasonable assumption for maintenance mode to attempt to stop all jobs that could possibly be running rather than only the jobs that actually are running (roughly the shape sketched below), so maybe we should just update terminate() to handle this gracefully. When I did some work on job-runner before, I looked at cancellation but not at db maintenance mode, so it's not on my state transition diagram and I didn't consider this possibility when I added that assert.
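
For context, the calling side being described is roughly this shape (a hedged paraphrase, not the actual run.py code; State, Job and maintenance_mode_check are stand-in names):

from dataclasses import dataclass
from enum import Enum


class State(Enum):
    # Database-level job states as recorded by job-runner (illustrative subset).
    RUNNING = "running"
    FAILED = "failed"


@dataclass
class Job:
    id: str
    state: State  # read from job-runner's database, updated by the main loop


def maintenance_mode_check(job, api, mode):
    # Stop anything the *database* thinks is running. The container may in
    # fact have exited already, which only api.get_status() would reveal, so
    # terminate() has to tolerate jobs that are not actually running.
    if mode == "db-maintenance" and job.state == State.RUNNING:
        api.terminate(job)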

madwort commented 7 months ago

here's a demo of the behaviour https://github.com/opensafely-core/job-runner/pull/710

I think this has fallen through the cracks because we have unit tests of LocalDockerAPI in test_local_executor.py and unit tests of run.py::manage_jobs in test_run.py, but I don't think we have integration tests that run anything via manage_jobs with the local executor. In this PR I've simulated the behaviour of run.py in test_local_executor, but obviously it would be better to use the code itself in this case (something along the lines of the sketch below). Maybe it's safest to fix this issue in the executor anyway, and then we can use this failing test as-is. (Or we fix it in run.py, delete this test, and it remains untested.)
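
As an illustration only, the missing coverage is roughly a test of this shape, reusing the stand-in Job, State, ExecutorState, JobStatus and maintenance_mode_check names from the sketches above rather than the project's real fixtures:

class FakeExecutor:
    # Stands in for LocalDockerAPI against a container that has already exited.
    def get_status(self, job):
        return JobStatus(ExecutorState.EXECUTED)

    def terminate(self, job):
        # Desired behaviour under test: no AssertionError for finished jobs.
        return self.get_status(job)


def test_db_maintenance_tolerates_already_finished_job():
    api = FakeExecutor()
    # The database still believes the job is running...
    job = Job(id="job-123", state=State.RUNNING)
    # ...but the container has finished; this must not raise.
    maintenance_mode_check(job, api, mode="db-maintenance")
    assert api.get_status(job).state is ExecutorState.EXECUTED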

evansd commented 7 months ago

Ah, great, thanks for digging into this properly.

Maybe safest to fix this issue in the executor anyway & then we can use this failing test as-is

I think it's a reasonable assumption for maintenance mode to attempt to stop all jobs that could possibly be running rather than all jobs that are running , so maybe we should just update terminate() to handle this gracefully

Pragmatically, both of these feel like the right immediate next steps to me.

bloodearnest commented 5 months ago

Note: removing the tech-support label, as this is scheduled to be fixed during deck-scrubbing.