opensafely-core / job-runner

A client for running jobs in an OpenSAFELY secure environment, requested via job-server (q.v.)

Actions running out of order when a job is rerun #703

Closed: StevenMaude closed this issue 8 months ago

StevenMaude commented 8 months ago

https://bennettoxford.slack.com/archives/C33TWNQ1J/p1705572676801459

The following was reported to support:

I ran some jobs, let's say A -> B -> C. But C finished before B, and B finished before A. I've checked the dependencies are as expected and I'm fairly sure nothing is wrong there.

The state of the repo before I submitted the jobs was A(success), B(success), C(fail). So I fixed what was causing C to fail and reran. It looks like job runner started running B and C straight away because their respective antecedents had already run successfully, even though I requested a rerun in the same job request.

I remember a conversation (long ago!) that job runner would do the right thing in this scenario, but it looks like that isn't happening any more?

This is the job request: https://jobs.opensafely.org/investigating-the-effectiveness-of-the-covid-19-vaccination-programme-in-the-uk/comparative-booster-spring2023/21765/

It's a bit more complicated than A,B,C. But if you look at the dependencies and completion times for extract -> .... -> match_cv_B -> match_cv_B_report, you should see what I mean. … Actually it looks like the match actions ran correctly in sequence. But they started and finished before extract had finished (it's still running).

StevenMaude commented 8 months ago

A list of actions in job-server for the job with the reported issue.

So to summarise, I think an example of the incorrect ordering for the project.yaml here is: match_cv_A ultimately needs extract, yet match_cv_A is running at the same time as extract.
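For illustration only (this is not job-runner's actual code, and the names are hypothetical): the behaviour described is consistent with a readiness check that asks "has this dependency ever succeeded?" rather than "is this dependency also being re-run in the current job request?". A minimal sketch of the difference, using the A -> B -> C example from the report:

```python
from dataclasses import dataclass, field


# Hypothetical model of the state involved; these are illustrative
# structures, not job-runner's real ones.
@dataclass
class Action:
    name: str
    needs: list = field(default_factory=list)


def ready_to_run(action, previous_successes, rerun_in_this_request):
    """Buggy variant: only consult previously recorded successes, so a
    dependency that is being re-run in the same request is ignored."""
    return all(dep in previous_successes for dep in action.needs)


def ready_to_run_fixed(action, previous_successes, rerun_in_this_request):
    """Fixed variant: a dependency that is being re-run in this request
    must finish again before its dependents may start."""
    return all(
        dep in previous_successes and dep not in rerun_in_this_request
        for dep in action.needs
    )


# The reported scenario: A -> B -> C, where A and B succeeded previously,
# C failed, and all three are re-requested in one job request.
a = Action("A")
b = Action("B", needs=["A"])
c = Action("C", needs=["B"])
previous_successes = {"A", "B"}
rerun = {"A", "B", "C"}

print(ready_to_run(b, previous_successes, rerun))        # True  -> B starts before the fresh A finishes
print(ready_to_run_fixed(b, previous_successes, rerun))  # False -> B waits for the fresh A
```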

madwort commented 8 months ago

match_cv_A requires data_selection_cv, which is not in the specified list of actions; so it doesn't matter whether match_cv_A waits for extract to finish, it would always use the stale data?
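To make that dependency chain concrete, here is a rough sketch, assuming a simplified, hypothetical slice of the project's `needs` graph (the real project.yaml has many more actions). It walks the transitive dependencies of match_cv_A and shows which of them are covered by the job request:

```python
# Hypothetical, heavily simplified slice of the dependency graph.
needs = {
    "extract": [],
    "data_selection_cv": ["extract"],
    "match_cv_A": ["data_selection_cv"],
}


def transitive_needs(action, graph):
    """All upstream actions an action ultimately depends on."""
    seen = set()
    stack = list(graph[action])
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph[dep])
    return seen


requested = {"extract", "match_cv_A"}  # data_selection_cv was not re-requested
upstream = transitive_needs("match_cv_A", needs)

print(sorted(upstream))              # ['data_selection_cv', 'extract']
print(sorted(upstream - requested))  # ['data_selection_cv'] -> its existing output is reused as-is
```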

evansd commented 8 months ago

I think this is related to the behaviour discussed below whereby there's no mechanism that automatically marks the results of downstream actions as stale when an upstream action is re-run: