opensafely-core / job-server

A server for mediating jobs that can be run in an OpenSAFELY secure environment. q.v. job-runner
https://jobs.opensafely.org

Large `project.yaml` can lead to timeouts on "run jobs" page #4617

Open evansd opened 2 months ago

evansd commented 2 months ago

A user reported getting the big blue Dokku error page:

We're sorry, but something went wrong. If you are the application owner please check the logs for more information.

when trying to access this URL: http://jobs.opensafely.org/comparing-disparities-in-rsv-influenza-and-covid-19/disparities-comparison-rsv-flu-c19/run-jobs/

The cause seems to be this very large (programmatically generated) `project.yaml` file with 6,874 actions in it: https://github.com/opensafely/disparities-comparison/blob/6cba2f6f1609f26dc4ee5698f51afcc7e9be3840/project.yaml
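For context, each entry under `actions:` looks roughly like the snippet below (the action name, command, and paths here are invented for illustration, not taken from the linked file); the problem file simply contains thousands of these:

```yaml
# Rough illustration only: the action name, command, and paths are invented,
# not copied from the linked project.yaml.
version: '3.0'

actions:
  generate_dataset_2018_week01:
    run: ehrql:v1 generate-dataset analysis/dataset_definition.py --output output/dataset_2018_week01.arrow
    outputs:
      highly_sensitive:
        dataset: output/dataset_2018_week01.arrow
```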

In the short term, the user has been advised to reduce the number of actions to the minimum needed to get started with the project.

It's possible we'll decide we just don't support this many actions (the user expressed surprise at the number, so possibly they don't actually need all of them in any case). But if so, it would be good to fail more gracefully and give a more explanatory error.

Related issue:

Original Slack thread: https://bennettoxford.slack.com/archives/C01D7H9LYKB/p1726760871032629

Subsequent discussion: https://bennettoxford.slack.com/archives/C069SADHP1Q/p1726761557795209

Sentry error: https://ebm-datalab.sentry.io/issues/5874686309/

evansd commented 1 month ago

I can see some initial profiling work was done the last time the performance of this page was raised. But that's no longer representative of the current situation.

I've done some basic local profiling of loading the `run-jobs/` page with the above very large `project.yaml` file. This was using the job-server code as of this commit and was just measuring the time to return the HTML. I suspect, given the size of the DOM, there will also be issues rendering this HTML, but I haven't attempted to measure those.

Total time to return the HTML was 48 seconds on my laptop. I suspect that on the server, which is not particularly powerful and has more to do than satisfy a single request, this takes even longer, leading to the timeout errors.

The time breaks down as follows (so roughly 27s parsing the YAML, 6s validating, and 15s rendering HTML):

[Screenshot: profiling breakdown, 2024-10-01 13:21]
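For reference, a minimal sketch of one way to reproduce a comparable measurement locally is below. This is not necessarily how the profile above was produced; it assumes a job-server dev environment with the large `project.yaml` checked out and a session that is permitted to view the page.

```python
# Sketch only: one way to time HTML generation for the run-jobs page locally.
# Assumes Django settings are already loaded (e.g. run inside
# `python manage.py shell`) and the request is allowed to view the workspace.
import cProfile
import pstats

from django.test import Client

URL = "/comparing-disparities-in-rsv-influenza-and-covid-19/disparities-comparison-rsv-flu-c19/run-jobs/"


def profile_run_jobs(url=URL):
    client = Client()
    profiler = cProfile.Profile()
    profiler.enable()
    response = client.get(url)  # measures HTML generation only, not browser rendering
    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
    return response
```

(cProfile adds its own overhead, so absolute numbers will be higher than plain wall-clock measurements.)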

The slow YAML parsing time is a consequence of doing the parsing in pure Python, which we had good reason for at the time. Switching the YAML library back from ruyaml to ruamel[^1] and using its libyaml parsing backend drops the parsing time to 1.9 seconds, for an overall response time of 23 seconds:

[Screenshot: profiling breakdown after the parser switch, 2024-10-01 13:25]
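For illustration, the kind of change involved looks roughly like this (sketch only; my understanding is that `typ="safe"` selects ruamel's libyaml-backed SafeLoader whenever the optional `ruamel.yaml.clib` extension is installed, falling back to pure Python otherwise):

```python
# Sketch, assuming ruamel.yaml with the optional ruamel.yaml.clib C extension
# installed; this is the switch that gives the ~27s -> ~1.9s parsing improvement
# described above.
from ruamel.yaml import YAML

yaml = YAML(typ="safe")  # uses the C-based SafeLoader if ruamel.yaml.clib is present

with open("project.yaml") as f:
    data = yaml.load(f)  # plain dicts/lists, suitable for downstream validation
```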

The slow template rendering time is a consequence of having many slippers components in the hot loop and the high overhead of calling into each component. It may be possible to inline some of the code here, or to restructure the page in some way to reduce this.

The previous concerns about the calls out to the GitHub or OpenCodelists APIs don't seem to be particularly relevant to the overall runtime.

[^1]: ruyaml is a community "maintained" fork of ruamel, but it does not support the C-based parser. In general, Python YAML support is a bit of a shit-show.

bloodearnest commented 1 month ago

Yet another place where our current method for distributing the opensafely CLI puts limitations on us (we need a pure-Python solution).

But if it also takes 20+ seconds to parse a 6k-job `project.yaml` on every invocation of `opensafely run`, no wonder some of our users are not testing locally...

evansd commented 1 month ago

There may be some quick (if slightly dirty) wins here. One would be for the pipeline library to expose a variant of the `load_pipeline` function which took a pre-parsed structure rather than a YAML string (a rough sketch is below). Then job-server could use pyaml/ruamel.clib/whatever to parse the YAML and pass it in.

It's obviously not ideal to be using different parsers in different parts of the system. But there's nothing particularly exotic about the YAML used in pipeline files, so I can't see it would cause problems in practice.

Of course this doesn't help with running things locally, nor with the problem of submitting lots of jobs in one go. But it would (hopefully) buy us enough of a speed up that it becomes at least physically possible to run jobs from a very large project file, which seems to me the major blocker at the moment.
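A rough sketch of the shape this could take (`load_pipeline_from_parsed` is a made-up name, not an existing pipeline API):

```python
# Hypothetical sketch of the proposed split: load_pipeline_from_parsed does not
# exist in the pipeline library today; the idea is that it would reuse the
# existing validation and model-building code, skipping only the YAML parse.
from ruamel.yaml import YAML


def load_pipeline_from_parsed(data: dict):
    """Build and validate a pipeline from an already-parsed mapping."""
    raise NotImplementedError("would live in the pipeline library")


# job-server side: parse with a fast C-backed parser, then hand the data over
yaml = YAML(typ="safe")
with open("project.yaml") as f:
    parsed = yaml.load(f)

pipeline = load_pipeline_from_parsed(parsed)
```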

lucyb commented 1 month ago

Thanks for this excellent analysis, Dave. The proposed fix for the pipeline library seems like a pragmatic approach that will have some significant benefits.

I don't think there's anything we should do on Job Server in the very short term, so I'm going to remove this ticket from our board for now, but keep it open.