Open evansd opened 2 months ago
I can see some initial profiling work was done the last time the performance of this page was raised. But that's no longer representative of the current situation.
I've done some basic profiling locally of loading the run-jobs/
page with the above very large project.yaml
file. This was using the job-server code as of this commit and was just measuring the time to return HTML. I suspect, given the size of the DOM, there will also be issues rendering this HTML but I haven't attempted to measure those.
Total time to return HTML was 48 seconds on my laptop. I suspect that on the server, which is not particularly powerful and has more to do than satisfy a single request, this takes even longer leading to the timeout errors.
The time breaks down as follows (so roughly 27s parsing the YAML, 6s validating and 15s rendering HTML):
The slow YAML parsing time is a consequence of doing the parsing in pure Python, which we had good reason for at the time. Switching the YAML library back from ruyaml
to ruamel
[^1] and using its libyaml
parsing backend drops the parsing time to 1.9 seconds for an overall response time of 23 seconds:
The slow template rendering time is a consequence of having many slippers components in the hot loop and the high overhead of calling in to each component. It may be possible to inline some of the code here, or restructure the page in some way to reduce this.
The previous concerns about the calls out to the Github or OpenCodelists APIs don't seem to be particularly relevant to the overall runtime.
[^1]: ruyaml
is a community "maintained" fork of ruamel
, but it does not support the C-based parser. In general, Python YAML support is a bit of a shit-show.
Yet another place were our current method for distributing opensafely cli puts limitations on us (we need a pure python solution).
But if it also takes 20+s to parse parsing a 6k job project.yaml on every invocation of opensafely run
, no wonder some of our users are not testing locally...
There may be some quick (if slightly dirty) wins here. One would be for the pipeline
library to expose a variant of the load_pipeline
function which took a pre-parsed structure rather than a YAML string. Then job-server could use pyaml
/ruamel.clib
/whatever to parse the YAML and pass it in.
It's obviously not ideal to be using different parsers in different parts of the system. But there's nothing particularly exotic about the YAML used in pipeline files, so I can't see it would cause problems in practice.
Of course this doesn't help with running things locally, nor with the problem of submitting lots of jobs in one go. But it would (hopefully) buy us enough of a speed up that it becomes at least physically possible to run jobs from a very large project file, which seems to me the major blocker at the moment.
Thanks for this excellent analysis, Dave. The proposed fix for the pipeline library seems like a pragmatic approach that will have some significant benefits.
I don't think there's anything we should do on Job Server in the very short term, so I'm going to remove this ticket from our board for now, but keep it open.
A user reported getting the big blue Dokku error page:
when trying to access this URL: http://jobs.opensafely.org/comparing-disparities-in-rsv-influenza-and-covid-19/disparities-comparison-rsv-flu-c19/run-jobs/
The cause seems to be this very large (programatically generated)
project.yaml
file with 6,874 actions in it: https://github.com/opensafely/disparities-comparison/blob/6cba2f6f1609f26dc4ee5698f51afcc7e9be3840/project.yamlIn the short term, the user has been advised to reduce the number of actions to the minimum needed to get started with the project.
It's possible that we decide we just don't support this many actions (the user expressed surprise at the number, so possibly they don't actually need all of them in any case). But if so it would be good to fail more gracefully and give a more explanatory error.
Related issue:
Original Slack thread: https://bennettoxford.slack.com/archives/C01D7H9LYKB/p1726760871032629
Subsequent discussion: https://bennettoxford.slack.com/archives/C069SADHP1Q/p1726761557795209
Sentry error: https://ebm-datalab.sentry.io/issues/5874686309/