opensafely-core / job-server

A server for mediating jobs that can be run in an OpenSAFELY secure environment. q.v. job-runner
https://jobs.opensafely.org

Project Detail page: remove concurrency, improve performance by making a single request #4634

Open mikerkelly opened 3 weeks ago

mikerkelly commented 3 weeks ago

The ProjectDetailView uses Python's concurrent.futures module to parallelise GitHub API requests made through github.get_repo_is_private(). This is unnecessarily complicated and makes the code harder to reason about and maintain (ref #4591). It's also inefficient to issue a separate request for each item, as the per-request and API endpoint overheads are higher than those of a single bulk request. #3393 noted this code is inefficient.

The more typical approach would be for the github module to provide a method that does a bulk request with a single API access (paginated/chunked if needed due to a maximum page size). That would let us eliminate this idiosyncratic use of concurrency. Per #4685, is a simple request to https://api.github.com/orgs/opensafely/repos sufficient? Looking in the github module (our thin wrapper around requests and the GitHub API), functions like get_repos_with_branches and get_repos_with_dates already follow a cursor-query approach with GraphQL API accesses, and get_repos_with_status_and_url already exists and returns repos with their privacy.
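To make the comparison concrete, here is a rough sketch of what a bulk REST helper could look like. The helper names, the response handling, and the pagination details are assumptions for illustration, not the existing github.py API (which uses an authenticated Session, not urllib).

```python
# Sketch only: a possible bulk helper built on the REST "list org repos"
# endpoint. Names and pagination details are illustrative assumptions.
import json
from urllib.request import Request, urlopen

ORG_REPOS_URL = "https://api.github.com/orgs/opensafely/repos"


def private_flags(repos):
    """Map a list of repo JSON objects to {name: is_private}."""
    return {repo["name"]: repo["private"] for repo in repos}


def get_org_repo_privacy(token, url=ORG_REPOS_URL, per_page=100):
    """Fetch all org repos, one page per request, and return {name: is_private}."""
    flags = {}
    page = 1
    while True:
        req = Request(
            f"{url}?per_page={per_page}&page={page}",
            headers={"Authorization": f"Bearer {token}"},
        )
        with urlopen(req) as resp:
            repos = json.load(resp)
        if not repos:  # empty page signals the end of the listing
            return flags
        flags.update(private_flags(repos))
        page += 1
```

One request per hundred repos replaces one request per repo, and the view can then do a dictionary lookup per workspace.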

Generally, we should avoid concurrency unless we actually need it, to keep things simple. This is the only use of the concurrent module in the entire repo. It looks like #1562 introduced it to parallelize the requests to GitHub, but did not consider restructuring to make a single request for many repos rather than many requests, each for a single repo. The other two callers of github.get_repo_is_private() (the WorkspaceDetail and SignOffRepo views) use it for a single repo, which is reasonable.
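For contrast, the fan-out pattern in question looks roughly like this. This is a simplified sketch, not the actual view code; get_repo_is_private here is a stand-in for the real github.get_repo_is_private, which makes one GitHub API request per repo.

```python
# Simplified sketch of the per-repo fan-out being questioned: one API
# call per repo, parallelised with concurrent.futures.
from concurrent.futures import ThreadPoolExecutor


def get_repo_is_private(org, repo):
    # Stand-in for jobserver.github.get_repo_is_private, which would
    # make one HTTP request to the GitHub API per call.
    return repo.startswith("private-")


def privacy_by_repo(org, repos):
    """One thread per repo, one request per thread."""
    with ThreadPoolExecutor() as executor:
        flags = executor.map(lambda r: get_repo_is_private(org, r), repos)
    return dict(zip(repos, flags))
```

The threads hide the latency of N requests but don't remove the N-requests cost, which is what the bulk approach would do.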

We might consider refactoring the common GraphQL code in github.py alongside this, or put that in a separate issue. That refactor could take inspiration from the approach in the metrics codebase, though note that is a CLI tool, so the desired behaviour may differ. It may also be that the GraphQL code in github.py is itself written inefficiently.
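If the cursor loops in get_repos_with_branches and get_repos_with_dates are near-duplicates, one shape for the refactor is a generic cursor-pagination helper. This is a sketch under that assumption; the helper name and callback signature are illustrative, mirroring GraphQL's pageInfo.endCursor / hasNextPage convention.

```python
def paginate(fetch_page):
    """Yield items from a cursor-paginated API.

    `fetch_page(cursor)` must return (items, next_cursor), where
    next_cursor is None on the last page. Each query-specific function
    would then only supply its own GraphQL query and result parsing.
    """
    cursor = None
    while True:
        items, cursor = fetch_page(cursor)
        yield from items
        if cursor is None:
            return
```

The per-query functions shrink to a query string plus a parser, and the looping logic lives in one place.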

If you work on this ticket, probably start with quick prototyping of the different ways of getting the information we need from the API, to compare performance. Also consider restructuring the view.

#3401 added some telemetry to diagnose #3393. That could be removed if this ticket succeeds in improving performance.

bloodearnest commented 4 days ago

It's also inefficient to open a separate connection for each item as the connection and API endpoint overheads are higher than a bulk request. https://github.com/opensafely-core/job-server/issues/3393 noted this code is inefficient.

I don't think this section is quite right: the client uses a persistent requests Session object to maintain a per-process global connection pool: https://github.com/opensafely-core/job-server/blob/main/jobserver/github.py#L19

So TLS connections to api.github.com should be long-lived and reused. It may be that the default pool size needs increasing, or other config tweaking, to handle projects with lots of workspaces.
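If the pool size does turn out to matter, the tweak is a mount of a larger HTTPAdapter on the shared Session. The numbers below are illustrative, not a recommendation:

```python
# Sketch: raise the connection pool limits on a shared requests Session.
# Pool sizes here are illustrative; requests' defaults are 10 and 10.
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=50)
session.mount("https://", adapter)
```

pool_maxsize caps how many connections to a single host (here api.github.com) are kept in the pool, which is the limit a burst of parallel per-repo requests would hit.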

This is unrelated to the larger question of preferring bulk API endpoints, but possibly a useful detail. It's a pattern we use in other service clients too, so I think it's worth calling out.