mozilla / sccache

Sccache is a ccache-like tool. It is used as a compiler wrapper and avoids compilation when possible. It can cache results in remote storage, including various cloud storage backends, or in local storage.
Apache License 2.0

[do not merge] Ensure client and scheduler are resilient to server autoscaling #2277

Open trxcllnt opened 3 weeks ago

trxcllnt commented 3 weeks ago

While profiling distributed build cluster performance, I found that forcing the client to fall back to local compilation is the largest contributor to overall build time. Presently this happens due to at least one bug, but also due to sub-optimal error handling in the client and scheduler.

These issues are amplified when autoscaling sccache-dist servers: the errors happen more frequently, they can cause sub-optimal autoscaling behavior, which in turn produces more errors, and so on.

So this PR is a collection of fixes for the sccache client, scheduler, and server to better support dist-server autoscaling, as well as general improvements for tracing and debugging distributed compilation across clients, schedulers, and workers.

Build cluster configuration

Before diving into these changes, I should describe the architecture of the cluster for which these changes are necessary.

  1. A Traefik API Gateway to terminate SSL and expose a single endpoint for clients. This could be any API Gateway/LB/router, I just like Traefik.
  2. An sccache-dist scheduler instance, which receives forwarded connections from Traefik.
  3. An autoscaling group of sccache-dist servers, which scale in and out based on load, and are associated with one of a fixed pool of ports on the Traefik instance. For example, if the ASG includes up to 10 instances, Traefik will open 10 ports (e.g. 10500-10509) and associate each port with a worker.

Workers are associated with and forwarded traffic from one of Traefik's open ports when they start up, and un-associated with that port when they shut down. When a new worker starts up, it could be associated with any free port, even ports previously associated with a different worker.
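The port pool described above can be sketched as follows. This is an illustrative model (not Traefik's or sccache's actual code): a fixed range of ports, where a starting worker acquires any free port and a stopping worker releases it, so a port freed by one worker may later be handed to a completely different worker.

```rust
use std::collections::BTreeSet;

/// Hypothetical sketch of the gateway's fixed port pool.
struct PortPool {
    free: BTreeSet<u16>,
}

impl PortPool {
    /// A pool of `count` ports starting at `start`, e.g. 10500-10509.
    fn new(start: u16, count: u16) -> Self {
        Self { free: (start..start + count).collect() }
    }

    /// Associate a starting worker with the lowest free port, if any.
    fn acquire(&mut self) -> Option<u16> {
        let port = *self.free.iter().next()?;
        self.free.remove(&port);
        Some(port)
    }

    /// Un-associate a shut-down worker's port, making it reusable.
    fn release(&mut self, port: u16) {
        self.free.insert(port);
    }
}

fn main() {
    let mut pool = PortPool::new(10500, 10);
    let a = pool.acquire().unwrap(); // worker A gets 10500
    let b = pool.acquire().unwrap(); // worker B gets 10501
    pool.release(a);                 // worker A shuts down
    let c = pool.acquire().unwrap(); // worker C reuses 10500
    println!("{a} {b} {c}");
}
```

This reuse of freed ports is exactly what makes the certificate handling in the next section tricky.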

Note: While this PR isn't related, this description assumes sccache has been compiled with the changes in https://github.com/mozilla/sccache/pull/1922, as that's necessary for the workers to report the public_url of the API Gateway instead of their private VPC address.

Certificate handling for server scale in and out

When the server cluster goes through a cycle of scaling out, in, then out again, the new servers may be available at addresses that were previously associated with an old server. This presents a challenge for certificate handling, because the client and scheduler may have cached certificates for the initial instance, and those certs are not valid for communicating with the new instance:

# scale out to two instances:
127.0.0.1:10500 - server A
127.0.0.1:10501 - server B

# scale in to one instance:
127.0.0.1:10501 - server B

# scale out to three instances:
127.0.0.1:10500 - server C
127.0.0.1:10501 - server B
127.0.0.1:10502 - server D

In the initial state, the client and scheduler cached certificates for servers A and B. After scaling in and out again, the client and scheduler attempt and fail to use the certificates generated by server A to communicate with server C. I believe this is because the certificates for A and C both embed 127.0.0.1:10500 as their SubjectAlternativeName, and this confuses reqwest.

https://github.com/mozilla/sccache/commit/df2e4a1e7a4619db14fc23c5e85b36c7058d9fdd updates the client to track certificates by server_id like the server does, and updates both the client and scheduler to remove the old certificate from the certs map before adding the existing certs to the reqwest client builder.
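The key idea of that commit can be sketched like this. The types and names below are illustrative, not sccache's actual ones: certificates are keyed by server address, and when a new worker announces a certificate for an address that previously belonged to a dead worker, the stale entry is dropped before the HTTP client is rebuilt from the remaining certs, so the TLS stack never sees two certificates claiming the same SubjectAlternativeName.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Illustrative sketch: one certificate per server_id.
struct CertStore {
    certs: HashMap<SocketAddr, Vec<u8>>, // server_id -> PEM bytes
}

impl CertStore {
    fn new() -> Self {
        Self { certs: HashMap::new() }
    }

    /// Drop any stale cert for this server_id, store the new one, and
    /// return the full set of certs the rebuilt client should trust.
    fn update(&mut self, server_id: SocketAddr, cert_pem: Vec<u8>) -> Vec<&[u8]> {
        // Remove the old certificate first, so a cert from a dead worker
        // at the same address never reaches the client builder.
        self.certs.remove(&server_id);
        self.certs.insert(server_id, cert_pem);
        self.certs.values().map(|c| c.as_slice()).collect()
    }
}
```

In the real code the returned set would be fed into something like reqwest's client builder; the point of the sketch is only the remove-before-rebuild ordering.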

Scheduler job allocation resiliency

There's a delay between when servers scale in and when the scheduler prunes them from the list of active servers. During this window, the scheduler may attempt to allocate jobs to servers that no longer exist. When this fails, the current behavior is to return an error telling the client to run a local compile.

This is sub-optimal for an autoscaling strategy: because the rejected jobs are compiled locally, the extra work pushed back onto the client is invisible to the autoscaler.

For example, if the autoscaler scales in from 64 to 32 CPUs, and in the meantime the scheduler rejects the next 32 jobs to compile locally, the autoscaler believes it is in a steady-state rather than recognizing there are 64 units of work to handle.

At best, this leads to delays in scaling up, and at worst it can cause the autoscaler to believe it can continue to scale down.

The best solution is for the scheduler to handle the alloc_job failure and attempt to allocate to the next-best server candidate, until either the job is allocated or the candidate list is exhausted. This ensures the autoscaler will see the existing instances get busier, and stop scaling in/start scaling out again.
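The fallback loop can be sketched as below. This is a simplified model, not the actual scheduler code: `try_assign` stands in for the HTTPS assign_job call, and candidates are tried best-first until one accepts the job or the list is exhausted.

```rust
/// Illustrative sketch of alloc_job fallback: instead of bubbling the
/// first assign_job failure up to the client, walk the candidate list
/// until a server accepts the job or the list is exhausted.
fn alloc_with_fallback<S: Copy>(
    candidates: &[S],
    mut try_assign: impl FnMut(S) -> Result<(), String>,
) -> Result<S, String> {
    let mut last_err = String::from("no server candidates");
    for &server in candidates {
        match try_assign(server) {
            // Job allocated; the autoscaler sees this server get busier.
            Ok(()) => return Ok(server),
            // e.g. the server was scaled in but not yet pruned:
            // fall through to the next-best candidate.
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}
```

Only when every candidate fails does the error reach the client, which matches the log excerpt below: assign_job to the dead servers fails with a certificate error, and the job lands on the surviving server.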

Example of starting a cluster with 3 initial workers, scaling down to 1, then running a distributed compile before the scheduler has pruned the dead servers:
$ docker compose up -d --scale worker=3
# ... wait till cluster is up
$ docker compose up -d --scale worker=1
$ sccache ...
[INFO  sccache::dist::http::server] Scheduler listening for clients on 0.0.0.0:80
[INFO  sccache::dist::http::server] Adding new certificate for 172.18.0.2:10500 to scheduler
[INFO  sccache_dist] Registered new server ServerId(172.18.0.2:10500)
[INFO  sccache::dist::http::server] Adding new certificate for 172.18.0.2:10501 to scheduler
[INFO  sccache_dist] Registered new server ServerId(172.18.0.2:10501)
[INFO  sccache::dist::http::server] Adding new certificate for 172.18.0.2:10502 to scheduler
[INFO  sccache_dist] Registered new server ServerId(172.18.0.2:10502)
[WARN  sccache_dist] [alloc_job(0)]: POST to scheduler assign_job failed, caused by: error sending request for url (https://172.18.0.2:10502/api/v1/distserver/assign_job/0), caused by: client error (Connect), caused by: error:0A000086:SSL routines:tls_post_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:2091: (self-signed certificate), caused by: error:0A000086:SSL routines:tls_post_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:2091:
[INFO  sccache_dist] [alloc_job(0)]: Job created and assigned to server 172.18.0.2:10501 with state Ready
[WARN  sccache_dist] [alloc_job(1)]: POST to scheduler assign_job failed, caused by: error sending request for url (https://172.18.0.2:10500/api/v1/distserver/assign_job/1), caused by: client error (Connect), caused by: error:0A000086:SSL routines:tls_post_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:2091: (self-signed certificate), caused by: error:0A000086:SSL routines:tls_post_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:2091:
[INFO  sccache_dist] [alloc_job(2)]: Job created and assigned to server 172.18.0.2:10501 with state Ready
[INFO  sccache_dist] [alloc_job(1)]: Job created and assigned to server 172.18.0.2:10501 with state Ready
[INFO  sccache_dist] [alloc_job(3)]: Job created and assigned to server 172.18.0.2:10501 with state Ready
[INFO  sccache_dist] [update_job_state(0, 172.18.0.2:10501)]: Job state updated from Ready to Started
[INFO  sccache_dist] [update_job_state(2, 172.18.0.2:10501)]: Job state updated from Ready to Started
[INFO  sccache_dist] [update_job_state(1, 172.18.0.2:10501)]: Job state updated from Ready to Started
[INFO  sccache_dist] [update_job_state(3, 172.18.0.2:10501)]: Job state updated from Ready to Started
[INFO  sccache_dist] [update_job_state(0, 172.18.0.2:10501)]: Job state updated from Started to Complete
[INFO  sccache_dist] [update_job_state(2, 172.18.0.2:10501)]: Job state updated from Started to Complete
[INFO  sccache_dist] [update_job_state(1, 172.18.0.2:10501)]: Job state updated from Started to Complete
[INFO  sccache_dist] [update_job_state(3, 172.18.0.2:10501)]: Job state updated from Started to Complete
[WARN  sccache_dist] Server 172.18.0.2:10500 appears to be dead, pruning it in the scheduler
[WARN  sccache_dist] Server 172.18.0.2:10502 appears to be dead, pruning it in the scheduler

Client job execution resiliency

It's also possible for a server to be taken offline while it's running jobs for clients. In this scenario, the scheduler's alloc_job succeeds while the worker is still alive, but the worker is destroyed while the client is waiting on the run_job response.

To avoid the expensive local compilation, the client should handle the failure and allow retrying the job on a new server assigned by the scheduler. When combined with the feature described in the previous section, the scheduler should reallocate the job on an alive server.
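The client-side retry can be sketched as follows. The names here are illustrative rather than sccache's actual API; `retry_limit` corresponds to the `SCCACHE_DIST_RETRY_LIMIT` setting in the example below, and `attempt_once` stands in for one full allocate-and-run cycle against a freshly assigned server.

```rust
/// Illustrative sketch of client job retry: each attempt requests a fresh
/// allocation from the scheduler and runs the job there; a run_job failure
/// (e.g. the worker was destroyed mid-build) triggers a retry on a newly
/// assigned server instead of an immediate local-compile fallback.
fn run_dist_with_retries<T>(
    retry_limit: u32,
    mut attempt_once: impl FnMut(u32) -> Result<T, String>,
) -> Result<T, String> {
    let max_attempts = retry_limit + 1; // first try plus `retry_limit` retries
    let mut last_err = String::new();
    for attempt in 1..=max_attempts {
        match attempt_once(attempt) {
            Ok(res) => return Ok(res),
            // Retry: the scheduler's fallback should pick a live server.
            Err(e) => last_err = e,
        }
    }
    Err(last_err) // exhausted: caller falls back to local compilation
}
```

With `retry_limit = 5` this gives the "attempt 1 of 6" numbering seen in the logs below.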

Example client logs when worker shuts down during run_job, and client retries:
$ docker compose up -d --scale worker=10
# ... wait till cluster is up
$ SCCACHE_DIST_RETRY_LIMIT=5 sccache ...
$ docker compose up -d --scale worker=1
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Run distributed compilation (attempt 1 of 6)
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Creating distributed compile request
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Identifying dist toolchain for "/usr/local/cuda/bin/../nvvm/bin/cicc"
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Requesting allocation
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Successfully allocated job 2
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Running job 2 on server 172.18.0.2:10504
[WARN  sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Error running distributed compilation (attempt 1 of 6), retrying. Could not run distributed compilation job on 172.18.0.2:10504: error sending request for url (https://172.18.0.2:10504/api/v1/distserver/run_job/2): client error (SendRequest): connection closed before message completed
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Run distributed compilation (attempt 2 of 6)
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Creating distributed compile request
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Identifying dist toolchain for "/usr/local/cuda/bin/../nvvm/bin/cicc"
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Requesting allocation
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Successfully allocated job 21
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Running job 21 on server 172.18.0.2:10500
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Fetched [("/tmp/sccache_nvcc5K0cEY/0.cudafe1.c", "Size: 209->174"), ("/tmp/sccache_nvcc5K0cEY/1.cudafe1.stub.c", "Size: 1429->635"), ("/tmp/sccache_nvcc5K0cEY/simpleP2P.compute_50.cudafe1.gpu", "Size: 25849->3584"), ("/tmp/sccache_nvcc5K0cEY/simpleP2P.compute_50.ptx", "Size: 874->445")]
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Compiled in 3.963 s, storing in cache
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Created cache artifact in 0.006 s
[DEBUG sccache::server] [simpleP2P.compute_50.ptx]: compile result: cache miss
[DEBUG sccache::server] [simpleP2P.compute_50.ptx]: CompileFinished retcode: exit status: 0
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Stored in cache successfully!
codecov-commenter commented 3 weeks ago

Codecov Report

Attention: Patch coverage is 32.96089% with 120 lines in your changes missing coverage. Please review.

Project coverage is 40.78%. Comparing base (0cc0c62) to head (5f1d50e). Report is 83 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/compiler/compiler.rs | 38.51% | 28 Missing and 55 partials :warning: |
| src/dist/http.rs | 0.00% | 21 Missing :warning: |
| src/server.rs | 0.00% | 4 Missing and 7 partials :warning: |
| src/compiler/rust.rs | 0.00% | 3 Missing :warning: |
| src/util.rs | 50.00% | 2 Missing :warning: |

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #2277      +/-   ##
==========================================
+ Coverage   30.91%   40.78%   +9.87%
==========================================
  Files          53       55       +2
  Lines       20112    20978     +866
  Branches     9755     9677      -78
==========================================
+ Hits         6217     8556    +2339
- Misses       7922     8247     +325
+ Partials     5973     4175    -1798
```
