Opened by trxcllnt 3 weeks ago
Codecov report: patch coverage is 32.96089% with 120 lines in your changes missing coverage. Project coverage is 40.78%, comparing base (0cc0c62) to head (5f1d50e). Report is 83 commits behind head on main.
While profiling distributed build cluster performance, we found that forcing the client to fall back to local compilation is the largest contributor to overall build time. Presently this happens due to at least one bug, but also due to sub-optimal error handling in the client and scheduler.
These issues are amplified when autoscaling sccache-dist servers: the errors happen more frequently and can lead to sub-optimal autoscaling behavior, which in turn leads to more errors, and so on.
So this PR is a collection of fixes for the sccache client, scheduler, and server to better support dist-server autoscaling, as well as general improvements for tracing and debugging distributed compilation across clients, schedulers, and workers.
Adds `job_id` and `server_id` to client logs, and adds `job_id` to scheduler and server logs. This makes it significantly easier to trace build cluster failures across client and server logs.

Nit: searching through unstructured/ad-hoc log lines is difficult; using something like `structured_logger` instead of `env_logger` would also improve this experience.

## Build cluster configuration
Before diving into these changes, I should describe the architecture of the cluster for which these changes are necessary.
The cluster consists of:

- An `sccache-dist` scheduler instance, which receives forwarded connections from Traefik.
- An autoscaling group (ASG) of `sccache-dist` servers, which scale in and out based on load, and are associated with one of a fixed pool of ports on the Traefik instance. For example, if the ASG includes up to 10 instances, Traefik will open 10 ports (e.g. 10500-10509) and associate each port with a worker.

Workers are associated with and forwarded traffic from one of Traefik's open ports when they start up, and un-associated from that port when they shut down. When a new worker starts up, it could be associated with any free port, even ports previously associated with a different worker.
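For illustration only (this configuration is not part of the PR, and the entry-point names are hypothetical), the fixed pool of worker ports described above could be declared in Traefik's static configuration roughly like this:

```yaml
# Hypothetical Traefik static configuration: one entry point per worker
# slot. A worker claims a free port when it starts, and releases it when
# it shuts down; a new worker may later claim that same port.
entryPoints:
  worker0:
    address: ":10500"
  worker1:
    address: ":10501"
  # ... one entry point per slot, up to the maximum ASG size
```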
Note: while not related to this PR, this description assumes sccache has been compiled with the changes in https://github.com/mozilla/sccache/pull/1922, as that's necessary for the workers to report the `public_url` of the API Gateway instead of their private VPC address.

## Certificate handling for server scale in and out
When the server cluster goes through a cycle of scaling out, in, then out again, the new servers may be available at addresses that were previously associated with an old server. This presents a challenge for certificate handling, because the client and scheduler may have cached certificates for the initial instance, and those certs are not valid for communicating with the new instance:
In the initial state, the client and scheduler cached certificates for servers A and B. After scaling in and out again, the client and scheduler attempt and fail to use the certificates generated by server A to communicate with server C. I believe this is because the certificates for A and C both embed `127.0.0.1:10500` as their `SubjectAlternativeName`, and this confuses `reqwest`.

https://github.com/mozilla/sccache/commit/df2e4a1e7a4619db14fc23c5e85b36c7058d9fdd updates the client to track certificates by `server_id` like the server does, and updates both the client and scheduler to remove the old certificate from the certs map before adding the existing certs to the `reqwest` client builder.

## Scheduler job allocation resiliency
There's a delay between when servers scale in and when the scheduler prunes them from the list of active servers. During this window, the scheduler may attempt to allocate jobs to these servers. When this fails, the current behavior is to return an error to the client, which then falls back to a local compile.
This is sub-optimal for an autoscaling strategy: by rejecting the jobs, the scheduler pushes work back to the client, where it isn't captured by the autoscaler's load metrics.
For example, if the autoscaler scales in from 64 to 32 CPUs, and in the meantime the scheduler rejects the next 32 jobs to compile locally, the autoscaler believes it is in a steady-state rather than recognizing there are 64 units of work to handle.
At best, this leads to delays in scaling up, and at worst it can cause the autoscaler to believe it can continue to scale down.
The best solution is for the scheduler to handle the `alloc_job` failure and attempt to allocate the job to the next-best server candidate, until either the job is allocated or the candidate list is exhausted. This ensures the autoscaler will see the existing instances get busier, and stop scaling in/start scaling out again.

Example of starting a cluster with 3 initial workers, scaling down to 1, then running a distributed compile before the scheduler has pruned the dead servers:
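A minimal sketch of this fallback loop, with hypothetical names standing in for sccache's real scheduler internals (`ServerId`, `AllocError`, `try_alloc`, and the candidate ordering are all illustrative, not the actual API):

```rust
// Illustrative sketch: try each candidate server in order instead of
// failing the whole allocation when the best candidate is gone.
// All names here are hypothetical, not sccache's actual types.

#[derive(Debug, Clone, PartialEq)]
struct ServerId(String);

#[derive(Debug)]
enum AllocError {
    Unreachable(ServerId),
    NoCandidates,
}

/// Attempt allocation against each candidate until one succeeds.
/// `try_alloc` stands in for the real per-server alloc_job call.
fn alloc_with_fallback(
    candidates: &[ServerId],
    try_alloc: impl Fn(&ServerId) -> Result<(), AllocError>,
) -> Result<ServerId, AllocError> {
    for server in candidates {
        match try_alloc(server) {
            Ok(()) => return Ok(server.clone()),
            // A dead server (e.g. scaled in but not yet pruned) is
            // skipped rather than bubbling an error up to the client.
            Err(_) => continue,
        }
    }
    Err(AllocError::NoCandidates)
}

fn main() {
    let candidates = vec![
        ServerId("127.0.0.1:10500".into()), // scaled in, will fail
        ServerId("127.0.0.1:10501".into()), // alive
    ];
    let dead = ServerId("127.0.0.1:10500".into());
    let result = alloc_with_fallback(&candidates, |s| {
        if *s == dead {
            Err(AllocError::Unreachable(s.clone()))
        } else {
            Ok(())
        }
    });
    println!("{:?}", result); // the job lands on 127.0.0.1:10501
}
```

Because the loop only gives up once every candidate has been tried, the work stays visible to the autoscaler as load on the surviving servers instead of silently moving back to the client.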
Client job execution resiliency
It's also possible for a server to be taken offline while it's running jobs for clients. In this scenario the scheduler's `alloc_job` call succeeds while the worker is still alive, but the worker is destroyed while the client is waiting on the `run_job` response.

To avoid the expensive local compilation, the client should handle the failure and retry the job on a new server assigned by the scheduler. When combined with the feature described in the previous section, the scheduler should reallocate the job to an alive server.
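A minimal sketch of this client-side retry, again with hypothetical names (`request_alloc`, `run_job`, and `MAX_RETRIES` are illustrative stand-ins, not sccache's actual client API):

```rust
// Illustrative sketch: if run_job fails because the assigned worker
// disappeared, ask the scheduler for a fresh allocation and retry,
// falling back to a local compile only after the retries are spent.
// All names here are hypothetical.

const MAX_RETRIES: usize = 3;

fn compile_distributed(
    mut request_alloc: impl FnMut() -> Result<u16, String>, // returns a server port
    run_job: impl Fn(u16) -> Result<String, String>,
) -> Result<String, String> {
    let mut last_err = String::from("no attempts made");
    for _ in 0..MAX_RETRIES {
        // Each attempt gets a fresh allocation, so a worker that shut
        // down mid-job doesn't force an immediate local fallback.
        let server = request_alloc()?;
        match run_job(server) {
            Ok(output) => return Ok(output),
            Err(e) => last_err = e,
        }
    }
    // The caller treats this as "fall back to local compilation".
    Err(last_err)
}

fn main() {
    // Simulate: the first allocation lands on a worker that dies
    // mid-job; the retry lands on a healthy one.
    let mut allocs = vec![10501u16, 10500u16]; // popped from the back
    let result = compile_distributed(
        move || allocs.pop().ok_or_else(|| "no servers".to_string()),
        |port| {
            if port == 10500 {
                Err("connection reset during run_job".into())
            } else {
                Ok(format!("compiled on {}", port))
            }
        },
    );
    println!("{:?}", result); // Ok("compiled on 10501")
}
```

Note that an allocation failure still propagates immediately (the scheduler has nothing to offer), while a `run_job` failure triggers a re-allocation, which pairs naturally with the scheduler-side fallback described in the previous section.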
Example client logs when a worker shuts down during `run_job` and the client retries: