Problems to be addressed by the architectural decisions made in Q2:
- Executors admin page is poor quality
- Executors cannot target multiple queues (may be wasteful of resources as we continue to grow users of this machinery)
- Terraform modules are burdensome: the site admin has to do a lot of additional work and is required to use Terraform unless they want to rewrite everything in their own infra
- Ignite has low support and occasionally has issues
  - Like deadlocking
  - We've identified some performance issues that might affect customer experience (the VM startup lock)
  - There's currently no capacity or owner to look into inlining the functionality so we can interface with Firecracker more directly (we've tried in hack sessions, but one hour every two weeks or so doesn't cut it)
- Auto scaling is not working properly
  - It is relatively slow to respond
  - In-flight jobs are getting canceled (a problem for billing, too)
  - No indication in the UI that a job can be worked on until an executor is up
- Performance issues
  - The run time for commands that don't have a long run time themselves is increased by a lot
  - Customer feedback validated this
  - Reasons:
    - Ignite startup/shutdown
    - Not a lot of caching along the way
Our initial implementation of executors is mostly based on what we already had: a repo with Terraform files to manage our resources, augmented by configuration changes that need to be applied to the worker and Prometheus deployments. As of today, executors are probably among the hardest components to set up in the Sourcegraph architecture. When we started the project, that all made total sense: the first goal was to get something going just on k8s to prove that it could work. Later, we needed to quickly make it available for customers as well, and a quick solution was to publish our internal Terraform definitions as a Terraform module and add some docs on how we set executors up for ourselves.
I don't think that is what the ideal solution should look like in the future. Ideally, executors are part of pretty much every Sourcegraph deployment, so that all features built on them are generally and widely available. Therefore, we need to make some changes to how they get deployed.
This is a collection of ideas I have for this problem space:
They are based on the assumption that
Hence, we need to support both cases well, while keeping the maintenance burden as low as possible.
Instead of a strictly pull-based workflow (that made sense in the past, given it's just a simple adaptor on top of the dbworker), we could introduce an internal service that monitors the queues and then distributes work as needed. Think of it as an augmented dbworker in the `worker` instance. This would be part of the standard deployment of a Sourcegraph instance.

This service would then be responsible for the following:

Things this architecture should solve:
How this could be implemented:
Those could probably be implemented with a relatively simple generic interface that the dbworker handler calls out to, depending on its config:
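As a rough illustration, here's a minimal sketch in Go of what that generic interface could look like. All names (`ExecutorProvider`, `ExecutorResource`, `Acquire`, `Release`, `Job`) are hypothetical and not existing Sourcegraph code; the real shape would depend on how the dbworker handler ends up being wired in.

```go
package scheduler

import "context"

// Job is a placeholder for a queued execution job (e.g. a batch change step
// or an auto-indexing task) that the dbworker hands to the scheduler.
type Job struct {
	ID    int
	Queue string
}

// ExecutorResource is a handle to whatever compute a provider spun up for a
// job: a GCP instance, an AWS instance, a Kubernetes Job, etc.
type ExecutorResource interface {
	// Name returns a provider-specific identifier, useful for logging and
	// for cleaning up orphaned resources.
	Name() string
}

// ExecutorProvider is the generic interface the dbworker handler could call
// out to. Which implementation is used would be selected by site config.
type ExecutorProvider interface {
	// Acquire provisions compute for the given job and returns once it is
	// ready to pick up work (or the context is cancelled).
	Acquire(ctx context.Context, job Job) (ExecutorResource, error)
	// Release tears the resource down once the job has finished or failed.
	Release(ctx context.Context, res ExecutorResource) error
}
```

The dbworker handler would pick a concrete provider (GCP, AWS, Kubernetes, ...) based on site configuration, call `Acquire` before handing a job off, and call `Release` once the job completes or fails.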
TODO: This approach suggests that it's simplest to have a 1-1 relation between execution job and provider resource, i.e. one GCP compute instance for one job. That might be desirable for some use cases, but we should keep in mind that this adds the overhead of spawning and terminating a VM per job. Also, we need to think about how we can make things like the very important Docker registry mirror still fit into this model.
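To make that trade-off concrete, here's a hedged sketch of a GCP provider implementing the hypothetical interface above. The project, zone, machine type, and image values are placeholders, and a real implementation would also need to wait for the executor on the instance to boot, handle quota and preemption, and clean up orphaned instances.

```go
package scheduler

import (
	"context"
	"fmt"

	compute "google.golang.org/api/compute/v1"
)

// gcpProvider provisions one Compute Engine instance per job, matching the
// 1-1 model described above. The per-job Insert/Delete calls are exactly
// where the spawn/terminate overhead shows up.
type gcpProvider struct {
	svc     *compute.Service
	project string
	zone    string
	image   string // executor VM image, placeholder
}

type gcpResource struct{ name string }

func (r gcpResource) Name() string { return r.name }

func (p *gcpProvider) Acquire(ctx context.Context, job Job) (ExecutorResource, error) {
	name := fmt.Sprintf("executor-%s-%d", job.Queue, job.ID)
	inst := &compute.Instance{
		Name:        name,
		MachineType: fmt.Sprintf("zones/%s/machineTypes/n1-standard-4", p.zone),
		Disks: []*compute.AttachedDisk{{
			Boot:       true,
			AutoDelete: true,
			InitializeParams: &compute.AttachedDiskInitializeParams{SourceImage: p.image},
		}},
		NetworkInterfaces: []*compute.NetworkInterface{{Network: "global/networks/default"}},
	}
	if _, err := p.svc.Instances.Insert(p.project, p.zone, inst).Context(ctx).Do(); err != nil {
		return nil, err
	}
	// A real implementation would poll here until the instance (and the
	// executor binary on it) reports ready before returning.
	return gcpResource{name: name}, nil
}

func (p *gcpProvider) Release(ctx context.Context, res ExecutorResource) error {
	_, err := p.svc.Instances.Delete(p.project, p.zone, res.Name()).Context(ctx).Do()
	return err
}
```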
With the Kubernetes adapter, we could potentially even have zero-config executors for those who don't care heavily about isolation. We can still add NetworkPolicies around those instances so we keep internal APIs internal, and have a service account as part of the k8s deployment that gives the worker pod the required permissions.
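A similarly hedged sketch of what the Kubernetes adapter could look like, running each execution as a Kubernetes `Job` via client-go. The `executor` service account, the `app: executor` label (which a NetworkPolicy could select on), and the `EXECUTOR_QUEUE_NAME` environment variable are illustrative assumptions, not a finished design.

```go
package scheduler

import (
	"context"
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// k8sProvider runs each execution job as a Kubernetes Job in the same
// cluster as the Sourcegraph deployment, using a dedicated service account
// so the pods only get the permissions they need.
type k8sProvider struct {
	client    kubernetes.Interface
	namespace string
	image     string // executor image to run, placeholder
}

type k8sResource struct{ name string }

func (r k8sResource) Name() string { return r.name }

func (p *k8sProvider) Acquire(ctx context.Context, job Job) (ExecutorResource, error) {
	name := fmt.Sprintf("executor-%s-%d", job.Queue, job.ID)
	k8sJob := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name, Labels: map[string]string{"app": "executor"}},
		Spec: batchv1.JobSpec{
			Template: corev1.PodTemplateSpec{
				// The "app: executor" label lets a NetworkPolicy restrict what
				// these pods can talk to, keeping internal APIs internal.
				ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "executor"}},
				Spec: corev1.PodSpec{
					ServiceAccountName: "executor",
					RestartPolicy:      corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "executor",
						Image: p.image,
						Env:   []corev1.EnvVar{{Name: "EXECUTOR_QUEUE_NAME", Value: job.Queue}},
					}},
				},
			},
		},
	}
	if _, err := p.client.BatchV1().Jobs(p.namespace).Create(ctx, k8sJob, metav1.CreateOptions{}); err != nil {
		return nil, err
	}
	return k8sResource{name: name}, nil
}

func (p *k8sProvider) Release(ctx context.Context, res ExecutorResource) error {
	return p.client.BatchV1().Jobs(p.namespace).Delete(ctx, res.Name(), metav1.DeleteOptions{})
}
```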
I don't think we can ever support docker-compose setups for an OOTB executor though, given docker-compose is a single machine deployment and executors are usually doing some pretty heavy lifting.