Multiple Build Clusters

jithine commented 6 years ago

Context

This applies to a scalable Screwdriver instance that uses executor-queue.

Screwdriver builds are executed via a single queue worker, which reads job configurations from the queue and uses the appropriate executor to run the build.

This model can be expanded so that Screwdriver sends build information to multiple build clusters, which provides the following capabilities:

Allows running at least two build clusters, so that one build cluster can be taken down for critical infrastructure updates.
Enables Screwdriver users to bring their own build clusters, so that they can use specialized hardware for specific needs.
Enables Screwdriver to support a build cluster in AWS that can leverage EC2 cloud compute resources.

Objective

Create a Scheduler service which fronts the Screwdriver build queue
1. Workers will no longer connect directly to the queue to get new builds.
2. The Scheduler service is responsible for distributing builds to Workers running in different build clusters.
3. The Scheduler service should be able to authorize a Worker service and expose only the job information that applies to that Worker's build cluster.
4. The Scheduler should create one build queue per registered build cluster.
Register a build cluster
1. A cluster admin should be able to register their build cluster with Screwdriver. Registration should take in the cluster name, the name of the queue to create for the cluster, an authorization mechanism (e.g. a JWT public key for the cluster), and the SCM organizations that pipelines must belong to in order to use the cluster.
Implicit and Explicit build clusters
1. Users may or may not name a build cluster in their Screwdriver pipeline config.
2. If the user has not specified a build cluster (implicit), the Screwdriver API should assign one of the generic (default supported) build clusters applicable to the build.
3. If the user has specified a build cluster (explicit), the Screwdriver API should enqueue the job in that cluster's build queue.
  1. If it is a specialized (customer-provided) build cluster, the Screwdriver API must validate that the pipeline owner is authorized to run builds on it. This authorization can be tied to the SCM organizations captured when the cluster was registered.
Document design at https://github.com/screwdriver-cd/screwdriver/tree/master/design
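
To make the implicit/explicit rules above concrete, here is a minimal Python sketch of cluster selection. It is not taken from the Screwdriver codebase: `BuildCluster`, `is_generic`, `pick_build_cluster`, and the two exception classes are hypothetical names, and picking the first generic cluster by name is just one deterministic way to read "any of the generic build clusters".

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class BuildCluster:
    """Hypothetical registration record, modelled on the fields listed above."""
    name: str
    queue_name: str
    scm_organizations: frozenset = frozenset()  # orgs allowed to use the cluster
    is_generic: bool = True  # assumption: generic clusters are open to every org


class UnknownCluster(Exception):
    pass


class ClusterNotAuthorized(Exception):
    pass


def pick_build_cluster(clusters: Iterable[BuildCluster],
                       pipeline_org: str,
                       requested: Optional[str] = None) -> BuildCluster:
    clusters = list(clusters)
    if requested is None:
        # Implicit: fall back to a generic cluster (first by name, for determinism).
        generic = sorted((c for c in clusters if c.is_generic), key=lambda c: c.name)
        if not generic:
            raise UnknownCluster("no generic build cluster is registered")
        return generic[0]
    # Explicit: the named cluster must exist...
    by_name = {c.name: c for c in clusters}
    cluster = by_name.get(requested)
    if cluster is None:
        raise UnknownCluster(f"build cluster {requested!r} is not registered")
    # ...and a specialized cluster is limited to its registered SCM organizations.
    if not cluster.is_generic and pipeline_org not in cluster.scm_organizations:
        raise ClusterNotAuthorized(
            f"org {pipeline_org!r} may not use build cluster {requested!r}")
    return cluster


clusters = [
    BuildCluster("generic-a", "queue-generic-a"),
    BuildCluster("gpu", "queue-gpu", frozenset({"ml-org"}), is_generic=False),
]
```

With this data, a pipeline in `ml-org` that asks for `gpu` is routed to `queue-gpu`, a pipeline that names no cluster lands on `generic-a`, and a pipeline from any other org that asks for `gpu` is rejected.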

References

Worker - https://github.com/screwdriver-cd/queue-worker

minzcmu commented 6 years ago

Implementations

Please refer to the design doc for details. Here is a to-do list:

[x] data-schema: add buildCluster table and also a buildClusterName field to build table
[x] model: model for buildCluster and pick buildCluster during build creation
[x] scm plugins: add a getOrgPermission function to check a user's permission against an org. Done for scm-github; scm-gitlab and bitbucket will follow later
[x] api: add endpoints for build cluster; add feature flag
[x] config-parser: throw an error if the config contains invalid buildCluster annotations
[x] queue-worker: add a scheduler mode to turn it into scheduler, deployed in ossd beta
[x] multi-build queue: uses RabbitMQ; server set up in both beta and prod
[x] build-queue-worker: poll jobs from multi-build queue and start builds immediately
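
The scheduler and worker items above can be pictured with a small in-memory Python sketch: one queue per registered cluster, builds routed by their `buildClusterName`, and each worker only able to poll its own cluster's queue. This only illustrates the routing idea. The real deployment uses RabbitMQ, and the `Scheduler` class, its method names, and the shared-token check (a stand-in for real worker authorization such as JWT verification) are all assumptions.

```python
import hmac
from collections import deque


class Scheduler:
    """Hypothetical in-memory stand-in for the scheduler plus multi-build queue."""

    def __init__(self):
        self._queues = {}  # cluster name -> deque of pending builds
        self._tokens = {}  # cluster name -> shared secret (stand-in for JWT auth)

    def register_cluster(self, name, token):
        # One build queue per registered build cluster.
        self._queues.setdefault(name, deque())
        self._tokens[name] = token

    def enqueue(self, build):
        # Route on the buildClusterName field stored with the build.
        name = build["buildClusterName"]
        if name not in self._queues:
            raise KeyError(f"no queue for build cluster {name!r}")
        self._queues[name].append(build)

    def poll(self, cluster_name, token):
        # A worker only ever sees jobs from its own cluster's queue.
        expected = self._tokens.get(cluster_name)
        if expected is None or not hmac.compare_digest(expected, token):
            raise PermissionError("worker is not authorized for this cluster")
        queue = self._queues[cluster_name]
        return queue.popleft() if queue else None


scheduler = Scheduler()
scheduler.register_cluster("generic-a", "token-a")
scheduler.register_cluster("gpu", "token-gpu")
scheduler.enqueue({"id": 1, "buildClusterName": "generic-a"})
scheduler.enqueue({"id": 2, "buildClusterName": "gpu"})
```

A worker in the `gpu` cluster that polls with its own token receives build 2 and never sees build 1, which is the isolation the Scheduler objective asks for.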

minzcmu commented 5 years ago

Updates 11/20

Plans for rolling it out / testing it out without downtime:

Before testing and verification, make sure there is NO default buildCluster (otherwise every build will get a buildClusterName and cannot be handled by the old queue-worker).
Deploy the RabbitMQ server, consumer, and scheduler.
Create an external build cluster with managedByScrewdriver: false
Send jobs to that external cluster, e.g. https://github.com/minz-test-org/waffle/blob/master/screwdriver.yaml#L4
If builds run fine, update the buildCluster to managedByScrewdriver: true
Run API functional tests to make sure nothing is broken
Let the old queue-worker run for a while (e.g. 90 mins) to make sure the previously running builds can be stopped by the queue-worker properly. Then delete it.

minzcmu commented 5 years ago

Updates 01/02

Feature is live and done.

New services added:

RabbitMQ server: installed with the rabbitmq-ha Helm chart (https://github.com/helm/charts/tree/master/stable/rabbitmq-ha)
Scheduler: configure the scheduler part properly (https://github.com/screwdriver-cd/queue-worker/blob/master/config/custom-environment-variables.yaml#L182)
Buildcluster queue worker: set up the worker in the build cluster (https://github.com/screwdriver-cd/buildcluster-queue-worker)
