screwdriver-cd / screwdriver

An open source build platform designed for continuous delivery.
http://screwdriver.cd

Multiple Build Clusters #1319

Closed jithine closed 5 years ago

jithine commented 6 years ago

Context

This applies to a scalable Screwdriver instance that uses executor-queue.

Screwdriver builds are executed via a single queue worker, which reads job configurations from the queue and uses the appropriate executor to run the build.

This model can be extended so that Screwdriver sends build information to multiple build clusters, which provides the following capabilities:

  1. Running at least two build clusters allows us to take down one build cluster for critical infrastructure updates.
  2. Screwdriver users can potentially bring their own build clusters, so that they can use specialized hardware for specific needs.
  3. Screwdriver can support a build cluster in AWS that leverages EC2 cloud compute resources.

Objective

  1. Create a Scheduler service that fronts the Screwdriver build queue
    1. Workers will no longer connect directly to the queue to get new builds.
    2. The Scheduler service is responsible for distributing builds to Workers running in different build clusters.
    3. The Scheduler service should be able to authorize a Worker service and expose only the job information that is applicable to that Worker's build cluster.
    4. The Scheduler should create one build queue per registered build cluster.
  2. Register a build cluster
    1. A cluster admin should be able to register their build cluster with Screwdriver. Registration should take in information such as the cluster name, the name of the queue to be created for the cluster, a means of authorization (e.g. a JWT public key for the cluster), and the SCM organizations that pipelines must belong to in order to use this cluster.
  3. Implicit and Explicit build clusters
    1. Users may or may not specify a build cluster in their Screwdriver pipeline config.
    2. If the user has not specified a build cluster (implicit), the Screwdriver API should assign any of the generic build clusters applicable to the build (the default supported build cluster(s)).
    3. If the user has specified a build cluster in the Screwdriver pipeline config (explicit), the Screwdriver API should enqueue the job in the build queue for that build cluster.
      1. If it is a specialized build cluster (provided by a Screwdriver customer), the Screwdriver API must validate that the owner of the pipeline is authorized to use the cluster for user builds. We can tie this authorization to the SCM organizations captured when the cluster is registered.
  4. Document design at https://github.com/screwdriver-cd/screwdriver/tree/master/design
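
To make the implicit and explicit selection rules above more concrete, here is a minimal sketch in Python. It is not taken from the Screwdriver codebase: the `BuildCluster` fields, the `generic` flag, the exception names, and `select_queue` are all hypothetical names chosen for illustration, and picking a generic cluster at random is only one possible assignment policy.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildCluster:
    """Hypothetical registration record; field names are assumptions."""
    name: str
    queue_name: str                 # queue created for this cluster
    generic: bool                   # True for a default, Screwdriver-provided cluster
    scm_organizations: frozenset = frozenset()  # orgs allowed on a specialized cluster


class ClusterNotFound(Exception):
    """No cluster matches the request."""


class ClusterNotAuthorized(Exception):
    """The pipeline's SCM organization may not use the requested cluster."""


def select_queue(clusters, pipeline_scm_org, requested_cluster=None, rng=random):
    """Return the name of the queue a build should be enqueued in."""
    if requested_cluster is None:
        # Implicit: assign any of the generic (default) build clusters.
        candidates = [c for c in clusters if c.generic]
        if not candidates:
            raise ClusterNotFound("no generic build cluster is registered")
        return rng.choice(candidates).queue_name

    # Explicit: the pipeline config names a cluster.
    by_name = {c.name: c for c in clusters}
    cluster = by_name.get(requested_cluster)
    if cluster is None:
        raise ClusterNotFound(f"unknown build cluster: {requested_cluster}")

    # Specialized clusters are restricted to the SCM organizations
    # captured at registration time.
    if not cluster.generic and pipeline_scm_org not in cluster.scm_organizations:
        raise ClusterNotAuthorized(
            f"{pipeline_scm_org} may not use build cluster {cluster.name}"
        )
    return cluster.queue_name


if __name__ == "__main__":
    registered = [
        BuildCluster("generic-a", "queue-generic-a", generic=True),
        BuildCluster("gpu", "queue-gpu", generic=False,
                     scm_organizations=frozenset({"acme"})),
    ]
    print(select_queue(registered, "some-org"))      # implicit selection
    print(select_queue(registered, "acme", "gpu"))   # explicit, authorized
```

In this sketch the authorization check lives next to the queue lookup, so a build is never enqueued for a cluster its pipeline is not allowed to use. A real implementation would also need to cover the Worker-side authorization (for example the JWT public key mentioned above), which is left out here.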

References

  1. Worker - https://github.com/screwdriver-cd/queue-worker
minzcmu commented 6 years ago

Implementations

Please refer to the design doc for details. Here is a to-do list:

minzcmu commented 5 years ago

Updates 11/20

Plans for rolling it out / testing it out without downtime:

  1. Before testing and verification, make sure there is NO default buildCluster (otherwise all the builds will have the buildClusterName and cannot be handled by the old queue-worker)
  2. Deploy rabbitmq server, consumer, scheduler
  3. Create an external build cluster with managedByScrewdriver: false
  4. Send jobs to that external cluster, e.g. https://github.com/minz-test-org/waffle/blob/master/screwdriver.yaml#L4
  5. If builds run fine, then update the buildCluster with managedByScrewdriver: true
  6. Run API functional tests to make sure nothing is broken
  7. Let the old queue-worker run for a while (e.g. 90 mins) to make sure the previously running builds can be stopped by the queue-worker properly. Then delete it.
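
The reason for step 1 can be illustrated with a small sketch. This is an assumption about the routing behavior described in the plan, not code from Screwdriver: `resolve_build_cluster_name` and `route` are hypothetical helpers, and the returned labels are placeholders.

```python
from typing import Optional


def resolve_build_cluster_name(requested: Optional[str],
                               default_cluster: Optional[str]) -> Optional[str]:
    """Use the cluster named in the pipeline config, else the default (if any)."""
    return requested if requested is not None else default_cluster


def route(build_cluster_name: Optional[str]) -> str:
    """Builds without a buildClusterName stay on the old queue-worker path."""
    if build_cluster_name is None:
        return "old queue-worker"
    return f"build cluster {build_cluster_name}"


if __name__ == "__main__":
    # No default cluster: only builds that opt in leave the old path.
    print(route(resolve_build_cluster_name(None, None)))
    print(route(resolve_build_cluster_name("external", None)))
    # With a default cluster, every build gets a buildClusterName.
    print(route(resolve_build_cluster_name(None, "default")))
```

With no default cluster, only builds that explicitly name a cluster are sent to the new path, so the old queue-worker keeps handling everything else during testing.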
minzcmu commented 5 years ago

Updates 01/02

Feature is live and done.

New services added: