konflux-ci/multi-platform-controller

== Konflux Multi Platform Controller

This controller orchestrates remote agents for RTHAP. It watches for TaskRun instances that need a secret that starts with multi-platform-ssh- prefix and have a PLATFORM parameter, and will then supply SSH credentials for a node of the requested platform. The TaskRun can then use this node to build using a different platform. When the TaskRun is done the controller deals with cleanup up the remote node.

== Try It Out

This is ready to use on Konflux. An example of a Multi Platform Pull Request is located at: https://github.com/stuartwdouglas/multi-platform-test/pull/2

== How it works

=== The Buildah Remote Task

The buildah-remote task runs the container build step of the existing buildah task via ssh on a remote host. This task is currently created programatically by cmd/taskgen/main.go. The task has the following differences from the standard buildah task:

It has a PLATFORM parameter that is used by the controller to decide which host to provision for the build.
The build-container step has been modified to be run remotely over SSH. The workspace is copied to the remote host, the build is done via podman, and then the image copied back. The commands run on the remote host are largely the same as the existing buildah task, just executed via podman.
It expects the creation of a secret with the name multi-platform-ssh-$(context.taskRun.name) to be created by the controller. This secret will have an id_rsa private key, and the name of the host to connect to. If no host is available it contains an error message so the task does not wait forever.

Other than that the buildah-remote task attempts to create the image in the same way as the existing buildah task. Things like creating the SBOM and pushing the image are done on cluster, so there is no need to expose credentials to the remote host.

=== The Controller

The controllers job is to look for TaskRun objects with the appropriate labels that expect a multi-platform-ssh-$(context.taskRun.name), and then create this secret so the task can execute.

The controller has three different strategies that can be use to allocate hosts. Fixed pools, dynamic allocation and dynamic pooling. These strategies are configured on a platform basis, however in practice as the platform is an arbitrary string any number of platforms can be defined, allowing for different configurations of the same underlying platform.

Fixed Pools:: Fixes pools are a fixed set of machines configured in the host-config ConfigMap. Each host is configured with its address, SSH Key and a concurrency limit as to how many jobs can run on the host at once. Allocation of jobs to hosts is done by selecting the host with the lowest number of running jobs. If all hosts are at their limit the job is queued.

Dynamic:: Dynamic allocation will allocate a host from a selected cloud provider. Different cloud providers have different configuration options, and the multi platform controller tries to expose all relevant ones. Each dynamic allocator has a max-instances configuration option that limits the maximum number of concurrent instances. This setting is per cloud provider, as the limit is checked by counting the number of running instances. This means that multiple pools within the same cloud provider will share the same instance counts. If you want to avoid this behaviour you can set a pools instance-tag, which will mean that the controller only counts instances with this tag towards the limit. This was a deliberate decision to try and limit the possibility of a bug in the controller creating a catastrophic AWS bill.

Dynamic Pools:: This is a combination of the above two. This has all of the config options of the first two, and a time to live. Hosts are only started on an as-needed basis, so the pool will scale to zero if there is no load. When a host is created it will join the pool, and will execute concurrent jobs up to the limit specified by the concurrency param. If all hosts are full another VM will be requested from the cloud provider, up to the max-instances limit. Once a host has reached its time-to-live it is no longer schedulable, and once all running jobs are completed it is shut down.

Once a host has been allocated then it is provisioned by a Tekton task. This task will create a non-privileged user to run the build, and create an SSH key for that user. Once this key is created it is send to the OTP server to be consumed by the task, and a secret is created that contains the OTP password.

=== The OTP Server

The OTP server is a basic in-memory service that maps one time passwords to SSH keys. This means that there is no way for an attacker to steal a SSH key from a secret, as the only place the key is seen is inside the task itself. If an attacker does steal a password and use it to retrieve the key then the original task will be unable to and will fail.

Note that there is room to improve here, as if the server is restarted then all in memory passwords and keys are lost, which may cause tasks to fail. These could be written out to a PVC or database to allow for restarts, but care must be taken that this will not result in the possibility of a key being returned twice.

=== Cleanup

Finalizers are used to ensure all resources that were allocated can be cleaned up.

konflux-ci / multi-platform-controller

readme