overlap with https://cluster-api.sigs.k8s.io/

cgwalters commented 7 months ago

There's some logical overlap here with https://cluster-api.sigs.k8s.io/ btw that seems like it'd be good to at least think through.

As well as https://www.redhat.com/en/blog/learn-about-red-hat-peer-pods-openshift-sandboxed-containers

cgwalters commented 7 months ago

I'm coming here btw after seeing https://github.com/redhat-appstudio/multi-platform-controller/pull/194 which I was pointed at after hitting race conditions in my project's jobs.

The cluster API code is heavily battle tested against exactly things like this - how and when to retry underlying cloud infra API requests, how to handle auth, etc.
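For illustration, the "how and when to retry" concern can be sketched as capped exponential backoff with jitter. This is a minimal sketch and not code from Cluster API or the multi-platform controller: `TransientCloudError`, `call_with_backoff`, and all parameter values are hypothetical names and defaults chosen for the example.

```python
import random
import time


class TransientCloudError(Exception):
    """Hypothetical stand-in for a retryable cloud API failure (e.g. throttling)."""


def call_with_backoff(request, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call request() until it succeeds or max_attempts is exhausted.

    Waits with capped exponential backoff plus full jitter between attempts.
    Only TransientCloudError is retried; any other exception propagates.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except TransientCloudError:
            if attempt == max_attempts:
                # Out of attempts: surface the last failure to the caller.
                raise
            # The delay cap doubles each attempt: 0.5s, 1s, 2s, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter spreads retries out so clients do not retry in lockstep.
            sleep(random.uniform(0, cap))
```

The `sleep` parameter is injectable only so the behaviour can be exercised without real waiting; a real controller would more likely requeue the work item than block a thread.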

stuartwdouglas commented 7 months ago

That looks like it is more designed to allocate whole clusters, can it be used to allocate individual VMs? I had a quick poke around but it was not immediately obvious if this was possible.

cgwalters commented 7 months ago

On Tue, Apr 23, 2024, at 9:39 PM, Stuart Douglas wrote:

That looks like it is more designed to allocate whole clusters, can it be used to allocate individual VMs?

The name is misleading. It is primarily an abstraction over spawning VMs, declaratively controlled via CRD. That mechanism happens to be useful for provisioning kube clusters.

BTW OpenShift HCP/hypershift relies on cluster API and we are aiming to support it for standalone too. IOW cluster API will be the one way OCP itself spawns VMs.

(But again then the next thing is, given the goal is “run a container in that VM”, that’s what peer pods does)

brianwcook commented 2 months ago

Trying to resolve this old thread. @arewm I remember you investigated peer pods and found that we couldn't use it (at that time, anyway). Do you remember why? It would be good to have it here.

arewm commented 2 months ago

I looked into both peer pods and CAPI previously. At the time that work started on the multi-platform controller, CAPI did not support provisioning resources on IBM Cloud for s390x.

While peer pods has many similarities to the architecture for the multi-platform controller, there were also limitations to support for IBM Cloud provisioning as well as for supporting syncing data from PVCs. I feel like most of these issues should be resolved if we use the community version of peer pods, but they might not all be supported if we use the Red Hat version.

cgwalters commented 2 months ago

My take, though, is that even if features were missing from one or two of those other codebases, carrying a fork that adds whatever changes are needed would be less of a long-term maintenance burden than having a completely new codebase.

Specifically, using either CAPI or peer pods we'd get support for a ton of major public clouds instead of being tied to AWS as this codebase is today.

arewm commented 2 months ago

I agree. I think that we should try to reuse upstream projects within Konflux-CI instead of inventing our own solutions.

I didn't continue to look at CAPI after the initial investigation. I have been trying to keep a pulse on the use of cloud-api-adaptor from Kata (i.e. peer pods), as I feel its approach is consistent with the one that was implemented in the multi-platform controller.

brianwcook commented 2 months ago

@arewm I agree, we should not reinvent things when we can avoid it.

@ifireball has taken over as primary maintainer of this repo. Barak, do you want to investigate CAPI or peer pods as alternatives for scheduling jobs? I am not worried about downstream; attack it for the Konflux community using upstream.

cgwalters commented 2 months ago

I forgot to mention earlier but since I think it's relevant: I didn't just randomly come to this repository and look at the code. At some point, I was debugging a CI failure which looked very much like flakes in "ssh to machine to perform task" that are part of what this project is doing.

I think a more Kubernetes-native model would instead look like scheduling a pod, driving it to completion, and monitoring its status asynchronously, which a peer-pod-like model enables.
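As a rough sketch of that model, the caller submits a pod and then only observes its phase until it reaches a terminal state, instead of holding an SSH session open. The phase names are the standard Kubernetes pod phases; `wait_for_completion` and `get_phase` are hypothetical, and simple polling stands in here for a real API watch.

```python
import time

# Standard Kubernetes pod phases that mean the pod will not run any further.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def wait_for_completion(get_phase, poll_interval=2.0, timeout=600.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll get_phase() until the pod reaches a terminal phase or the timeout passes.

    get_phase is a hypothetical callable returning the pod's current phase
    string; in a real controller it would be backed by the Kubernetes API.
    """
    deadline = clock() + timeout
    while True:
        phase = get_phase()
        if phase in TERMINAL_PHASES:
            return phase
        if clock() >= deadline:
            raise TimeoutError(f"pod still {phase} after {timeout}s")
        sleep(poll_interval)
```

The point of the design is that task state lives in the pod status, so a dropped connection between the controller and the machine does not by itself fail the task.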

ifireball commented 2 months ago

TL;DR: Other solutions are on our radar, but we have things we must do before we can look more seriously at them

AFAIK the cluster API is merely a pending API standard without solid implementations we could use at this point. Alex looked into it when he was designing our EaaS solution, and decided to go with something else at this point in time.

We had a chat with the folks working on Kata peer pods, which is probably the most relevant option for our multi-platform support. As far as I could tell, up until chatting with us, they had assumed all the pods in a cluster would be running on the same architecture, and that wouldn't fit multi-arch builds.

I don't know whether they have revised their plans since.

In any case, we have other priorities to do with stabilizing and building support around our existing solution, since it's already running in production. We seriously can't look at other solutions until we are done with those.

brianwcook commented 2 months ago

Barak you are helping me remember - cluster API does look good but in addition to it being very new, it was an alpha feature that we would not be able to enable on either of our target platforms (EKS, OpenShift).

arewm commented 2 months ago

I created a tracker for using kata containers and the cloud-api-adaptor in the Konflux-ci upstream: https://issues.redhat.com/browse/KONFLUX-4358.

brianwcook commented 2 months ago

I would like to make more use of Kata; unfortunately deployments of Konflux on AWS can't use it because most AWS VMs do not support nested virt.

arewm commented 2 months ago

@brianwcook, as I understand it, Kata is just about running pods in a VM for further sandboxing. There are multiple modes of operation of Kata (runtime classes). One is with qemu, which requires either bare metal or nested virtualization. This is not supported on AWS, as you indicated. Another mode of operation is to use peer pods, which leverages the cloud-api-adaptor to provision new VMs.

While qemu will not work, we can still leverage Kata within our deployments with the peer pod approach. This would just spin up new VMs for pods to run workloads in. These VMs can either be on the same architecture, to enable sandboxed environments that need elevated privileges, or on different architectures, to enable multi-architecture builds.
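To make the choice between the two modes concrete, here is a minimal sketch that builds a Pod manifest and picks a runtime class. `runtimeClassName` is a real Pod spec field, but the class names `kata-qemu` and `kata-remote` are assumptions about how a given cluster names its Kata runtime classes, and `build_pod_manifest` is a hypothetical helper. How the target architecture is passed to the cloud-api-adaptor is deliberately left out.

```python
def build_pod_manifest(name, image, target_arch, node_arch, nested_virt_available):
    """Return a Pod manifest dict whose runtime class follows the rules above.

    A peer pod (remote VM) is needed when the workload targets a different
    architecture than the node, or when the node cannot run qemu locally.
    """
    needs_remote_vm = target_arch != node_arch or not nested_virt_available
    # Runtime class names are assumptions; check what the cluster actually defines.
    runtime_class = "kata-remote" if needs_remote_vm else "kata-qemu"
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "runtimeClassName": runtime_class,
            "restartPolicy": "Never",
            "containers": [{"name": "task", "image": image}],
        },
    }
```

On AWS instances without nested virtualization, this sketch always falls through to the peer pod class, which matches the point that qemu is unavailable there while peer pods remain usable.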

brianwcook commented 2 months ago

That makes sense. Somewhat off topic but related, we could try using peer pods to do disk image generation. We are currently using multi-platform controller for it because we need root access to filesystem stuff. Still doesn't solve for multi-arch.

konflux-ci / multi-platform-controller

overlap with https://cluster-api.sigs.k8s.io/ #197