Feature: Add a resource to request the execution of a workload on a BareMetalHost

pierrecregut commented 3 months ago

We need a resource that abstractly represents a workload executing on a compute resource (namely a BareMetalHost):

It MUST describe the workload completely:
- specification of an OS image
- specification of an initial configuration. It SHOULD support all the supported initial configuration formats (cloud-init, ignition).
- online status.
It MUST abstract away the identity of the BareMetalHost: a user should be able to describe a workload to execute and a set of requirements on the resource executing it. This is the host selector mechanism exposed at the level of Metal3Machine, but made independent of it.
It MUST be abstract enough that, with appropriate controllers, we can target compute resources other than BareMetalHost that provide a similar API: typically virtual machines on private or public clouds that can execute an arbitrary OS configured with either cloud-init or ignition on first boot.
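
As a rough illustration of these requirements, a claim could look like the manifest below. This is only a sketch: the `HostClaim` kind, the `example.metal3.io/v1alpha1` API group, the object names and the URLs are assumptions, not an agreed design. The `image`, `userData` and `online` fields are modeled on the BareMetalHost spec, and `hostSelector` on the Metal3Machine spec.

```yaml
# Hypothetical resource: kind, API group and field layout are assumptions.
apiVersion: example.metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: web-frontend-0
  namespace: tenant-a
spec:
  # OS image to deploy (fields modeled on BareMetalHost spec.image)
  image:
    url: https://images.example.com/ubuntu-22.04.qcow2
    checksum: https://images.example.com/ubuntu-22.04.qcow2.sha256sum
    checksumType: sha256
    format: qcow2
  # Initial configuration (cloud-init or ignition), stored in a secret
  userData:
    name: web-frontend-0-user-data
  # Desired power state of the workload
  online: true
  # Requirements on the compute resource, instead of a host name
  hostSelector:
    matchLabels:
      rack: r1
```

Nothing in this spec refers to a BMC or to the way the first boot is performed, which is what would let other controllers satisfy the same claim.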

The resource MUST be usable in place of BareMetalHost as the associated target of a Metal3Machine in the Cluster API Provider Metal3 (CAPM3). It SHOULD behave as a BareMetalHost and MUST be transparent for at least the following features:

- data templates,
- in place updates,
- metal3 remediation.

It MUST support pivoting but may change its semantics. There MUST be a way to point to a compute resource in another cluster if the resource has the right credentials to do so.

Supported use cases

Description of workloads to execute directly on a bare metal server

We want to execute some simple services directly on bare metal. To get better utilization of the hardware resources, we may not want to specify exactly which machine to use, as long as it fulfills a set of requirements.

Multi-tenancy for CAPM3

We want to share a set of BareMetalHosts between several clusters belonging to different users. Each user should have a namespace for their cluster. Users must be able to use BareMetalHosts without taking full control of the hardware. They must never get access to the BMC credentials, but they must have a sufficient view of the servers they use to configure their cluster (mainly to fill the data templates). When a node is stopped in a cluster, the underlying bare-metal server must be usable by another host.

Hybrid clusters

We want to create clusters with Cluster API with nodes hosted on different kinds of compute resources (servers, VMs in public or private clouds). Today the known solutions are either:

- centered around light control planes and one kind of workers (Kamaji),
- complex and incomplete for day 1 operations (Bring Your Own Host),
- hacks (use of several cluster objects with only one implementing the control plane, as presented in https://metal3.io/blog/2022/07/08/One_cluster_multiple_providers.html).

The Metal3 Cluster API provider is a complete solution, relatively abstract from the underlying compute resource, with a lot of tooling (data templates, strong IPAM integration, notion of remediation, support for in place update). It could cover hybrid clusters if there are multiple controllers linking the workload resource with different kinds of compute resources, in the same way as a persistent volume claim can target different storage classes.
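
To make the storage class analogy concrete, here is a minimal sketch of two claims served by different controllers. Everything in it is hypothetical: the `HostClaim` kind, the API group, and in particular the `spec.kind` field and its values, which only stand in for whatever mechanism would select the controller.

```yaml
# Hypothetical: "spec.kind" plays the role that storageClassName plays for a
# PersistentVolumeClaim. Kind values and API group are assumptions.
apiVersion: example.metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: worker-0
spec:
  kind: baremetal   # handled by a controller that binds to a BareMetalHost
  online: true
---
apiVersion: example.metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: worker-1
spec:
  kind: cloud-vm    # handled by a different controller that creates a VM
  online: true
```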

Pivot Semantics

Regarding BareMetalHosts, if we want to support multi-tenancy, we cannot pivot them, as this would give full control of the hardware (BMC credentials) to the customer. It would also make it impossible to reuse the servers on the initial cluster when the pivoted cluster is scaled down. So we want to pivot only the workload resource, which means that it will point to servers on another cluster. From the user's point of view, this may mean a decrease in dependability, because the BareMetalHost controller and Ironic are still hosted on the initial cluster: if the link between the initial cluster and the pivoted cluster is severed, the pivoted cluster will not be able to update the state of its underlying servers.

pierrecregut commented 3 months ago

I have not given a name to this resource. @lentzi90 suggests BMHClaim, but in this proposal we also want to target other compute resources. This is why we called it Host, but that loses the important notion of requesting a resource. So why not HostClaim?

pierrecregut commented 3 months ago

The Host resource in the Kanod project, whose design is described here: https://gitlab.com/Orange-OpenSource/kanod/reference/-/blob/master/blueprints/multi-tenancy-hybrid.md?ref_type=heads, answers most of the requirements. As the remote part is not implemented yet, it has not been tested and the design may evolve.

lentzi90 commented 3 months ago

Thank you for creating this nice and detailed issue! I think we are after pretty much the same thing so it should be possible to find a solution that fits. :slightly_smiling_face: Adding some comments to specific things below.

It MUST be abstract enough so that we can target with appropriate controllers other compute resources than BareMetalHost that provide a similar API: typically virtual machines on private or public clouds that can execute arbitrary OS configured with either cloud-init or ignition on first boot.

This is tricky to promise as a project. Since we do not have control over the APIs of public clouds and other compute resources there is no way to guarantee it. I agree though that we should try to make the API generic enough so that it can be used for this.

Multi-tenancy for CAPM3

I think we are fully in agreement here! What is not quite clear to me yet is where exactly to do the split. This also affects the pivoting scenario. I see at least two options:

1. The Claim resources are in the same cluster as the BMHs
2. The Claim resources are in the same cluster as the Metal3Machines

For alternative 1, I think it would mean that CAPM3 should directly create the Claims in the remote cluster. BMO or some new controller would then bind them to BMHs and update the necessary fields. Secrets for user-data, meta-data and network-data would also be created in the remote cluster.

 ┌───────────────────────────────┐     ┌─────────────────────────────────┐
 │ Cluster 1                     │     │ Cluster 2                       │
 │                               │     │                                 │
 │  ┌──────┐                     │     │                        ┌──────┐ │
 │  │ CAPI │     Metal3Machine   │     │  Claim                 │BMO   │ │
 │  ├──────┤                     │     │                        ├──────┤ │
 │  │ CAPM3│                     │     │                        │Ironic│ │
 │  └──────┘     Metal3Data      │     │  BareMetalHost         └──────┘ │
 │                               │     │                                 │
 │                               │     │                                 │
 │                               │     │  user-data-secret               │
 │                               │     │                                 │
 │                               │     │                                 │
 └───────────────────────────────┘     └─────────────────────────────────┘

For alternative 2, I think the Claims would then need to hold some reference to the remote cluster. In this case it would be CAPM3 or a new controller next to it that would "reach out" to the remote cluster and manage the BMHs.

 ┌───────────────────────────────┐     ┌─────────────────────────────────┐
 │ Cluster 1                     │     │ Cluster 2                       │
 │                               │     │                                 │
 │  ┌──────┐                     │     │                        ┌──────┐ │
 │  │ CAPI │     Metal3Machine   │     │                        │BMO   │ │
 │  ├──────┤                     │     │                        ├──────┤ │
 │  │ CAPM3│                     │     │                        │Ironic│ │
 │  └──────┘     Metal3Data      │     │  BareMetalHost         └──────┘ │
 │                               │     │                                 │
 │                               │     │                                 │
 │               Claim           │     │  user-data-secret               │
 │                               │     │                                 │
 │                               │     │                                 │
 └───────────────────────────────┘     └─────────────────────────────────┘
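
For alternative 2, a Claim carrying the reference to the remote cluster might look like the sketch below. The kind (called `HostClaim` here), the API group and the `remoteRef` field with its sub-fields are all assumptions made for illustration.

```yaml
# Hypothetical layout for alternative 2: the Claim lives next to the
# Metal3Machine and carries a reference to the cluster that holds the BMHs.
apiVersion: example.metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: node-0
  namespace: my-cluster
spec:
  online: true
  # Assumed field: where and how to reach the remote BMH cluster
  remoteRef:
    # Secret in Cluster 1 holding a kubeconfig for Cluster 2
    kubeconfigSecretName: cluster-2-kubeconfig
    # Namespace in Cluster 2 where BMHs are looked up
    namespace: tenant-a
```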

I think alternative 1 makes more sense but wanted to mention both. I'll write a separate comment about pivoting and how I imagine it could work.

lentzi90 commented 3 months ago

Pivoting

I will here try to describe how I think pivoting should work. The full scenario would be to start with a bootstrap "all in one" cluster, create two workload clusters, move CAPI resources to one and BMO resources to the other.

The bootstrap cluster is any kind of cluster used to get started, e.g. created using kind. The only change from the current situation here is that there would be a Claim for the BMH instead of CAPM3 acting directly on it.

 ┌──────────────────────────────┐ 
 │ Bootstrap cluster            │ 
 │                              │ 
 │   ┌──────┐   Metal3Machine   │ 
 │   │ CAPI │                   │ 
 │   ├──────┤   Metal3Data      │ 
 │   │ CAPM3│                   │ 
 │   ├──────┤   Claim           │ 
 │   │BMO   │                   │ 
 │   ├──────┤   BareMetalHost   │ 
 │   │Ironic│                   │ 
 │   └──────┘   user-data-secret│ 
 │                              │ 
 └──────────────────────────────┘

Pivot BMO

In this step we add the paused annotation to the CAPI/CAPM3 objects. We deploy BMO/Ironic in the target cluster. Then we move the Claims, BMHs and secrets, just like we currently do. As long as the CAPI/CAPM3 objects are paused, there will not be any issue on this side.

The interesting part is then to update the CAPM3 objects to point to the target cluster. I imagine that the Metal3Cluster object would hold a reference to a secret with access information. This would be changed before the objects are un-paused. At this point, the controllers should then reconcile them, find the new access information and use that.
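
For illustration, a paused object could look like the following. The annotation key is the Cluster API paused annotation; the object name is made up, and whether the Claim-based flow would rely on exactly this mechanism is an assumption.

```yaml
# Sketch: a Metal3Machine paused while BMO resources are moved.
# The object name is illustrative.
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3Machine
metadata:
  name: test-controlplane-0
  annotations:
    cluster.x-k8s.io/paused: "true"
```

The access information would be updated while this annotation is in place, and the annotation removed afterwards so that the controllers reconcile against the target cluster.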


 ┌──────────────────────────────┐   ┌─────────────────────────────────┐
 │ Bootstrap cluster            │   │ Cluster 1                       │
 │                              │   │                                 │
 │   ┌──────┐   Metal3Machine   │   │                        ┌──────┐ │
 │   │ CAPI │                   │   │  Claim                 │BMO   │ │
 │   ├──────┤   Metal3Data      │   │                        ├──────┤ │
 │   │ CAPM3│                   │   │                        │Ironic│ │
 │   └──────┘                   │   │  BareMetalHost         └──────┘ │
 │                              │   │                                 │
 │                              │   │                                 │
 │                              │   │  user-data-secret               │
 │                              │   │                                 │
 │                              │   │                                 │
 └──────────────────────────────┘   └─────────────────────────────────┘

Pivot CAPI

This should work exactly as today, with one notable exception. The user must ensure that the access information works from the target cluster. If the objects were "co-located" in the same cluster before move, the user should for example ensure that the access information works "externally" (e.g. using a public IP instead of a cluster IP).

Access information

This is how I imagine we could provide the access information. I took inspiration from CAPO.

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3Cluster
metadata:
  name: test
spec:
  identityRef:
    # Name of a secret containing a kubeconfig used to access the cluster where the BMHs are
    name: bmh-cluster-kubeconfig
    # Name of the context in the kubeconfig to use
    contextName: bmh-cluster

The identityRef could be optional, with a default fallback to "in-cluster config" and RBAC like it is currently done.
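
The secret referenced above could be shaped roughly as follows. This is a sketch under assumptions: the key name `kubeconfig`, the user name, the token placeholder and the server address (taken from a documentation address range) are all illustrative, and certificate data is omitted for brevity.

```yaml
# Sketch of the secret referenced by identityRef. The key name "kubeconfig",
# the user and the server address are assumptions.
apiVersion: v1
kind: Secret
metadata:
  name: bmh-cluster-kubeconfig
type: Opaque
stringData:
  kubeconfig: |
    apiVersion: v1
    kind: Config
    clusters:
    - name: bmh-cluster
      cluster:
        # Externally reachable address, not a cluster IP
        server: https://203.0.113.10:6443
    contexts:
    - name: bmh-cluster
      context:
        cluster: bmh-cluster
        user: capm3
    users:
    - name: capm3
      user:
        token: <service-account-token>
    current-context: bmh-cluster
```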

dtantsur commented 3 months ago

Potential prior art: https://github.com/metal3-io/metal3-docs/pull/268

Rozzii commented 3 months ago

/triage accepted

@pierrecregut Please create a formal architecture proposal in the metal3-docs and link the proposal PR here. It will be much easier to discuss this in a proposal PR form.

Rozzii commented 3 months ago

/kind feature

pierrecregut commented 3 months ago

@lentzi90 regarding the first mentioned requirement, "being abstract enough": it means that the spec part of the host claim should not mention anything related to the BMC or to the way the first boot is done (e.g. PXE interface). It also means that the status should be abstract enough to be filled by other controllers: info too specific to Ironic should be avoided, but it is also true that an operator for a specific target compute could work without filling every status field.

pierrecregut commented 3 months ago

@lentzi90 regarding scenario 1 or 2, I agree with you that pivot pushes for adopting 2, but I think there is a strong case for scenario 1: if you put the hostclaim with the bmh, you want/need to give visibility on the hostclaim to the tenant:

- if you do not use capm3 but directly use the hostclaim,
- after pivot, the tenant has full control over the capm3 controller, and so over the hostclaim.

If all your hostclaims from different tenants are in the same namespace as the bmhs, then you have to solve how you do the access control over those hostclaims. With traditional k8s rbac I don't think this is easy. If they are in the cluster namespace, then this is trivially solved.
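
As a sketch of why the cluster-namespace case is simple, standard namespaced RBAC is enough to confine a tenant to its own claims. The `example.metal3.io` API group, the `hostclaims` resource name, the namespace and the user name are assumptions.

```yaml
# Sketch: tenant access limited to claims in the tenant's own namespace.
# The API group and the "hostclaims" resource name are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: hostclaim-user
  namespace: tenant-a
rules:
- apiGroups: ["example.metal3.io"]
  resources: ["hostclaims"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: hostclaim-user
  namespace: tenant-a
subjects:
- kind: User
  name: tenant-a-admin
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: hostclaim-user
  apiGroup: rbac.authorization.k8s.io
```

No rule grants access to BareMetalHosts or BMC credential secrets, so the tenant sees only the claims in `tenant-a`.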

In the solution we have implemented, we do not choose: there is a resource in the cluster namespace where you express your intent and have visibility over the status. There is a resource in the cluster with the bmh where we do the selection and where we can also perform for example quota control.

pierrecregut commented 3 months ago

@Rozzii Yes, I will. The issue was only a first step. The architecture proposal will be a cleaned-up and "de-kanod-ized" version of the blueprint mentioned above, with a better written motivation part (mainly taken from the issue above).

lentzi90 commented 3 months ago

Thank you @pierrecregut ! Feel free to reach out on slack/email also if you want to discuss details before pushing the PR :slightly_smiling_face:

metal3-io-bot commented 6 days ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Rozzii commented 6 days ago

/remove-lifecycle stale
/lifecycle frozen
