afritzler commented 4 months ago

Summary

Describe a concept on how we can put a Server into Maintenance mode. Also evaluate what it would mean to move it back into an operational state.

The Kubernetes community has a KEP extending the Node API by a declarative approach in handling Node maintenance: https://github.com/kubernetes/enhancements/issues/4212

We should see how those concepts can be applied for our Server objects.

Implications to consider:

How is the maintenance state influences the behaviour of controllers sitting on top of the metal API e.g.:

Cluster API provider
Gardener Machine Controller Manager
etc.

Expected outcome:

Concept document describing the above ideas and identifying potential missing fields/concepts in the metal types

Nuckal777 commented 2 months ago

To kick of some discussion here is a short sketch, which more or less is a trimmed down version of the NodeMaintenance proposal. We introduce the ServerMaintenance custom resource to signal maintenances on Servers.

// +enum
type ServerMaintenanceStage string

const (
// Idle announces a maintenance.
Idle ServerMaintenanceStage = "Idle"
// InMaintenance shuts down servers, potentially releasing server claims.
InMaintenance ServerMaintenanceStage = "InMaintenance"
// As long as no other maintenance is applied to a server, they are made ready again.
Complete ServerMaintenanceStage = "Complete"
)

type ServerMaintenance struct {
    ...
    Spec ServerMaintenanceSpec
    Status ServerMaintenanceStatus
}

type ServerMaintenanceSpec struct {
    // ServerSelector selects servers for this maintenance.
    ServerSelector *v1.ServerSelector

    // The order of the stages is Idle -> InMaintenance -> Complete.
    // The default value is Idle.
    Stage ServerMaintenanceStage

    // Reason for the maintenance.
    Reason string
}

As soon as the Stage is set to InMaintenance all selected Servers will have their .spec.power set to Off. The respective ServerClaims, if any, can either be re-bound or kept. Likely, such a setting needs to be added to a ServerClaim. On completion all selected Servers will have .spec.power to On, when no other ServerMaintenance with stage InMaintenance selects a Server. Completed ServerMaintenances need to be garbage collected eventually.

The cloud-provider-manager-metal can have a watch on ServerMaintenances and forward information about ServerMaintenances into it's Kubernetes cluster. Do we need some mechanism to block/deny maintenances on Servers?

aobort commented 2 months ago

@Nuckal777 thanks for input!

I'm pretty sure, that we must not power off servers in such an implicit manner. For instance, firmware updates, especially which require server reboot, should be also considered as maintenance. Hence, it might not be reasonable to force server power off.

Apart from that, any of controllers which work with Server CR would have to loop over ServerMaintenance objects to find whether server is in maintenance or not. This could be solved by the additional field or condition in servers status, however, it might cause data race between controllers as soon as both spec (to switch power) and status (to set maintenance state) have to be updated.

From my perspective, maintenance state should be represented in .spec of the Server CR:

it is explicit
it can be set by other operators
server controller itself will be able to determine whether object being reconciled is in maintenance or not

Nuckal777 commented 2 months ago

From my perspective, maintenance state should be represented in .spec of the Server CR

There is an issue with that design: When 2 controllers powered off a Server and one controller is finished, the finished controller will set the power state to on and the second controller to off. The controller start fighting over that field. metal3 uses a similar design for that reason. Further reasoning is in the declarative node API proposal.

For instance, firmware updates, especially which require server reboot, should be also considered as maintenance.

Why couldn't a the firmware upgrade handling create a ServerMaintenance and delete it after finishing?

Apart from that, any of controllers which work with Server CR would have to loop over ServerMaintenance objects to find whether server is in maintenance or not.

I agree. This would be a disadvantage.

afritzler commented 2 months ago

I like the idea of having an own resource initiating the Maintenance of a Server. Regarding the ServerClaim binding: Power management and other updates will only happen, when the Server is in a Reserved state - meaning once a Server is in Maintenance, only the MaintenanceReconciler should perform operations on the Server e.g. creating a BootConfiguration to boot into the BIOS/Firmware update mode etc.

aobort commented 2 months ago

There is an issue with that design: When 2 controllers powered off a Server and one controller is finished, the finished controller will set the power state to on and the second controller to off. The controller start fighting over that field. metal3 uses a similar design for that reason. Further reasoning is in the declarative node API proposal.

As @afritzler mentioned, it should be done in such a way, that particular controller may manage server's power state only if server is in particular state. Like if server is in Reserved state, then only ServerClaim controller could power it on/off, if server is in Maintenance state, then only Maintenance controller could power it on/off and etc.

Why couldn't a the firmware upgrade handling create a ServerMaintenance and delete it after finishing?

It definitely could, but I was pointing that is not necessarily need to power off the server in maintenance state.

Nuckal777 commented 1 month ago

After reading the node in-place update discussion of Gardener, I had some time to look into the topic again.

It was never explicitly mentioned in this ticket, but I think there are to main use cases for the maintenance of a Server:

updating/configuring BIOS/Firmware
hardware failure and replacement

Am I missing something from the metal-operator perspective?

As with the BIOS updates, I think we should focus on managing maintenances on a single Server first. Declaring a maintenance for multiple Servers via a label selector can be added with an additional CRD. That way we also work around the issue that the set of selected Servers by a maintenance can be larger than the subset of Servers, which can be maintained at once due to higher level concerns.

So, if I understand both of your correctly, you'd like the declaration of a maintenance to be a mean of moving the permission of power-cycling a Server from the ServerClaim controller to a different controller.

An alternative ServerMaintenance object, which targets a single Server, could like:

// +enum
type ServerMaintenanceStage string

const (
// Idle announces a maintenance.
Idle ServerMaintenanceStage = "Idle"
// InMaintenance shuts down servers, potentially releasing server claims.
InMaintenance ServerMaintenanceStage = "InMaintenance"
// As long as no other maintenance is applied to a server, they are made ready again.
Complete ServerMaintenanceStage = "Complete"
)

type ServerMaintenance struct {
    ...
    Spec ServerMaintenanceSpec
    Status ServerMaintenanceStatus
}

type ServerMaintenanceSpec struct {
    // ServerRef selects a server for this maintenance.
    ServerRef *v1.ObjectReference

    // The order of the stages is Idle -> InMaintenance -> Complete.
    // The default value is Idle.
    Stage ServerMaintenanceStage

    // TargetState specifies the state the server should be moved to
    // when the stage changes to InMaintenance
    TargetState ServerState

    // Reason for the maintenance.
    Reason string
}

Potentially, a list of maintenances can also be inlined into the Server object.

An exemplary upgrade flow for a Server could then look like this:

The Server is in the Reserved status in the beginning
ServerBIOS/ServerFirmware spec is changed
Respective controller creates a ServerMaintenance object in the Idle maintenance stage
This would signal to external systems (e.g. MCM) that a maintenance is necesscary
- External system prepares maintenances (e.g. migrating payload)
External system updates the ServerMaintenance stage to InMaintenance
Server status is changed to Updating (based on target state in the ServerMaintenance object)
- "Detaches" the Server from the ServerClaim controller and hands the responsibility for it over to the controller, which handles the upgrade
When the Update is done the maintenance stage in of the ServerMaintenance object is set to Complete
Server status is changed back to Reserved, because a ServerClaim exists
- Moves the responsibility back to the ServerClaim controller

The hardware failure case would be similar, but require a different controller, which just turns the power off and another HardwareFailure state on the Server object. Servers in the Available state can be moved to InMaintenance stage upon creation of the ServerMaintenance object.

stefanhipfel commented 1 month ago

How about adding a maintenance window as well? So one can decide when a bios or firmware upgrade should take place or should that be done on another service?

Maybe adding another stage: Planned ServerMaintenanceStage = "Planned"

ironcore-dev / metal-operator

Declarative `Server` Maintenance #76

Summary

Implications to consider:

Expected outcome: