siderolabs / omni-feedback

Omni feature requests, bug reports
https://www.siderolabs.com/platform/saas-for-kubernetes/

[bug] Cluster scale down and deletion is too slow #69

Closed: alexandrem closed this issue 10 months ago

alexandrem commented 10 months ago

Is there an existing issue for this?

Current Behavior

It seems that during scale-down or cluster-deletion requests, Omni processes the machines one by one, synchronously. For each machine in the "Destroying" phase, a reset and reboot is requested on the target machine, and Omni apparently waits for the machine to finish rebooting and reconnect to Omni in maintenance mode before proceeding to the next machine.

This is inefficient and problematic for large clusters, especially those with physical machines, where boot times are longer.

Earlier we had a cluster with 100+ machines, and many hours have passed since we requested a scale down. At even five minutes per reboot, a strictly sequential pass over 100 machines takes more than eight hours.

We expect to run clusters with 200+ machines, so I'm afraid this will be a show-stopper for us.

I'm also concerned that if there's a hardware failure and a machine is unable to connect back to Omni after its reboot, the cluster update/deletion operation will be blocked entirely.

Expected Behavior

Destroy operations should run concurrently, in an asynchronous manner.
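
For illustration, the kind of fan-out we have in mind looks like this (a minimal shell sketch, not Omni's actual code; destroy_machine is a hypothetical stand-in for the per-machine reset-and-reboot step, and the machines array is assumed to be populated):

    # Kick off every reset at once instead of waiting on each machine in turn.
    for machine in "${machines[@]}"; do
        destroy_machine "$machine" &   # hypothetical per-machine reset + reboot
    done
    wait   # collect results only after all resets are in flight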

Steps To Reproduce

With a large cluster, trigger a scale down or a delete request.

What browsers are you seeing the problem on?

No response

Anything else?

No response

Unix4ever commented 10 months ago

We're going to address this in v0.25 or v0.26. In the meantime it's possible to change that behavior, but it's not exposed in the cluster templates or the UI, so the way to do it is:

  1. List all worker machine sets:
    omnictl get machinesets -l omni.sidero.dev/cluster=talos-default,omni.sidero.dev/role-worker -o yaml > machinesets.yaml

The machinesets.yaml file will contain something like:

metadata:
    namespace: default
    type: MachineSets.omni.sidero.dev
    id: talos-default-workers
    version: 8
    owner:
    phase: running
    created: 2023-12-02T14:16:21Z
    updated: 2023-12-12T23:16:27Z
    labels:
        omni.sidero.dev/cluster: talos-default
        omni.sidero.dev/role-worker:
    finalizers:
        - MachineSetController
        - MachineSetStatusController
spec:
    updatestrategy: 1
    machineclass: null
    bootstrapspec: null

  2. For each machine set, change updatestrategy to 0:
metadata:
    namespace: default
    type: MachineSets.omni.sidero.dev
    id: talos-default-workers
    version: 8
    owner:
    phase: running
    created: 2023-12-02T14:16:21Z
    updated: 2023-12-12T23:16:27Z
    labels:
        omni.sidero.dev/cluster: talos-default
        omni.sidero.dev/role-worker:
    finalizers:
        - MachineSetController
        - MachineSetStatusController
spec:
    updatestrategy: 0 # <-- this is the change
    machineclass: null
    bootstrapspec: null

  3. Run omnictl apply -f machinesets.yaml

After that, all config changes and scale-down operations will happen in parallel on all nodes.

Note: I wouldn't recommend changing the control plane machine set's updatestrategy like that, as it will break scale down there.
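
If you want to script those steps, something like this should work (a rough sketch; it assumes yq v4 is installed, which is not part of omnictl, and it only touches worker machine sets thanks to the role label in the selector):

    # 1. Dump all worker machine sets for the cluster.
    omnictl get machinesets \
        -l omni.sidero.dev/cluster=talos-default,omni.sidero.dev/role-worker \
        -o yaml > machinesets.yaml
    # 2. Set updatestrategy to 0 in every YAML document in the file
    #    (yq v4 applies the expression to each document).
    yq -i '.spec.updatestrategy = 0' machinesets.yaml
    # 3. Apply the modified machine sets back.
    omnictl apply -f machinesets.yaml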

alexandrem commented 10 months ago

@Unix4ever Thanks for the tip, we will try this.

Can you confirm whether this also resolves the potential issue I outlined: if a machine doesn't connect back to Omni after a reboot, will that block the cluster operation in progress (scale down or deletion)?

Unix4ever commented 10 months ago

If the Machine doesn't connect back, then Omni won't be able to reliably detect that it was reset.

If the machine fails to connect and it's confirmed to be a hardware failure, then it should be removed from Omni. It will unblock deletion.
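
To see which machines have reconnected while a tear-down is in progress, you can inspect the machine status resources, e.g. (a sketch; the exact resource alias and field names may differ between Omni versions, so verify against your own omnictl output):

    # List connection state per machine for the cluster; "connected" reflects
    # the SideroLink connection as reported in each MachineStatus resource.
    omnictl get machinestatus \
        -l omni.sidero.dev/cluster=talos-default \
        -o yaml | grep -E 'id:|connected:'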

alexandrem commented 10 months ago

I confirm that setting updatestrategy to 0 solves the scale-down slowness.

I propose opening a separate issue to track the problem of machines not connecting back to Omni, which can halt further progress of scale-down and delete operations.