We're going to address that in v0.25 or v0.26. It's already possible to change that behavior today, but it's not exposed in the cluster templates or the UI, so the way to do it is:
```bash
omnictl get machinesets -l omni.sidero.dev/cluster=talos-default,omni.sidero.dev/role-worker -o yaml > machinesets.yaml
```
The machinesets.yaml file will contain something like:
```yaml
metadata:
  namespace: default
  type: MachineSets.omni.sidero.dev
  id: talos-default-workers
  version: 8
  owner:
  phase: running
  created: 2023-12-02T14:16:21Z
  updated: 2023-12-12T23:16:27Z
  labels:
    omni.sidero.dev/cluster: talos-default
    omni.sidero.dev/role-worker:
  finalizers:
    - MachineSetController
    - MachineSetStatusController
spec:
  updatestrategy: 1
  machineclass: null
  bootstrapspec: null
```
Set `updatestrategy` to 0:

```yaml
metadata:
  namespace: default
  type: MachineSets.omni.sidero.dev
  id: talos-default-workers
  version: 8
  owner:
  phase: running
  created: 2023-12-02T14:16:21Z
  updated: 2023-12-12T23:16:27Z
  labels:
    omni.sidero.dev/cluster: talos-default
    omni.sidero.dev/role-worker:
  finalizers:
    - MachineSetController
    - MachineSetStatusController
spec:
  updatestrategy: 0 # <-- this is the change
  machineclass: null
  bootstrapspec: null
```
Then apply it back:

```bash
omnictl apply -f machinesets.yaml
```
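To double-check that the new value landed, you can re-run the same `get` command from above (a minimal sketch; it reuses the same label selector, so you don't need to know the exact machine set ID):

```bash
# Re-fetch the worker machine sets and confirm updatestrategy is now 0
omnictl get machinesets -l omni.sidero.dev/cluster=talos-default,omni.sidero.dev/role-worker -o yaml | grep updatestrategy
```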
After that, all config changes and scale-down operations will happen in parallel on all nodes.
Note: I wouldn't recommend changing the control plane machine set's `updatestrategy` like that, as it will break scale down there.
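If you want to be sure you're not touching the control plane, you can list its machine set separately first; this assumes the control plane machine set carries an `omni.sidero.dev/role-controlplane` label analogous to the worker one:

```bash
# List control plane machine sets only, so you know which IDs to leave alone
omnictl get machinesets -l omni.sidero.dev/cluster=talos-default,omni.sidero.dev/role-controlplane -o yaml
```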
@Unix4ever Thanks for the tip, we will try this.
Can you confirm whether this resolves the potential issue I outlined: if a machine doesn't connect back to Omni after a reboot, will that block the in-progress cluster operation (scale down or deletion)?
If the machine doesn't connect back, then Omni won't be able to reliably detect that it was reset.
If the machine fails to connect and it's confirmed to be a hardware failure, then it should be removed from Omni; that will unblock the deletion.
I confirm that setting `updatestrategy` to 0 solves the scale-down slowness.
I propose opening a separate issue to track the problem of machines not connecting back to Omni, which can halt further progress of scale-down and delete operations.
### Is there an existing issue for this?

### Current Behavior
It seems that during scale-down or cluster-deletion requests, Omni proceeds through the machines one by one in a synchronous manner. During each machine's "Destroying" phase, a reset and reboot is requested on the target machine, and Omni apparently waits for the machine to complete its reboot and reconnect to Omni in maintenance mode before proceeding with the next machine.
This is inefficient and is problematic for large clusters, especially those with physical machines where boot time is longer.
We had a cluster earlier with 100+ machines and it's been many hours since we requested a scale down.
We expect to have clusters with 200+ machines, so I'm afraid this will be a show stopper for us.
I'm also concerned that if there's a hardware failure where the machine is unable to connect back to Omni after reboot, then this will block the cluster update/deletion operation entirely.
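One way to observe this from the CLI (a rough sketch; the `clustermachinestatuses` resource name is assumed from Omni's resource naming and may differ by version):

```bash
# List per-machine statuses for the cluster; re-running this during a
# scale down shows machines entering the Destroying phase one at a time
omnictl get clustermachinestatuses -l omni.sidero.dev/cluster=talos-default
```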
### Expected Behavior

Destroy operations should run concurrently, in an asynchronous manner.
### Steps To Reproduce
With a large cluster, trigger a scale down or a delete request.
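For example (a hedged sketch using omnictl's cluster template workflow; the template file name and cluster name are illustrative):

```bash
# Reduce the worker count in the cluster template, then sync it to
# trigger a scale down
omnictl cluster template sync -f cluster-template.yaml

# Or delete the whole cluster
omnictl cluster delete talos-default
```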
### What browsers are you seeing the problem on?
No response
### Anything else?
No response