bernardgut opened 3 months ago
For this I think we need to wait for Talos 1.8. If we reset the whole system disk, it will confuse the Omni etcd audit; that's the main reason we haven't enabled EPHEMERAL partition reset yet.
1.8 will allow us to partially reset EPHEMERAL without touching etcd.
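To make the distinction concrete, a hedged sketch using only today's CLI (the 1.8 partial-reset interface isn't shown because it isn't defined yet; `<node-ip>` is a placeholder):

```bash
# Default reset: wipes the system disk partitions (STATE and EPHEMERAL),
# which is what trips up the Omni etcd audit
talosctl reset --nodes <node-ip> --graceful=false --reboot

# EPHEMERAL-only reset: keeps STATE, but still wipes all of /var, and on
# control-plane nodes that includes /var/lib/etcd, which is the reason we
# haven't enabled it yet
talosctl reset --nodes <node-ip> --system-labels-to-wipe EPHEMERAL --graceful=false --reboot
```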
Hi @Unix4ever,
Can you please expose `talosctl node reset` to Omni-provisioned clusters in 1.8? Right now the CLI returns `PermissionDenied`.
Debugging notwithstanding, I think that would be the most consistent way to perform fail-over quickly in production when node-related issues arise, particularly those due to node-state decay over time.
BTW I did not check the code for `talosctl node reset`, but I assume it should do something like the sketch below:
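Rough, untested sketch of what I have in mind (assuming the existing `talosctl reset` flags are the building blocks; `<node-name>` and `<node-ip>` are placeholders):

```bash
# 1. Drain the node so its workloads get rescheduled elsewhere
#    (this may hang if the node is already unresponsive)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# 2. Wipe the node's EPHEMERAL partition and reboot it, keeping the STATE
#    partition (machine config) intact
talosctl reset --nodes <node-ip> \
  --system-labels-to-wipe EPHEMERAL \
  --graceful=false --reboot

# 3. Once the node is back and Ready, allow scheduling on it again
kubectl uncordon <node-name>
```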
If this is not what `talosctl node reset` does, there should be a CLI command that does the above, IMO. Otherwise fail-over is a true PITA, which doesn't make sense for an immutable OS.
Thanks B./
We can expose reset as soon as it can do a partial reset without touching etcd state.
Partial reset is planned for 1.8.
I guess we can also experiment with how Omni handles a full EPHEMERAL partition reset.
Is there an existing issue for this?
Current Behavior
It seems you can destroy an Omni-provisioned Talos cluster when a single node ends up in the `diskpressure=true` state:

1. A single node goes into the `diskpressure=true` state.
2. `talosctl list ...` allows you to see the data that wasn't garbage-collected and that you have to delete, but there is no option to delete it.
3. `node-shell` or any equivalent container is unavailable because the node is unschedulable.
4. None of the available fail-over strategies work:
   - `talosctl node reset` returns `PermissionDenied`.
   - `failed to update: resource MachineSetNodes.omni.sidero.dev(default/<MACHINEID>@2) is owned by "MachineSetNodeController"`.
   - `omnictl delete machinesetnodes.omni.sidero.dev <MACHINEID>` returns `failed to update: resource MachineSetNodes.omni.sidero.dev(default/<MACHINEID>@2) is owned by "MachineSetNodeController"`.
   - The node shows as `unknown` in the Omni "cluster" menu and the machine goes into the `streaming success` loop as described in #180.
   - The node shows as `unknown` in the Omni "cluster" menu, doesn't rejoin in the Omni "machines" menu, the machine goes into `{component: controller-runtime, controller: siderolink.ManagerController, error: error provisioning: rpc error: code = Unknown desc = resource Links.omni.sideo.dev(default/MACHINEID) is not in phase running}`, and never rejoins the Omni instance.

You are now stuck with a 2-node cluster and a node that cannot rejoin Omni until you delete the cluster.

Expected Behavior
Any/all of the strategies described in 4. above should work and allow for an easy fail-over from `nodePressure` issues on a single node in an Omni-provisioned cluster.

Steps To Reproduce
See above
What browsers are you seeing the problem on?
No response
Anything else?
Talos 1.7.0, Omni 0.34, Kubernetes 1.29.3