siderolabs / omni

SaaS-simple deployment of Kubernetes - on your own hardware.

[bug] Failover from node conditions seemingly not possible. Have to delete the cluster #189

Open bernardgut opened 3 months ago

bernardgut commented 3 months ago

Is there an existing issue for this?

Current Behavior

It seems a single node in the DiskPressure=true state can leave an Omni-provisioned Talos cluster in a state you can only get out of by destroying the cluster:

  1. create a cluster
  2. deploy openebs-localpv (right now there is a bug where the provisioner fails to delete the pv when the disk is under pressure)
  3. wait for the node to become unschedulable (DiskPressure=true)
  4. Now try to recover from that "easily" (a command-level sketch of these attempts follows this list):
    • talosctl list ... allows you to see the data that wasn't garbage-collected and has to be deleted, but there is no option to delete it
    • node-shell or any equivalent container is unavailable because the node is unschedulable
    • talosctl node reset returns PermissionDenied
    • Deleting the node from the GUI cluster menu gives you failed to update: resource MachineSetNodes.omni.sidero.dev(default/<MACHINEID>@2) is owned by "MachineSetNodeController"
    • omnictl delete machinesetnodes.omni.sidero.dev <MACHINEID> returns the same failed to update: resource MachineSetNodes.omni.sidero.dev(default/<MACHINEID>@2) is owned by "MachineSetNodeController"
    • Resetting the machine from the ISO puts both the cluster and the machine in an inconsistent state: the machine shows status unknown in the Omni "cluster" menu and goes into the streaming success loop described in #180
    • Deleting the machine from Omni and then resetting it from the ISO puts the machine in an inconsistent state: it shows status unknown in the Omni "cluster" menu, never reappears in the Omni "machines" menu, and loops on {component: controller-runtime, controller: siderolink.ManagerController, error: error provisioning : rpc error: code = Unknown desc = resource Links.omni.sidero.dev(default/MACHINEID) is not in phase running}, so it never rejoins the Omni instance. You are now stuck with a 2-node cluster and a node that cannot rejoin Omni until you delete the cluster.
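A rough command-level sketch of the recovery attempts above (the node name/IP, machine ID, and OpenEBS data path are placeholders/assumptions, not values from this cluster):

```sh
# The node reports DiskPressure and is no longer schedulable:
kubectl describe node <NODE-NAME> | grep -A1 DiskPressure

# talosctl can list the data that was never cleaned up (path assumes the
# openebs-localpv default basepath), but offers no way to delete it:
talosctl list /var/openebs/local --nodes <NODE-IP>

# Resetting the node is rejected on an Omni-provisioned cluster:
talosctl reset --nodes <NODE-IP> --graceful=false --reboot
# -> PermissionDenied

# Deleting the MachineSetNode directly is rejected as well:
omnictl delete machinesetnodes.omni.sidero.dev <MACHINE-ID>
# -> failed to update: resource ... is owned by "MachineSetNodeController"
```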

Expected Behavior

Any/all of the strategies described in point 4 above should work and allow for an easy failover from node-pressure issues on a single node in an Omni-provisioned cluster.

Steps To Reproduce

See above

What browsers are you seeing the problem on?

No response

Anything else?

Talos 1.7.0, Omni 0.34, Kubernetes 1.29.3

Unix4ever commented 2 months ago

For this I think we need to wait for Talos 1.8. If we reset the whole system disk, it confuses Omni's etcd audit. That's the main reason we haven't enabled EPHEMERAL partition reset yet.

1.8 will allow us to partially reset EPHEMERAL without touching etcd.
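For context, the etcd member's data directory lives on the EPHEMERAL partition (mounted at /var), which is why a full wipe of that partition disturbs the etcd audit. A quick way to inspect this on a control-plane node (the address is a placeholder):

```sh
# etcd data sits under /var/lib/etcd, i.e. on the EPHEMERAL partition,
# so wiping the whole partition also wipes this member's etcd state.
talosctl list /var/lib/etcd --nodes <CONTROL-PLANE-IP>

# Current etcd membership as Talos sees it:
talosctl etcd members --nodes <CONTROL-PLANE-IP>
```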

bernardgut commented 2 months ago

Hi @Unix4ever

Can you please expose talosctl node reset to Omni-provisioned clusters in 1.8? Right now the CLI returns PermissionDenied.

Debugging notwithstanding, I think that would be the most consistent way to perform a quick failover in production when any kind of node-related issue arises, particularly issues caused by node-state decay over time.

BTW I did not check the code for talosctl node reset, but I assume it should do something like the sketch below:
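A minimal sketch of the flow I have in mind (the exact steps are an assumption, not necessarily what talosctl does today; node name/IP are placeholders):

```sh
# 1. Stop scheduling onto the affected node and evict what can be evicted.
kubectl cordon <NODE-NAME>
kubectl drain <NODE-NAME> --ignore-daemonsets --delete-emptydir-data

# 2. Wipe the node's EPHEMERAL data and reboot it, keeping the machine config,
#    so the node comes back clean and rejoins the cluster.
talosctl reset --nodes <NODE-IP> --system-labels-to-wipe EPHEMERAL --graceful=false --reboot

# 3. Once the node is back and Ready, allow scheduling again.
kubectl uncordon <NODE-NAME>
```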

If this is not what talosctl node reset does, there should be a CLI command that does the above, IMO. Otherwise failover is a true PITA... which doesn't make sense for an immutable OS.

Thanks B./

Unix4ever commented 2 months ago

We can expose reset as soon as it can do a partial reset without touching etcd state.

Partial reset is planned for 1.8.

I guess we can also do an experiment with how Omni handles full Ephemeral partition reset.