kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

Audit KCP codebase for re-entrancy & error handling of non-key space operations #11184

Open fabriziopandini opened 1 month ago

fabriziopandini commented 1 month ago

There were a few interesting threads about error management for etcd's non-key space operations.

As a first reaction, I think we are generally OK in KCP, because errors reported by etcd are usually handled by re-entrancy, which implies that we re-assess the current state of the world before deciding on a course of action.

But this is also a good chance to audit the code base for when we use non-key space operations, mostly remove member and forward leadership.
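
To illustrate the re-entrancy idea for a member removal, here is a minimal sketch. It is not the actual KCP code: `Member`, `EtcdClient`, `ReconcileRemoveMember`, and `fakeEtcd` are hypothetical names introduced only for this example. The assumption is that every pass re-lists the members before acting, so an ambiguous error (for example, a timeout on a removal that was in fact applied) is resolved by the next pass rather than by interpreting the error.

```go
package main

import (
	"errors"
	"fmt"
)

// Member is a hypothetical, simplified view of an etcd member.
type Member struct {
	ID   uint64
	Name string
}

// EtcdClient is a hypothetical interface covering only the non-key space
// operations used in this sketch; it is not the KCP or etcd client API.
type EtcdClient interface {
	Members() ([]Member, error)
	RemoveMember(id uint64) error
}

// ReconcileRemoveMember makes one re-entrant attempt to remove the member
// called name. It always lists the members first, so a retry after an
// ambiguous error observes the real state instead of trusting the
// result of the previous attempt.
func ReconcileRemoveMember(c EtcdClient, name string) (done bool, err error) {
	members, err := c.Members()
	if err != nil {
		return false, fmt.Errorf("listing etcd members: %w", err)
	}
	for _, m := range members {
		if m.Name != name {
			continue
		}
		if err := c.RemoveMember(m.ID); err != nil {
			// Do not guess whether the removal took effect: the caller
			// requeues, and the next pass re-lists the members.
			return false, fmt.Errorf("removing etcd member %q: %w", name, err)
		}
		return true, nil
	}
	// The member is already gone, so there is nothing left to do.
	return true, nil
}

// fakeEtcd simulates a removal that is applied but reported as failed once.
type fakeEtcd struct {
	members  []Member
	failOnce bool
}

func (f *fakeEtcd) Members() ([]Member, error) {
	return append([]Member(nil), f.members...), nil
}

func (f *fakeEtcd) RemoveMember(id uint64) error {
	var kept []Member
	for _, m := range f.members {
		if m.ID != id {
			kept = append(kept, m)
		}
	}
	f.members = kept
	if f.failOnce {
		f.failOnce = false
		return errors.New("request timed out")
	}
	return nil
}
```

With the fake client, the first pass surfaces the simulated timeout and the second pass finds the member already gone and reports completion, without needing to classify the original error.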

NOTE: add member/join is a slightly different case, because we rely on kubeadm for it.

PS. I classified this as a bug because I didn't know exactly which kind to use 😅, but to be clear, we are not aware of bugs in this area, and this issue is to double check that our codebase is robust enough to handle the edge cases described in the comment above.

sbueringer commented 1 month ago

Stupid question, non-key space operations are all operations that don't read/write a key/data?

ahrtr commented 1 month ago

> non-key space operations are all operations that don't read/write a key/data?

YES.

fabriziopandini commented 1 month ago

Note: also look at how we handle errors in case kubeadm join fails and there is a member that has not started (without a name) sticking around
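
As a rough sketch of how such leftover members could be detected: a member that was added but never started is listed without a name, so filtering on an empty name is enough to find candidates. `EtcdMember` and `UnstartedMembers` are hypothetical names for this example, not existing KCP helpers, and what to do with the result (wait, retry the join, or remove the member) is deliberately left open.

```go
package main

// EtcdMember is a hypothetical, simplified member record for this sketch.
type EtcdMember struct {
	ID       uint64
	Name     string
	PeerURLs []string
}

// UnstartedMembers returns the members that have no name yet, which is how
// a member that was added but never started shows up in the member list.
func UnstartedMembers(members []EtcdMember) []EtcdMember {
	var out []EtcdMember
	for _, m := range members {
		if m.Name == "" {
			out = append(out, m)
		}
	}
	return out
}
```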