0xdiba opened 1 year ago
Chain of events I saw in the Pod YAML definition:

To complete the picture, if we could track the Node Kubernetes resource we would see a couple of things (in time order):
```yaml
metadata:
  annotations:
    gke-current-operation: |
      operation_type: UPGRADE_NODES
      lock_end_timestamp: 1682524463
      operation_name: "operation-1682524263257-ff6ecdb6-ffc7-40b8-bd13-f6afc3d17569"
spec:
  taints:
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    timeAdded: "2023-04-26T15:53:34Z"
  unschedulable: true
```
```yaml
metadata:
  labels:
    operation.gke.io/type: drain
spec:
  taints:
  - effect: NoSchedule
    key: node.kubernetes.io/unreachable
    timeAdded: "2023-04-26T15:54:17Z"
```
We could create a special reconciler that gets the Cluster object, searches for Pods in the Pending state, looks up the PersistentVolumeClaim attached to each Pending Pod, finds the PersistentVolume bound to that PVC, and checks the node affinity within the PersistentVolume. If the PersistentVolume uses hostPath or local storage and the node referenced by its node affinity is gone, then the PersistentVolumeClaim can be removed along with the Pod in the Pending state.
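A minimal sketch of that decision chain, using simplified stand-in structs rather than the real client-go/controller-runtime types (all type and field names below are illustrative, not the actual operator API):

```go
package main

import "fmt"

// Pod, PVC, and PV are simplified stand-ins for the corresponding
// Kubernetes objects; a real reconciler would fetch the typed objects
// through the controller-runtime client.
type Pod struct {
	Name    string
	Phase   string // e.g. "Pending"
	PVCName string
}

type PVC struct {
	Name   string
	PVName string
}

type PV struct {
	Name         string
	Source       string // "hostPath", "local", ...
	AffinityNode string // node name pinned via the PV's nodeAffinity
}

// shouldDeletePVC implements the chain described above: a Pending pod's
// PVC is safe to delete only if its PV uses hostPath or local storage
// and the node it is pinned to no longer exists in the cluster.
func shouldDeletePVC(pod Pod, pv PV, existingNodes map[string]bool) bool {
	if pod.Phase != "Pending" {
		return false
	}
	if pv.Source != "hostPath" && pv.Source != "local" {
		return false
	}
	return !existingNodes[pv.AffinityNode]
}

func main() {
	// node-a was removed by the upgrade; only node-b remains.
	nodes := map[string]bool{"node-b": true}
	pod := Pod{Name: "redpanda-0", Phase: "Pending", PVCName: "datadir-redpanda-0"}
	pv := PV{Name: "pv-1", Source: "local", AffinityNode: "node-a"}
	fmt.Println(shouldDeletePVC(pod, pv, nodes)) // true: node-a is gone
}
```

The same checks would run on every reconcile of the Cluster object, so a node disappearing mid-upgrade eventually unblocks its Pending pods.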
In talking with @tdewitt, I'm thinking we should add an annotation to the Cluster resource to make the operator work in an unsafe way. By unsafe, I mean that once a pod is deleted and the node its PV is assigned to has `unschedulable: true`, we delete the PVC and allow the pod to be rescheduled elsewhere.
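If we go the annotation route, it might look something like the following on the Cluster CR. The annotation key here is a made-up placeholder, not an existing operator API:

```yaml
apiVersion: redpanda.vectorized.io/v1alpha1
kind: Cluster
metadata:
  name: my-cluster
  annotations:
    # Hypothetical opt-in for the "unsafe" behavior described above:
    # allow the operator to delete PVCs pinned to unschedulable nodes.
    operator.redpanda.com/unsafe-pvc-deletion: "true"
```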
Any update here?
I'm working on a new reconciler dedicated to GKE node pool upgrades.
I'm pausing this effort. My current progress can be found at https://github.com/RafalKorepta/redpanda/commit/6d95c4d99e7bc65944ab08ee6cfdacd2855ad4cd
Closed by #12167
During a standard Kubernetes upgrade, the process is to drain nodes and wait for pods to move to newly created node groups. During that operation, the old nodes are drained and cordoned and end up with a `NoSchedule` taint on them (so they can be safely removed). Going through that operation, RP pods are correctly drained from the old node but fail to get scheduled on a new node due to their PVCs' affinity to the old node (which is already marked as unschedulable). In the operator, we watch for the `NoExecute` taint to delete PVCs and allow the pods to get scheduled on new nodes, which does not cover the case of unschedulable nodes. To allow seamless node rollouts, PVCs need to be deletable so pods can get rescheduled. We need to watch for `NoSchedule` taints on the nodes, either always (as we do with `NoExecute`) or, to be safer, only when the cluster is in a planned maintenance mode (e.g. by specifically annotating the `Cluster` CR).

JIRA Link: CORE-2325
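The taint check described above could be sketched roughly as follows. The struct and function names are simplified stand-ins (the real operator would inspect `corev1.Node.Spec.Taints`), and the maintenance-mode gate is the proposed, not current, behavior:

```go
package main

import "fmt"

// Taint mirrors the relevant fields of corev1.Taint.
type Taint struct {
	Key    string
	Effect string // "NoSchedule" or "NoExecute"
}

// Node mirrors the fields we need from corev1.Node.
type Node struct {
	Name          string
	Unschedulable bool
	Taints        []Taint
}

// pvcDeletionAllowed reports whether PVCs pinned to this node may be
// deleted. Today the operator only reacts to NoExecute; the proposal
// is to also react to NoSchedule on cordoned nodes, optionally gated
// behind a maintenance-mode annotation on the Cluster CR.
func pvcDeletionAllowed(n Node, maintenanceMode bool) bool {
	for _, t := range n.Taints {
		if t.Effect == "NoExecute" {
			return true // existing behavior
		}
		if t.Effect == "NoSchedule" && n.Unschedulable && maintenanceMode {
			return true // proposed behavior for drained/cordoned nodes
		}
	}
	return false
}

func main() {
	drained := Node{
		Name:          "gke-old-pool-node",
		Unschedulable: true,
		Taints: []Taint{
			{Key: "node.kubernetes.io/unschedulable", Effect: "NoSchedule"},
		},
	}
	fmt.Println(pvcDeletionAllowed(drained, true))  // true
	fmt.Println(pvcDeletionAllowed(drained, false)) // false
}
```

Gating the `NoSchedule` path on maintenance mode keeps the always-on behavior limited to `NoExecute`, matching the "to be safer" option above.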