RP Operator: Support pods getting moved during a drain operation

0xdiba commented 1 year ago

During a standard Kubernetes upgrade, the process is to drain nodes and wait for pods to move to newly created node groups. During that operation the old nodes are cordoned and drained, and they end up with a NoSchedule taint on them (so they can be safely removed):

  taints:
    - effect: NoSchedule
      key: node.kubernetes.io/unschedulable

During that operation, RP pods are correctly drained from the old node but fail to get scheduled on a new node because of their PVCs' affinity to the old node (which is already marked as unschedulable). In the operator we watch for the NoExecute taint to delete PVCs and allow the pods to get scheduled on new nodes, which does not cover the case of unschedulable nodes. To allow seamless node rollouts, PVCs need to be deletable so that pods can be rescheduled. We need to watch for NoSchedule taints on the nodes either always (as we do with NoExecute) or, to be safer, only when the cluster is in a planned maintenance mode (e.g. by specifically annotating the Cluster CR).
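To make the two options concrete, here is a minimal sketch of the taint check, not the operator's actual code. Nodes are plain dicts shaped like the Kubernetes Node object. The function name `node_allows_pvc_cleanup` and the `maintenance_mode` / `always_watch_no_schedule` parameters are hypothetical, and treating any NoExecute taint as a trigger is an assumption about the existing behaviour.

```python
UNSCHEDULABLE_TAINT_KEY = "node.kubernetes.io/unschedulable"


def has_taint(node, effect, key=None):
    """Return True if the node carries a taint with this effect (and key, if given)."""
    taints = node.get("spec", {}).get("taints") or []
    return any(
        t.get("effect") == effect and (key is None or t.get("key") == key)
        for t in taints
    )


def node_allows_pvc_cleanup(node, maintenance_mode=False, always_watch_no_schedule=False):
    """Decide whether PVCs bound to this node may be deleted.

    Hypothetical helper. Assumption: any NoExecute taint already triggers
    cleanup today. The NoSchedule/unschedulable taint only counts when it is
    watched always, or when the cluster is in planned maintenance mode.
    """
    if has_taint(node, "NoExecute"):
        return True
    if has_taint(node, "NoSchedule", UNSCHEDULABLE_TAINT_KEY):
        return always_watch_no_schedule or maintenance_mode
    return False
```

With the taint shown at the top of this issue, the function returns False by default and True once `maintenance_mode` is set, which is the safer of the two variants.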

JIRA Link: CORE-2325

RafalKorepta commented 1 year ago

Chain of events I saw in the Pod YAML definition:

Deletion timestamp appears
Finalizer operator.redpanda.com/finalizer removed
containerStatuses for the Redpanda container changed:
- ready moved from true to false
- started moved from true to false
- status.containerStatuses.state.running changed to status.containerStatuses.state.terminated with reason Completed
pod metadata changed:
- deletionGracePeriodSeconds from 0 to 120
- new deletionTimestamp value
Now the fun part!
- annotation operator.redpanda.com/node-id removed
- metadata deletionGracePeriodSeconds removed
- metadata deletionTimestamp removed
- new metadata uid value
- kubernetes token volume changed from kube-api-access-98rdv to kube-api-access-dxnbd
- NodeName is removed. The gke-redpanda-ch4hv7ddus-redpanda-7a90-1cd7524e-2vd1 is gone
- the whole status is removed; only status.phase=Pending and status.qosClass=Guaranteed remain
The next update sets the Pod unschedulable condition, because GKE put a taint on the Node that will be deleted soon.
Finalizer operator.redpanda.com/finalizer is added
The next update is about the condition when the old node is removed

To complete the picture, if we tracked the Node Kubernetes resource we would see a couple of things (in chronological order):

metadata:
  annotations:
    gke-current-operation: |
      operation_type: UPGRADE_NODES
      lock_end_timestamp: 1682524463
      operation_name: "operation-1682524263257-ff6ecdb6-ffc7-40b8-bd13-f6afc3d17569"

spec:
  taints:
    - effect: NoSchedule
      key: node.kubernetes.io/unschedulable
      timeAdded: "2023-04-26T15:53:34Z"
  unschedulable: true

metadata:  
  labels:
    operation.gke.io/type: drain

spec:
  taints:
    - effect: NoSchedule
      key: node.kubernetes.io/unreachable
      timeAdded: "2023-04-26T15:54:17Z"

We could create a special reconciler that gets the Cluster object, searches for Pods in the Pending state, looks up the Persistent Volume Claim attached to each Pending Pod, and finds the Persistent Volume bound to that PVC. It would then check the node affinity of the Persistent Volume. If the Persistent Volume uses hostPath or local and the node named in its node affinity is gone, the Persistent Volume Claim can be removed along with the Pod in the Pending state.
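The decision part of such a reconciler could look roughly like the sketch below. It works on plain dicts shaped like the Kubernetes objects and makes no API calls. The function names are hypothetical, and it assumes the PV pins its node through a `kubernetes.io/hostname` `In` expression, which is common for local volumes but not guaranteed.

```python
def pv_node_names(pv):
    """Collect node names from the PV's required node affinity (hostname 'In' terms only)."""
    names = set()
    affinity = pv.get("spec", {}).get("nodeAffinity") or {}
    for term in (affinity.get("required") or {}).get("nodeSelectorTerms", []):
        for expr in term.get("matchExpressions", []):
            if expr.get("key") == "kubernetes.io/hostname" and expr.get("operator") == "In":
                names.update(expr.get("values", []))
    return names


def is_node_local(pv):
    """True if the PV is backed by hostPath or local storage."""
    spec = pv.get("spec", {})
    return "hostPath" in spec or "local" in spec


def plan_cleanup(pods, pvcs, pvs, live_node_names):
    """Return (pod names, PVC names) that could be deleted.

    pods: list of Pod dicts; pvcs and pvs: dicts keyed by object name;
    live_node_names: set of node names that still exist in the cluster.
    """
    pods_to_delete, pvcs_to_delete = [], []
    for pod in pods:
        if pod.get("status", {}).get("phase") != "Pending":
            continue
        stranded = []
        for volume in pod.get("spec", {}).get("volumes", []):
            claim = volume.get("persistentVolumeClaim")
            if not claim:
                continue
            pvc = pvcs.get(claim["claimName"])
            if pvc is None:
                continue
            pv = pvs.get(pvc.get("spec", {}).get("volumeName"))
            if pv is None or not is_node_local(pv):
                continue
            nodes = pv_node_names(pv)
            # Only act when the PV names a node and none of those nodes exist any more.
            if nodes and nodes.isdisjoint(live_node_names):
                stranded.append(claim["claimName"])
        if stranded:
            pods_to_delete.append(pod["metadata"]["name"])
            pvcs_to_delete.extend(stranded)
    return pods_to_delete, pvcs_to_delete
```

A real reconciler would feed this from the API server and then issue the deletes; keeping the decision as a pure function makes it easy to unit test.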

joejulian commented 1 year ago

In talking with @tdewitt, I'm thinking we should add an annotation to the Cluster resource to make the operator work in an unsafe way. By unsafe, I mean once a pod is deleted and the node its PV is assigned to is unschedulable: true, we delete the PVC and allow the pod to be rescheduled elsewhere.
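As a rough illustration of that opt-in, the check could be expressed as below. This is only a sketch over plain dicts: the annotation name `operator.redpanda.com/unsafe-pvc-deletion` and the function `may_delete_pvc` are made up for the example, and "pod is deleted" is interpreted as the pod being absent or carrying a deletionTimestamp.

```python
# Hypothetical annotation name; no real name has been agreed on in this thread.
UNSAFE_ANNOTATION = "operator.redpanda.com/unsafe-pvc-deletion"


def may_delete_pvc(cluster, pod, pv_node):
    """Allow PVC deletion only when all three conditions hold.

    1. The Cluster resource opts in through the annotation.
    2. The pod is gone (None) or is being deleted.
    3. The node the PV is assigned to has spec.unschedulable set to true.
    """
    annotations = cluster.get("metadata", {}).get("annotations") or {}
    opted_in = annotations.get(UNSAFE_ANNOTATION) == "true"
    pod_deleted = pod is None or pod.get("metadata", {}).get("deletionTimestamp") is not None
    node_unschedulable = pv_node.get("spec", {}).get("unschedulable") is True
    return opted_in and pod_deleted and node_unschedulable
```

Without the annotation the function always returns False, so the default behaviour of the operator stays unchanged.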

tdewitt commented 1 year ago

Any update here?

RafalKorepta commented 1 year ago

I'm working on a new reconciler dedicated to GKE node pool upgrades.

RafalKorepta commented 1 year ago

I'm pausing this effort. My current progress can be found at https://github.com/RafalKorepta/redpanda/commit/6d95c4d99e7bc65944ab08ee6cfdacd2855ad4cd

joejulian commented 1 year ago

closed by #12167

redpanda-data / redpanda

RP Operator: Support pods getting moved during a drain operation #10409