redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

RP Operator: Support pods getting moved during a drain operation #10409

Open 0xdiba opened 1 year ago

0xdiba commented 1 year ago

During a standard Kubernetes upgrade, the process is to drain nodes and wait for pods to move to newly created node groups. During that operation, old nodes are drained and cordoned, and they end up with a NoSchedule taint (so they can be safely removed):

  taints:
    - effect: NoSchedule
      key: node.kubernetes.io/unschedulable

During that operation, RP pods are correctly drained from the old node but fail to get scheduled on a new node because of their PVCs' affinity to the old node (which is already marked as unschedulable). The operator currently watches for the NoExecute taint to delete PVCs and allow the pods to be scheduled on new nodes, which does not cover unschedulable nodes. To allow seamless node rollouts, the PVCs need to be deleted so that the pods can be rescheduled. We need to watch for NoSchedule taints on the nodes either always (as we do with NoExecute) or, to be safer, only when the cluster is in a planned maintenance mode (e.g. by specifically annotating the Cluster CR).
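The second, safer option could look roughly like the predicate below. This is only a sketch over plain dictionaries shaped like Node and Cluster manifests, not the operator's actual code. The annotation key `example.redpanda.com/planned-maintenance` and all function names are hypothetical, since no annotation name has been decided here.

```python
from typing import Any, Dict, Optional

NO_EXECUTE = "NoExecute"
NO_SCHEDULE = "NoSchedule"
UNSCHEDULABLE_KEY = "node.kubernetes.io/unschedulable"
# Hypothetical annotation key, chosen only for this illustration.
MAINTENANCE_ANNOTATION = "example.redpanda.com/planned-maintenance"


def has_taint(node: Dict[str, Any], effect: str, key: Optional[str] = None) -> bool:
    """Return True if the node carries a taint with this effect (and key, if given)."""
    taints = node.get("spec", {}).get("taints") or []
    return any(
        t.get("effect") == effect and (key is None or t.get("key") == key)
        for t in taints
    )


def in_planned_maintenance(cluster: Dict[str, Any]) -> bool:
    """Return True if the Cluster CR carries the (hypothetical) maintenance annotation."""
    annotations = cluster.get("metadata", {}).get("annotations") or {}
    return annotations.get(MAINTENANCE_ANNOTATION) == "true"


def should_release_pvcs(node: Dict[str, Any], cluster: Dict[str, Any]) -> bool:
    """Decide whether PVCs pinned to this node may be deleted."""
    # Existing behaviour described above: NoExecute always qualifies.
    if has_taint(node, NO_EXECUTE):
        return True
    # Proposed behaviour: NoSchedule/unschedulable qualifies only during
    # an explicitly announced maintenance window.
    return in_planned_maintenance(cluster) and has_taint(
        node, NO_SCHEDULE, UNSCHEDULABLE_KEY
    )
```

Gating on the annotation keeps a plain `kubectl cordon` from triggering PVC deletion by accident. Dropping the `in_planned_maintenance` check would give the "always" variant.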

JIRA Link: CORE-2325

RafalKorepta commented 1 year ago

Chain of events I saw in the Pod YAML definition:

To complete the picture: if we could track the Node Kubernetes resource, we would see a couple of things (in time order):

metadata:
  annotations:
    gke-current-operation: |
      operation_type: UPGRADE_NODES
      lock_end_timestamp: 1682524463
      operation_name: "operation-1682524263257-ff6ecdb6-ffc7-40b8-bd13-f6afc3d17569"
spec:
  taints:
    - effect: NoSchedule
      key: node.kubernetes.io/unschedulable
      timeAdded: "2023-04-26T15:53:34Z"
  unschedulable: true

metadata:
  labels:
    operation.gke.io/type: drain
spec:
  taints:
    - effect: NoSchedule
      key: node.kubernetes.io/unreachable
      timeAdded: "2023-04-26T15:54:17Z"

We could create a special reconciler that gets the Cluster object, searches for Pods in the Pending state, looks up the Persistent Volume Claim attached to each Pending Pod, and finds the Persistent Volume bound to that PVC. It then checks the node affinity of the Persistent Volume. If the Persistent Volume uses hostPath or local storage and the node named in its node affinity is gone, the Persistent Volume Claim can be removed along with the Pod in the Pending state.
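The selection steps above could be sketched as follows. This is a simplified model with hypothetical dataclasses standing in for the real Kubernetes objects (no client library is used), and "gone" is interpreted here as "none of the nodes in the PV's node affinity still exist", which is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Volume sources that pin data to one specific node.
NODE_BOUND_SOURCES = {"hostPath", "local"}


@dataclass
class Volume:
    """Simplified stand-in for a PersistentVolume."""
    name: str
    source: str  # e.g. "local", "hostPath", "csi"
    affinity_nodes: Set[str] = field(default_factory=set)


@dataclass
class Claim:
    """Simplified stand-in for a PersistentVolumeClaim."""
    name: str
    volume_name: str  # name of the bound Volume


@dataclass
class Pod:
    name: str
    phase: str  # e.g. "Pending", "Running"
    claim_names: List[str] = field(default_factory=list)


def stale_claims(
    pods: List[Pod],
    claims: Dict[str, Claim],
    volumes: Dict[str, Volume],
    live_nodes: Set[str],
) -> List[Tuple[str, str]]:
    """Return (pod name, claim name) pairs that are safe to delete together."""
    result = []
    for pod in pods:
        if pod.phase != "Pending":
            continue  # only stuck Pods are of interest
        for claim_name in pod.claim_names:
            claim = claims.get(claim_name)
            volume = volumes.get(claim.volume_name) if claim else None
            if volume is None or volume.source not in NODE_BOUND_SOURCES:
                continue  # unbound, or storage that can follow the Pod
            # The volume is pinned to nodes that no longer exist.
            if volume.affinity_nodes and volume.affinity_nodes.isdisjoint(live_nodes):
                result.append((pod.name, claim.name))
    return result
```

A real reconciler would then delete each returned claim and its Pod so that the StatefulSet recreates both on a schedulable node.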

joejulian commented 1 year ago

In talking with @tdewitt, I'm thinking we should add an annotation to the Cluster resource that makes the operator work in an unsafe way. By unsafe, I mean that once a pod is deleted and the node its PV is assigned to has unschedulable: true, we delete the PVC and allow the pod to be rescheduled elsewhere.

tdewitt commented 1 year ago

Any update here?

RafalKorepta commented 1 year ago

I'm working on a new reconciler dedicated to GKE node pool upgrades.

RafalKorepta commented 1 year ago

I'm pausing this effort. My current progress can be found at https://github.com/RafalKorepta/redpanda/commit/6d95c4d99e7bc65944ab08ee6cfdacd2855ad4cd

joejulian commented 1 year ago

closed by #12167