nokia / danm-utils

BSD 3-Clause "New" or "Revised" License

IP address for statefulset pod is released by mistake #4

Closed: clivez closed this issue 4 years ago

clivez commented 4 years ago

I have a StatefulSet with 5 replicas.
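
For reference, the StatefulSet itself is nothing special; it looks roughly like this (trimmed sketch, the image, port and danm annotation are the same ones that show up in the full Pod dump further below):

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ss-cinfo
spec:
  serviceName: ss-cinfo-service
  replicas: 5
  selector:
    matchLabels:
      app: ss-cinfo
  template:
    metadata:
      labels:
        app: ss-cinfo
      annotations:
        danm.k8s.io/interfaces: |
          [
            {
              "network":"test-net1",
              "ip": "dynamic"
            }
          ]
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: ss-cinfo
        image: bcmt-registry:5000/cinfo:1.0
        ports:
        - containerPort: 80
          name: web
EOF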

[root@clive-alp-control-01 case]# kubectl get po -o wide
NAME         READY   STATUS    RESTARTS   AGE   IP            NODE                  NOMINATED NODE   READINESS GATES
ss-cinfo-0   1/1     Running   0          35s   10.10.10.10   clive-alp-worker-03   <none>           <none>
ss-cinfo-1   1/1     Running   0          33s   10.10.10.11   clive-alp-worker-02   <none>           <none>
ss-cinfo-2   1/1     Running   0          31s   10.10.10.12   clive-alp-worker-01   <none>           <none>
ss-cinfo-3   1/1     Running   0          29s   10.10.10.13   clive-alp-worker-01   <none>           <none>
ss-cinfo-4   1/1     Running   0          27s   10.10.10.14   clive-alp-worker-03   <none>           <none>

Then the node 'clive-alp-worker-03' is shut down.

[root@clive-alp-control-01 case]# kubectl get po -o wide
NAME         READY   STATUS        RESTARTS   AGE     IP            NODE                  NOMINATED NODE   READINESS GATES
ss-cinfo-0   1/1     Terminating   0          5m39s   10.10.10.10   clive-alp-worker-03   <none>           <none>
ss-cinfo-1   1/1     Running       0          5m37s   10.10.10.11   clive-alp-worker-02   <none>           <none>
ss-cinfo-2   1/1     Running       0          5m35s   10.10.10.12   clive-alp-worker-01   <none>           <none>
ss-cinfo-3   1/1     Running       0          5m33s   10.10.10.13   clive-alp-worker-01   <none>           <none>
ss-cinfo-4   1/1     Terminating   0          5m31s   10.10.10.14   clive-alp-worker-03   <none>           <none>

Pods ss-cinfo-0 and ss-cinfo-4 are then deleted forcibly. After the pods are recreated, the danmep for the new ss-cinfo-0 is deleted by mistake and its IP is released as well.

[root@clive-alp-control-01 case]# kubectl get po -o wide
NAME         READY   STATUS    RESTARTS   AGE     IP            NODE                  NOMINATED NODE   READINESS GATES
ss-cinfo-0   1/1     Running   0          78s     10.10.10.15   clive-alp-worker-02   <none>           <none>
ss-cinfo-1   1/1     Running   0          8m14s   10.10.10.11   clive-alp-worker-02   <none>           <none>
ss-cinfo-2   1/1     Running   0          8m12s   10.10.10.12   clive-alp-worker-01   <none>           <none>
ss-cinfo-3   1/1     Running   0          8m10s   10.10.10.13   clive-alp-worker-01   <none>           <none>
ss-cinfo-4   1/1     Running   0          76s     10.10.10.10   clive-alp-worker-01   <none>           <none>
[root@clive-alp-control-01 case]#
[root@clive-alp-control-01 case]# kubectl get danmnet test-net1 -o yaml | grep alloc:
    alloc: gDwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAE=
[root@clive-alp-control-01 case]#
[root@clive-alp-control-01 case]# kubectl get danmep -o yaml | grep Pod:
    Pod: ss-cinfo-4
    Pod: ss-cinfo-1
    Pod: ss-cinfo-2
    Pod: ss-cinfo-3
[root@clive-alp-control-01 case]# kubectl get danmep -o yaml | grep 10.10.10
      Address: 10.10.10.10/24
      Address: 10.10.10.11/24
      Address: 10.10.10.12/24
      Address: 10.10.10.13/24
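
As a side note, the alloc value above seems to be a base64-encoded allocation bitmap (one bit per address in the 10.10.10.0/24 range, if I read the DANM code right). Decoding it shows bits set only for offsets 10-13, plus the first and last bit of the array (presumably the reserved network and broadcast addresses), i.e. 10.10.10.14 and the new ss-cinfo-0's 10.10.10.15 are no longer marked as allocated:

echo 'gDwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAE=' | base64 -d | xxd -b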

Log from the cleaner pod:

[root@clive-alp-control-01 case]# kubectl logs -nkube-system danm-cleaner-27rmf
W0206 08:25:43.208382       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0206 08:25:43.210509       1 leaderelection.go:203] attempting to acquire leader lease  kube-system/danm-cleaner...
I0206 08:30:31.934281       1 leaderelection.go:212] successfully acquired lease kube-system/danm-cleaner
...
2020/02/14 08:54:30 INFO: Cleaner freeing IPs belonging to interface:eth0 of Pod:ss-cinfo-0
2020/02/14 08:54:30 INFO: Cleaner freeing IPs belonging to interface:eth0 of Pod:ss-cinfo-0
2020/02/14 08:54:31 INFO: Cleaner freeing IPs belonging to interface:eth0 of Pod:ss-cinfo-4

And the new ss-cinfo-0 was created at 2020-02-14T08:54:29Z:

[root@clive-alp-control-01 case]# kubectl get po ss-cinfo-0 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    danm.k8s.io/interfaces: |
      [
        {
          "network":"test-net1",
          "ip": "dynamic"
        }
      ]
    kubernetes.io/psp: privileged
    seccomp.security.alpha.kubernetes.io/pod: docker/default
  creationTimestamp: "2020-02-14T08:54:29Z"
  generateName: ss-cinfo-
  labels:
    app: ss-cinfo
    controller-revision-hash: ss-cinfo-5cb5c4f98d
    statefulset.kubernetes.io/pod-name: ss-cinfo-0
  name: ss-cinfo-0
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: ss-cinfo
    uid: fb40f200-1c79-42c8-bce6-9cde076e60e0
  resourceVersion: "70451007"
  selfLink: /api/v1/namespaces/default/pods/ss-cinfo-0
  uid: 1ba61efb-663a-4fae-8080-404a65f20426
spec:
  containers:
  - image: bcmt-registry:5000/cinfo:1.0
    imagePullPolicy: IfNotPresent
    name: ss-cinfo
    ports:
    - containerPort: 80
      name: web
      protocol: TCP
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-kc6f2
      readOnly: true
  dnsPolicy: Default
  enableServiceLinks: true
  hostname: ss-cinfo-0
  nodeName: clive-alp-worker-02
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  subdomain: ss-cinfo-service
  terminationGracePeriodSeconds: 10
  volumes:
  - name: default-token-kc6f2
    secret:
      defaultMode: 420
      secretName: default-token-kc6f2
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-02-14T08:54:29Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-02-14T08:54:31Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-02-14T08:54:31Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-02-14T08:54:29Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://88237e896af3ce86d0282bfaa0fd58308d876faf418ec208f436309c85b6cf9b
    image: bcmt-registry:5000/cinfo:1.0
    imageID: docker-pullable://bcmt-registry:5000/cinfo@sha256:2ccd8d02af74668ee9147bf50f3fd190aca2b88f1e29d380ef051732342cf0c2
    lastState: {}
    name: ss-cinfo
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: "2020-02-14T08:54:31Z"
  hostIP: 172.16.1.32
  phase: Running
  podIP: 10.10.10.15
  qosClass: BestEffort
  startTime: "2020-02-14T08:54:29Z"

Levovar commented 4 years ago

I'm not sure I understand the scenario TBH. Node restarts, Pods go to terminating, so far so good.

Cleaner cleans the old entries when the node comes back, which is expected from my perspective. A Pod is a Pod is a Pod; StatefulSet never guaranteed static address allocations. What do you mean when you say "Pods ss-cinfo-0 and ss-cinfo-4 are deleted forcibly"?

clivez commented 4 years ago

This stands for the broken-hardware scenario: when the hardware needs to be replaced it may take a very long time, and during this period the user may need to terminate the 'Terminating' pods forcibly so that new pods of the StatefulSet are created and the service comes back again. We never expected static address allocation here. The problem is that 5 pods are running but there are only 4 danmeps and 4 related IPs in the alloc, with 1 released mistakenly by the cleaner.
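
To be concrete, by "terminate forcibly" I mean something like the following, which removes the Pod objects from the API server immediately even though the node is unreachable:

kubectl delete pod ss-cinfo-0 ss-cinfo-4 --grace-period=0 --force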

Levovar commented 4 years ago

No service's high availability should depend on one instance, especially in TelCo. We definitely need to educate our users not to mess with the API in these scenarios. There is no README yet, but Cleaner is designed to work in conjunction with the normal K8s Pod termination life-cycle, so it will eventually reconcile state even without manual interaction, even in an outage scenario.

That being said, yeah, it is possible something is up with the Pod UUIDs when it comes to a StatefulSet; I will look into it to understand the scenario better!
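
For reference, the recreated Pod reuses the name ss-cinfo-0 but is a brand new object with a different metadata.uid (visible with the command below), so if the DanmEp stored the owning Pod's UID the Cleaner could tell the stale endpoint apart from the fresh one; whether it already does that is part of what I want to check:

kubectl get pod ss-cinfo-0 -o jsonpath='{.metadata.uid}{"\n"}'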

Levovar commented 4 years ago

What I don't understand is how this call: https://github.com/nokia/danm-utils/blob/master/pkg/cleaner/cleaner.go#L99 could result in an error if the Pod did exist.

Maybe it is a race condition between the new Pod starting up and the already triggered cleanup procedure progressing to the deleteInterface phase. What is interesting is that you had two Pods evacuated from the same node, and only one of them had this issue. What happens if you repeat this test 5-10 times? Is the issue persistent, or intermittent?

Levovar commented 4 years ago

After re-inspecting, I'm getting more and more sure it is a race condition. Based on the logs, I think periodic cleaning started at 8:54:30, while very close to that, at 8:54:29, the new Pod was being instantiated. During the ongoing CNI_ADD the DanmEp might have already been created, but the CNI_ADD operation was not yet finished. So when Cleaner activated, it saw two DanmEps belonging to Pod 0 (one old and one new) and one belonging to Pod 4 (the old one). It cleared all of them (we see three entries in the Cleaner log). Two of the cleanings were justified (old Pod 0 and old Pod 4), one was not (new Pod 0).

I think the following upgrades are possibly needed: