piraeusdatastore / piraeus-ha-controller

High Availability Controller for stateful workloads using storage provisioned by Piraeus
Apache License 2.0

failed to fail-over resource kubernetes Version: v1.30.0 #59

Open yeshl opened 2 months ago

yeshl commented 2 months ago

I0423 06:45:42.066690 1 agent.go:253] starting reconciliation
I0423 06:45:52.066708 1 agent.go:253] starting reconciliation
I0423 06:45:52.066824 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master22.host' has failed, evicting
W0423 06:46:05.985734 1 reflector.go:462] pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: watch of v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0423 06:46:05.985738 1 reflector.go:462] pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: watch of v1.VolumeAttachment ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0423 06:46:05.985770 1 reflector.go:462] pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: watch of v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0423 06:46:05.985738 1 reflector.go:462] pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: watch of v1.PersistentVolume ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
E0423 06:46:05.985948 1 reconcile_failover.go:141] "failed to fail-over resource" err="failed to apply node taint: Put \"https://10.96.0.1:443/api/v1/nodes/master22.host?fieldManager=linstor.linbit.com%2Fhigh-availability-controller%2Fv2\": http2: client connection lost"
I0423 06:46:05.986006 1 agent.go:253] starting reconciliation
I0423 06:46:05.986111 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master22.host' has failed, evicting
E0423 06:46:05.997341 1 reconcile_failover.go:141] "failed to fail-over resource" err="failed force detach: volumeattachments.storage.k8s.io \"csi-28b5875796ad4197fe5c795c0ce064930dc9536179e69c3d0edaaf92121ee99b\" not found"
I0423 06:46:12.066698 1 agent.go:253] starting reconciliation
I0423 06:46:12.066840 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master22.host' has failed, evicting
I0423 06:46:22.067170 1 agent.go:253] starting reconciliation
I0423 06:46:22.067312 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master22.host' has failed, evicting

WanzenBug commented 2 months ago

unable to decode an event from the watch stream: http2: client connection lost

This does not seem related to Kubernetes 1.30 or even the HA Controller. It looks like the node master22.host went away, which was probably also hosting the Kubernetes control plane. The HA Controller will simply retry later, which it indeed did:

I0423 06:46:12.066698 1 agent.go:253] starting reconciliation
I0423 06:46:12.066840 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master22.host' has failed, evicting
I0423 06:46:22.067170 1 agent.go:253] starting reconciliation
I0423 06:46:22.067312 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master22.host' has failed, evicting

At that point it did not encounter any issues. Failover time depends on how much work the rest of the cluster has to handle; if a master node fails, failover can be slower because the Kubernetes API as a whole becomes slower.

yeshl commented 2 months ago

Thanks for the reply. My cluster has 3 master nodes, and piraeus-operator was installed and configured following the docs. I simulated a node failure (shutdown or unplugging the network cable; the k8s cluster stays available), but Piraeus cannot complete the failover: it keeps evicting indefinitely, for more than 24 minutes already. I will test with a worker node later.

root@master20:~# linstor v l
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| Node          | Resource                                 | StoragePool          | VolNr | MinorNr | DeviceName    | Allocated | InUse  |      State |
|=====================================================================================================================================================|
| master20.host | pvc-5402447b-9617-4764-902b-93ae4cea6106 | DfltDisklessStorPool |     0 |    1000 | /dev/drbd1000 |           | Unused | TieBreaker |
| master21.host | pvc-5402447b-9617-4764-902b-93ae4cea6106 | pool-01              |     0 |    1000 | /dev/drbd1000 | 65.21 MiB | InUse  |   UpToDate |
| master22.host | pvc-5402447b-9617-4764-902b-93ae4cea6106 | pool-01              |     0 |    1000 | /dev/drbd1000 | 65.34 MiB | Unused |   UpToDate |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+

root@master21:~# poweroff

root@master20:~# kubectl get no
NAME            STATUS     ROLES           AGE     VERSION
master20.host   Ready      control-plane   2d12h   v1.30.0
master21.host   NotReady   control-plane   2d11h   v1.30.0
master22.host   Ready      control-plane   2d11h   v1.30.0

root@master22:~# linstor v l
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| Node          | Resource                                 | StoragePool          | VolNr | MinorNr | DeviceName    | Allocated | InUse  |      State |
|=====================================================================================================================================================|
| master20.host | pvc-5402447b-9617-4764-902b-93ae4cea6106 | DfltDisklessStorPool |     0 |    1000 | /dev/drbd1000 |           | Unused | TieBreaker |
| master21.host | pvc-5402447b-9617-4764-902b-93ae4cea6106 | pool-01              |     0 |    1000 | /dev/drbd1000 |  2.00 GiB |        |    Unknown |
| master22.host | pvc-5402447b-9617-4764-902b-93ae4cea6106 | pool-01              |     0 |    1000 | /dev/drbd1000 | 65.34 MiB | Unused |   UpToDate |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+

root@master20:~# drbdadm status
pvc-5402447b-9617-4764-902b-93ae4cea6106 role:Secondary
  disk:Diskless
  master21.host connection:Connecting
  master22.host role:Secondary
    peer-disk:UpToDate

root@master22:~# drbdadm status
pvc-5402447b-9617-4764-902b-93ae4cea6106 role:Secondary
  disk:UpToDate
  master20.host role:Secondary
    peer-disk:Diskless
  master21.host connection:Connecting

root@master22:~# kubectl get pod -n piraeus-datastore -o wide
NAME                  READY   STATUS    RESTARTS      AGE   IP             NODE            NOMINATED NODE   READINESS GATES
ha-controller-7bdmq   1/1     Running   4 (16h ago)   35h   10.244.2.138   master22.host
ha-controller-g2hlz   1/1     Running   3 (14h ago)   35h   10.244.1.26    master21.host
ha-controller-z59b2   1/1     Running   1 (20h ago)   35h   10.244.0.250   master20.host

root@master22:~# kubectl logs -n piraeus-datastore ha-controller-7bdmq
I0423 23:57:52.307596 1 agent.go:253] starting reconciliation
I0423 23:58:02.307055 1 agent.go:253] starting reconciliation
I0423 23:58:12.307170 1 agent.go:253] starting reconciliation
I0423 23:58:22.307161 1 agent.go:253] starting reconciliation
I0423 23:58:22.307298 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
I0423 23:58:32.307523 1 agent.go:253] starting reconciliation
I0423 23:58:32.307664 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
I0423 23:58:42.307155 1 agent.go:253] starting reconciliation
I0423 23:58:42.307310 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
... (omitted)
I0424 00:21:52.307448 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
I0424 00:22:02.307098 1 agent.go:253] starting reconciliation
I0424 00:22:02.307237 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
I0424 00:22:12.306998 1 agent.go:253] starting reconciliation
I0424 00:22:12.307112 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting

root@master22:~# kubectl logs -n piraeus-datastore ha-controller-z59b2
I0423 23:58:02.067259 1 agent.go:253] starting reconciliation
I0423 23:58:12.067417 1 agent.go:253] starting reconciliation
I0423 23:58:22.067711 1 agent.go:253] starting reconciliation
I0423 23:58:22.067900 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
I0423 23:58:32.067219 1 agent.go:253] starting reconciliation
I0423 23:58:32.067359 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
I0423 23:58:42.066787 1 agent.go:253] starting reconciliation
I0423 23:58:42.066917 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
I0423 23:58:52.066917 1 agent.go:253] starting reconciliation
I0423 23:58:52.067053 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
I0423 23:59:02.067629 1 agent.go:253] starting reconciliation
... (omitted)
I0424 00:24:02.067200 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
I0424 00:24:12.067010 1 agent.go:253] starting reconciliation
I0424 00:24:12.067144 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
I0424 00:24:22.067629 1 agent.go:253] starting reconciliation
I0424 00:24:22.067752 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
I0424 00:24:32.067570 1 agent.go:253] starting reconciliation
I0424 00:24:32.067823 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting

yeshl commented 2 months ago

When it runs on a non-master node, it cannot fail over either. Why? When I force-delete the Terminating Pod, it gets scheduled to the Secondary node and starts running!

I0425 07:09:31.328542 1 agent.go:253] starting reconciliation
I0425 07:09:31.328717 1 reconcile_failover.go:137] resource 'pvc-a28ab865-bf44-4c53-9408-303694756133' on node 'node50.host' has failed, evicting
I0425 07:09:41.328786 1 agent.go:253] starting reconciliation
I0425 07:09:41.328921 1 reconcile_failover.go:137] resource 'pvc-a28ab865-bf44-4c53-9408-303694756133' on node 'node50.host' has failed, evicting
I0425 07:09:51.328672 1 agent.go:253] starting reconciliation
I0425 07:09:51.328824 1 reconcile_failover.go:137] resource 'pvc-a28ab865-bf44-4c53-9408-303694756133' on node 'node50.host' has failed, evicting
I0425 07:10:01.329536 1 agent.go:253] starting reconciliation
I0425 07:10:01.329677 1 reconcile_failover.go:137] resource 'pvc-a28ab865-bf44-4c53-9408-303694756133' on node 'node50.host' has failed, evicting
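For reference, the manual workaround was a plain force delete along these lines (the exact Pod name depends on the StatefulSet; test-sts-web-0 is just an example):

kubectl delete pod test-sts-web-0 --grace-period=0 --force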

yeshl commented 2 months ago
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: sc-piraeus-r2-ha
provisioner: linstor.csi.linbit.com
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
parameters:
  csi.storage.k8s.io/fstype: xfs
  linstor.csi.linbit.com/storagePool: pool-01
  linstor.csi.linbit.com/placementCount: "2"
  linstor.csi.linbit.com/allowRemoteVolumeAccess: "false"
  property.linstor.csi.linbit.com/DrbdOptions/auto-quorum: suspend-io
  property.linstor.csi.linbit.com/DrbdOptions/Resource/on-no-data-accessible: suspend-io
  property.linstor.csi.linbit.com/DrbdOptions/Resource/on-suspended-primary-outdated: force-secondary
  property.linstor.csi.linbit.com/DrbdOptions/Net/rr-conflict: retry-connect
---
apiVersion: v1
kind: Service
metadata:
  name: test-svc-web
spec:
  ports:
    - port: 80
      name: web
  clusterIP: None
  selector:
    app: web
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: test-sts-web
spec:
  selector:
    matchLabels:
      app: web
  serviceName: "test-svc-web"
  replicas: 1
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.25.5-alpine
          ports:
            - containerPort: 80
          volumeMounts:
            - name: pvc
              mountPath: /mnt/data
  volumeClaimTemplates:
    - metadata:
        name: pvc
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: sc-piraeus-r2-ha
        resources:
          requests:
            storage: 2Gi
WanzenBug commented 2 months ago

You could try turning up the verbosity of the HA Controller to see what it tries to do. Edit the LinstorCluster resource to contain:

...
spec:
  highAvailabilityController:
    podTemplate:
      spec:
        containers:
        - name: ha-controller
          args:
          - /agent
          - --v=3   
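After the LinstorCluster resource is edited (assuming the default resource name linstorcluster; adjust to whatever your cluster uses), the operator rolls out new ha-controller pods and the extra --v=3 output shows up in their logs. As a rough sketch, assuming the DaemonSet is named ha-controller as the pod names above suggest:

kubectl edit linstorcluster linstorcluster
kubectl -n piraeus-datastore rollout status daemonset/ha-controller
kubectl -n piraeus-datastore logs daemonset/ha-controller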
yeshl commented 2 months ago

Pod 'default/test-sts-web-0' is exempt from eviction because of unsafe volumes

What does that mean?

containers:
        - name: c-web-server
          image: busybox
          imagePullPolicy: IfNotPresent #default Always
          env:
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: SVC_NAME
              value: "svc-headless"
            - name: DEFAULT_TZ
              value: "Asia/Shanghai"
          command:
            - sh
            - '-c'
            - |-
              trap 'exit 0' SIGTERM
              #rm /mnt/data/index.html
              while true; do
                echo [$(date "+%Y-%m-%d %T")] - $HOSTNAME - $POD_IP '<br>'  |tee -a /mnt/data/index.html
                #touch  /mnt/data/f-$(date +"%Y-%m-%d_%H-%M-%S").txt
                sleep 10
              done
          volumeMounts:
            - name: localtime
              mountPath: /etc/localtime
              readOnly: true
            - name: pvc
              mountPath: /mnt/data
        - name: nginx
          image: nginx:1.25.5-alpine
          env:
            - name: TZ
              value: "Asia/Shanghai"
          ports:
            - containerPort: 80
          volumeMounts:
            - name: conf
              mountPath: /etc/nginx/conf.d/default.conf
              subPath: fileserver.conf
            - name: pvc
              mountPath: /mnt/data
      volumes:
        - name: localtime
          hostPath:
            type: File
            path: /etc/localtime
        - name: conf
          configMap:
            name: test-cm-nginx
#            defaultMode: 0755
  volumeClaimTemplates:
    - metadata:
        name: pvc
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: sc-piraeus-r2-ha
        resources:
          requests:
            storage: 2Gi
WanzenBug commented 2 months ago

Because the pod has a hostPath volume mounted, the HA Controller believes it can't fail over this volume. See https://github.com/piraeusdatastore/piraeus-ha-controller/blob/main/pkg/agent/reconcile_failover.go#L262-L296

Why? Because if you had a hostPath volume and you evicted the Pod and it started on another node, that volume would now have different content. At least that was the idea: only fail over Pods that have only "safe" volumes, i.e. DRBD volumes or other ephemeral volumes.

Looks like in this case it would also be safe, as the /etc/localtime is readOnly... Perhaps we can improve that check.

You can try running the HA Controller with --fail-over-unsafe-pods and see if it works then.
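If the HA Controller is deployed through the operator, that flag can be added the same way as the verbosity flag above, something along these lines (untested sketch):

spec:
  highAvailabilityController:
    podTemplate:
      spec:
        containers:
        - name: ha-controller
          args:
          - /agent
          - --fail-over-unsafe-pods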

yeshl commented 2 months ago

Thank you! It can fail over when I remove the localtime volume. I unplugged the network cable to simulate a server going down, and then plugged it back in after a while to restore the network. I expected the Primary to become Secondary, but it didn't, so I rebooted the server and then it became Secondary. How can it automatically change from Primary to Secondary after the network is restored, without needing to reboot the server?

+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| Node          | Resource                                 | StoragePool          | VolNr | MinorNr | DeviceName    | Allocated | InUse  |      State |
|=====================================================================================================================================================|
| master20.host | pvc-f7d167b9-f486-4cc8-8281-4e7d304819c4 | pool-01              |     0 |    1000 | /dev/drbd1000 |  2.62 MiB | InUse  |   UpToDate |
| master21.host | pvc-f7d167b9-f486-4cc8-8281-4e7d304819c4 | DfltDisklessStorPool |     0 |    1000 | /dev/drbd1000 |           | Unused | TieBreaker |
| master22.host | pvc-f7d167b9-f486-4cc8-8281-4e7d304819c4 | pool-01              |     0 |    1000 | /dev/drbd1000 | 64.80 MiB |        |    Unknown |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
WanzenBug commented 2 months ago

The HA Controller on the "old" Primary node should see that a Pod is stuck in suspend-io and force it to become secondary using drbdadm secondary --force.

yeshl commented 2 months ago
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| Node          | Resource                                 | StoragePool          | VolNr | MinorNr | DeviceName    | Allocated | InUse  |      State |
|=====================================================================================================================================================|
| master20.host | pvc-2028ef6c-82a1-4e7f-8bdf-4179cee1bbd9 | DfltDisklessStorPool |     0 |    1000 | /dev/drbd1000 |           | Unused | TieBreaker |
| node50.host   | pvc-2028ef6c-82a1-4e7f-8bdf-4179cee1bbd9 | pool-01              |     0 |    1000 | /dev/drbd1000 | 64.80 MiB |        |    Unknown |
| node51.host   | pvc-2028ef6c-82a1-4e7f-8bdf-4179cee1bbd9 | pool-01              |     0 |    1000 | /dev/drbd1000 |  2.62 MiB | InUse  |   UpToDate |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
root@master20:~# kubectl get po -n piraeus-datastore -o wide|grep ha
ha-controller-6zlcf                                    1/1     Running   0          27m   10.244.2.250   master22.host   <none>           <none>
ha-controller-7fjnl                                    1/1     Running   0          27m   10.244.4.83    node51.host     <none>           <none>
ha-controller-7p82f                                    1/1     Running   0          27m   10.244.0.118   master20.host   <none>           <none>
ha-controller-bb47w                                    1/1     Running   0          27m   10.244.3.130   node50.host     <none>           <none>
ha-controller-ltjjt                                    1/1     Running   0          27m   10.244.1.126   master21.host   <none>           <none>
root@master20:~# kubectl  -n piraeus-datastore exec ha-controller-bb47w -- drbdadm status
pvc-2028ef6c-82a1-4e7f-8bdf-4179cee1bbd9 role:Secondary suspended:quorum
  disk:UpToDate quorum:no blocked:upper
  master20.host connection:Connecting
  node51.host connection:Connecting
root@master20:~# kubectl  -n piraeus-datastore exec ha-controller-bb47w -- drbdadm secondary --force pvc-2028ef6c-82a1-4e7f-8bdf-4179cee1bbd9
no resources defined!
command terminated with exit code 1
WanzenBug commented 2 months ago

Sorry, should have been drbdsetup secondary --force
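So the same exec as before, just with drbdsetup, should look roughly like this (pod and resource name taken from the output above):

kubectl -n piraeus-datastore exec ha-controller-bb47w -- drbdsetup secondary --force pvc-2028ef6c-82a1-4e7f-8bdf-4179cee1bbd9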

yeshl commented 2 months ago

Can it reconnect and recover automatically?