@AlessandroSechi Are there any messages in the CSI driver pods log? Did the CSI driver work before increasing the number of nodes?
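For reference, a quick way to pull those logs (the controller and node plugin names are taken from the kOps addon manifests shown further down) would be something like:
kubectl -n kube-system logs deployment/hcloud-csi-controller --all-containers --tail=200
kubectl -n kube-system logs daemonset/hcloud-csi-node --all-containers --tail=200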
Are there any messages in the CSI driver pods log?
@hakman Yes, I checked the logs, and I see some errors:
W0619 18:12:45.943030 1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.CSINode ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0619 18:12:45.943042 1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.PersistentVolume ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0619 18:12:45.943105 1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.VolumeAttachment ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0619 21:35:28.192333 1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.CSINode ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0619 21:35:28.192341 1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.PersistentVolume ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0619 21:35:28.192350 1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.VolumeAttachment ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
I0619 21:36:12.321871 1 trace.go:205] Trace[163361275]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (19-Jun-2023 21:35:29.125) (total time: 43177ms):
Trace[163361275]: ---"Objects listed" 43177ms (21:36:00.303)
Trace[163361275]: [43.177606769s] [43.177606769s] END
W0624 21:38:51.893897 1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.CSINode ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0624 21:38:51.893916 1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.PersistentVolume ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0624 21:38:51.893907 1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.VolumeAttachment ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
E0624 21:39:05.882381 1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.CSINode: failed to list *v1.CSINode: Get "https://100.64.0.1:443/apis/storage.k8s.io/v1/csinodes?resourceVersion=16974325": dial tcp 100.64.0.1:443: connect: connection refused
E0624 21:39:06.548055 1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.PersistentVolume: failed to list *v1.PersistentVolume: Get "https://100.64.0.1:443/api/v1/persistentvolumes?resourceVersion=16974252": dial tcp 100.64.0.1:443: connect: connection refused
I also noticed some other errors in hcloud-cloud-controller-manager:
E0626 11:13:01.139544 1 controller.go:310] error processing service ingress-controller/ingress-nginx-controller (will retry): failed to ensure load balancer: hcloud/loadBalancers.EnsureLoadBalancer: hcops/LoadBalancerOps.Create: neither load-balancer.hetzner.cloud/location nor load-balancer.hetzner.cloud/network-zone set
I0626 11:18:01.147669 1 controller.go:407] Ensuring load balancer for service ingress-controller/ingress-nginx-controller
I0626 11:18:01.158815 1 load_balancers.go:108] "ensure Load Balancer" op="hcloud/loadBalancers.EnsureLoadBalancer" service="ingress-nginx-controller" nodes=[nodes-fsn1-67d56c4deadcd70f nodes-fsn1-729c4c76bc120662 nodes-fsn1-361c1a49dec42261]
I0626 11:18:01.160916 1 event.go:294] "Event occurred" object="ingress-controller/ingress-nginx-controller" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I0626 11:18:01.477477 1 event.go:294] "Event occurred" object="ingress-controller/ingress-nginx-controller" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: hcloud/loadBalancers.EnsureLoadBalancer: hcops/LoadBalancerOps.Create: neither load-balancer.hetzner.cloud/location nor load-balancer.hetzner.cloud/network-zone set"
E0626 11:18:01.477840 1 controller.go:310] error processing service ingress-controller/ingress-nginx-controller (will retry): failed to ensure load balancer: hcloud/loadBalancers.EnsureLoadBalancer: hcops/LoadBalancerOps.Create: neither load-balancer.hetzner.cloud/location nor load-balancer.hetzner.cloud/network-zone set
In fact, the machine was not added to the Hetzner LB as a target. I tried to repeat the increase (deleted the node and reapplied the cluster update) but got the same result. Apparently there is some issue in scaling up which also affects the LB.
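For reference, the error above says that neither load-balancer.hetzner.cloud/location nor load-balancer.hetzner.cloud/network-zone is set on the Service. One way to inspect the LB from the Hetzner side and to set a location explicitly (the Service name is taken from the log, the location from the cluster zone; adjust as needed) could be:
hcloud load-balancer list
kubectl -n ingress-controller annotate service ingress-nginx-controller load-balancer.hetzner.cloud/location=fsn1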
Did the CSI driver work before increasing the number of nodes?
Yes everything was working fine
I tried to reproduce the problem without much luck. I used https://github.com/kubernetes/kops/releases/tag/v1.27.0-beta.3, which has a newer CCM and CSI drivers. Please try it as well and see if you can reproduce the issue with this new kOps release.
If the issue still appears, please document the cluster creation args, any changes to the cluster from the defaults, the relevant manifest(s), and any other steps needed to reproduce this.
Hello, the issue still reproduces after scaling the cluster with kOps 1.27.0-beta.3.
Command used for cluster creation:
kops create cluster --name=cluster1.fsn1.hetzner.mywebsite.com --ssh-public-key=/home/key.pub --cloud=hetzner --zones=fsn1 --image=debian-11 --networking=calico --network-cidr=10.10.0.0/16 --master-size cpx11 --master-count 3 --node-count 2 --node-size cpx21
No other changes
Also, this error is present in the csi-attacher container:
I0702 09:15:56.370725 1 main.go:94] Version: v4.1.0
W0702 09:16:06.386447 1 connection.go:173] Still connecting to unix:///run/csi/socket
W0702 09:16:16.387718 1 connection.go:173] Still connecting to unix:///run/csi/socket
I0702 09:16:17.855259 1 common.go:111] Probing CSI driver for readiness
I0702 09:16:17.892917 1 controller.go:130] Starting CSI attacher
W0702 09:33:15.013352 1 reflector.go:347] k8s.io/client-go/informers/factory.go:150: watch of *v1.VolumeAttachment ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0702 09:33:15.013358 1 reflector.go:347] k8s.io/client-go/informers/factory.go:150: watch of *v1.CSINode ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0702 09:33:15.013428 1 reflector.go:347] k8s.io/client-go/informers/factory.go:150: watch of *v1.PersistentVolume ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
I0702 09:33:59.466265 1 trace.go:219] Trace[1006933274]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:150 (02-Jul-2023 09:33:15.987) (total time: 43477ms):
Also, as a collateral issue, the node is never added to the Hetzner LB targets.
It is not clear to me what kind of app you are running, how many replicas, and so on, nor the steps to reproduce the issue.
What is the status/events of CSI pods (describe deployment and daemonset)? Are all pods running and ready (get pods -A -o wide)? What is the status/events for the app pod (describe pod)? What is the status/events for the PVC (describe pvc)?
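Concretely, something along these lines should surface that information (namespaces and resource names are assumptions based on the kOps Hetzner addons and the app described below):
kubectl -n kube-system describe deployment hcloud-csi-controller
kubectl -n kube-system describe daemonset hcloud-csi-node
kubectl get pods -A -o wide
kubectl -n common describe pod consul-server-2
kubectl -n common describe pvc data-common-consul-server-2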
In fact, the machine was not added to the Hetzner LB as a target
This comment is also out of context. How does the LB fit here?
It is not clear to me what kind of app you are running, how many replicas, and so on, nor the steps to reproduce the issue.
The app is consul, installed via the official Helm chart. It has 2 replicas; I'm trying to schedule the third by upgrading the chart with server.replicas=3.
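For reference, the upgrade is roughly the following (the release name and namespace match the pod listing further down; the chart repo is HashiCorp's official one):
helm repo add hashicorp https://helm.releases.hashicorp.com
helm upgrade consul hashicorp/consul -n common --set server.replicas=3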
What is the status/events of CSI pods (describe deployment and daemonset)?
kubectl -n kube-system describe deployment hcloud-cloud-controller-manager
Name: hcloud-cloud-controller-manager
Namespace: kube-system
CreationTimestamp: Thu, 04 May 2023 18:30:59 +0000
Labels: addon.kops.k8s.io/name=hcloud-cloud-controller.addons.k8s.io
app.kubernetes.io/managed-by=kops
k8s-addon=hcloud-cloud-controller.addons.k8s.io
Annotations: deployment.kubernetes.io/revision: 2
Selector: app=hcloud-cloud-controller-manager
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=hcloud-cloud-controller-manager
kops.k8s.io/managed-by=kops
Service Account: cloud-controller-manager
Containers:
hcloud-cloud-controller-manager:
Image: hetznercloud/hcloud-cloud-controller-manager:v1.15.0@sha256:709ddfb2c976d16748d835ed5846333142a6a879dd6c9e5734b6bfac1071ea9f
Port: <none>
Host Port: <none>
Command:
/bin/hcloud-cloud-controller-manager
--allocate-node-cidrs=true
--allow-untagged-cloud=true
--cloud-provider=hcloud
--cluster-cidr=100.64.0.0/10
--configure-cloud-routes=false
--leader-elect=false
--v=2
--use-service-account-credentials=true
Requests:
cpu: 100m
memory: 50Mi
Environment:
NODE_NAME: (v1:spec.nodeName)
HCLOUD_TOKEN: <set to the key 'token' in secret 'hcloud'> Optional: false
HCLOUD_NETWORK: <set to the key 'network' in secret 'hcloud'> Optional: false
Mounts: <none>
Volumes: <none>
Priority Class Name: system-cluster-critical
Conditions:
Type Status Reason
---- ------ ------
Progressing True NewReplicaSetAvailable
Available True MinimumReplicasAvailable
OldReplicaSets: <none>
NewReplicaSet: hcloud-cloud-controller-manager-5fdb77d49b (1/1 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 84m deployment-controller Scaled up replica set hcloud-cloud-controller-manager-5fdb77d49b to 1
Normal ScalingReplicaSet 84m deployment-controller Scaled down replica set hcloud-cloud-controller-manager-df588cd94 to 0 from 1
kubectl -n kube-system describe deployment hcloud-csi-controller
Name: hcloud-csi-controller
Namespace: kube-system
CreationTimestamp: Thu, 04 May 2023 18:31:03 +0000
Labels: addon.kops.k8s.io/name=hcloud-csi-driver.addons.k8s.io
app.kubernetes.io/managed-by=kops
k8s-addon=hcloud-csi-driver.addons.k8s.io
Annotations: deployment.kubernetes.io/revision: 2
Selector: app=hcloud-csi-controller
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=hcloud-csi-controller
kops.k8s.io/managed-by=kops
Service Account: hcloud-csi-controller
Containers:
csi-attacher:
Image: registry.k8s.io/sig-storage/csi-attacher:v4.1.0@sha256:08721106b949e4f5c7ba34b059e17300d73c8e9495201954edc90eeb3e6d8461
Port: <none>
Host Port: <none>
Args:
--default-fstype=ext4
Environment: <none>
Mounts:
/run/csi from socket-dir (rw)
csi-resizer:
Image: registry.k8s.io/sig-storage/csi-resizer:v1.7.0@sha256:3a7bdf5d105783d05d0962fa06ca53032b01694556e633f27366201c2881e01d
Port: <none>
Host Port: <none>
Environment: <none>
Mounts:
/run/csi from socket-dir (rw)
csi-provisioner:
Image: registry.k8s.io/sig-storage/csi-provisioner:v3.4.0@sha256:e468dddcd275163a042ab297b2d8c2aca50d5e148d2d22f3b6ba119e2f31fa79
Port: <none>
Host Port: <none>
Args:
--feature-gates=Topology=true
--default-fstype=ext4
Environment: <none>
Mounts:
/run/csi from socket-dir (rw)
hcloud-csi-driver:
Image: hetznercloud/hcloud-csi-driver:v2.3.2@sha256:b7ed90d5fab2c3fc63bf3ecb2193d3d18c1ec368c7ad98b2dbf633f0ada6afba
Ports: 9189/TCP, 9808/TCP
Host Ports: 0/TCP, 0/TCP
Command:
/bin/hcloud-csi-driver-controller
Liveness: http-get http://:healthz/healthz delay=10s timeout=3s period=2s #success=1 #failure=5
Environment:
CSI_ENDPOINT: unix:///run/csi/socket
METRICS_ENDPOINT: 0.0.0.0:9189
ENABLE_METRICS: true
KUBE_NODE_NAME: (v1:spec.nodeName)
HCLOUD_TOKEN: <set to the key 'token' in secret 'hcloud-csi'> Optional: false
Mounts:
/run/csi from socket-dir (rw)
liveness-probe:
Image: registry.k8s.io/sig-storage/livenessprobe:v2.9.0@sha256:2b10b24dafdc3ba94a03fc94d9df9941ca9d6a9207b927f5dfd21d59fbe05ba0
Port: <none>
Host Port: <none>
Environment: <none>
Mounts:
/run/csi from socket-dir (rw)
Volumes:
socket-dir:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
Conditions:
Type Status Reason
---- ------ ------
Progressing True NewReplicaSetAvailable
Available True MinimumReplicasAvailable
OldReplicaSets: <none>
NewReplicaSet: hcloud-csi-controller-7b6cf877f9 (1/1 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 85m deployment-controller Scaled up replica set hcloud-csi-controller-7b6cf877f9 to 1
Normal ScalingReplicaSet 84m deployment-controller Scaled down replica set hcloud-csi-controller-65f85947bb to 0 from 1
kubectl -n kube-system describe daemonset hcloud-csi-node
Name: hcloud-csi-node
Selector: app=hcloud-csi
Node-Selector: <none>
Labels: addon.kops.k8s.io/name=hcloud-csi-driver.addons.k8s.io
app=hcloud-csi
app.kubernetes.io/managed-by=kops
k8s-addon=hcloud-csi-driver.addons.k8s.io
Annotations: deprecated.daemonset.template.generation: 2
Desired Number of Nodes Scheduled: 6
Current Number of Nodes Scheduled: 6
Number of Nodes Scheduled with Up-to-date Pods: 6
Number of Nodes Scheduled with Available Pods: 6
Number of Nodes Misscheduled: 0
Pods Status: 6 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app=hcloud-csi
kops.k8s.io/managed-by=kops
Containers:
csi-node-driver-registrar:
Image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.7.0@sha256:4a4cae5118c4404e35d66059346b7fa0835d7e6319ff45ed73f4bba335cf5183
Port: <none>
Host Port: <none>
Args:
--kubelet-registration-path=/var/lib/kubelet/plugins/csi.hetzner.cloud/socket
Environment: <none>
Mounts:
/registration from registration-dir (rw)
/run/csi from plugin-dir (rw)
hcloud-csi-driver:
Image: hetznercloud/hcloud-csi-driver:v2.3.2@sha256:b7ed90d5fab2c3fc63bf3ecb2193d3d18c1ec368c7ad98b2dbf633f0ada6afba
Ports: 9189/TCP, 9808/TCP
Host Ports: 0/TCP, 0/TCP
Command:
/bin/hcloud-csi-driver-node
Liveness: http-get http://:healthz/healthz delay=10s timeout=3s period=2s #success=1 #failure=5
Environment:
CSI_ENDPOINT: unix:///run/csi/socket
METRICS_ENDPOINT: 0.0.0.0:9189
ENABLE_METRICS: true
Mounts:
/dev from device-dir (rw)
/run/csi from plugin-dir (rw)
/var/lib/kubelet from kubelet-dir (rw)
liveness-probe:
Image: registry.k8s.io/sig-storage/livenessprobe:v2.9.0@sha256:2b10b24dafdc3ba94a03fc94d9df9941ca9d6a9207b927f5dfd21d59fbe05ba0
Port: <none>
Host Port: <none>
Environment: <none>
Mounts:
/run/csi from plugin-dir (rw)
Volumes:
kubelet-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet
HostPathType: Directory
plugin-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/plugins/csi.hetzner.cloud/
HostPathType: DirectoryOrCreate
registration-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/plugins_registry/
HostPathType: Directory
device-dir:
Type: HostPath (bare host directory volume)
Path: /dev
HostPathType: Directory
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulDelete 87m daemonset-controller Deleted pod: hcloud-csi-node-kpjmp
Normal SuccessfulCreate 86m daemonset-controller Created pod: hcloud-csi-node-nkglw
Normal SuccessfulDelete 86m daemonset-controller Deleted pod: hcloud-csi-node-z9d2w
Warning FailedDaemonPod 86m daemonset-controller Found failed daemon pod kube-system/hcloud-csi-node-j97jz on node control-plane-fsn1-2-32c8473e2e051454, will try to kill it
Normal SuccessfulDelete 86m daemonset-controller Deleted pod: hcloud-csi-node-j97jz
Normal SuccessfulCreate 86m daemonset-controller Created pod: hcloud-csi-node-qhtzp
Normal SuccessfulCreate 86m daemonset-controller Created pod: hcloud-csi-node-t9hlj
Normal SuccessfulDelete 85m daemonset-controller Deleted pod: hcloud-csi-node-c42lh
Normal SuccessfulCreate 85m daemonset-controller Created pod: hcloud-csi-node-g4bmv
Normal SuccessfulDelete 85m daemonset-controller Deleted pod: hcloud-csi-node-k7l97
Normal SuccessfulCreate 85m daemonset-controller Created pod: hcloud-csi-node-2xwkm
Normal SuccessfulCreate 85m daemonset-controller Created pod: hcloud-csi-node-4chhv
Warning FailedDaemonPod 79m daemonset-controller Found failed daemon pod kube-system/hcloud-csi-node-t9hlj on node control-plane-fsn1-1-3946975372c0925c, will try to kill it
Normal SuccessfulDelete 79m daemonset-controller Deleted pod: hcloud-csi-node-t9hlj
Normal SuccessfulCreate 79m daemonset-controller Created pod: hcloud-csi-node-k25ct
Warning FailedDaemonPod 78m daemonset-controller Found failed daemon pod kube-system/hcloud-csi-node-k25ct on node control-plane-fsn1-1-3946975372c0925c, will try to kill it
Normal SuccessfulDelete 78m daemonset-controller Deleted pod: hcloud-csi-node-k25ct
Normal SuccessfulCreate 78m daemonset-controller Created pod: hcloud-csi-node-g6rrm
Warning FailedDaemonPod 76m daemonset-controller Found failed daemon pod kube-system/hcloud-csi-node-g6rrm on node control-plane-fsn1-1-3946975372c0925c, will try to kill it
Normal SuccessfulDelete 76m daemonset-controller Deleted pod: hcloud-csi-node-g6rrm
Normal SuccessfulCreate 76m daemonset-controller Created pod: hcloud-csi-node-qsfkp
Warning FailedDaemonPod 67m daemonset-controller Found failed daemon pod kube-system/hcloud-csi-node-2xwkm on node control-plane-fsn1-3-42f1b452575dc956, will try to kill it
Normal SuccessfulDelete 67m daemonset-controller Deleted pod: hcloud-csi-node-2xwkm
Normal SuccessfulCreate 67m daemonset-controller Created pod: hcloud-csi-node-6tr4h
Warning FailedDaemonPod 66m daemonset-controller Found failed daemon pod kube-system/hcloud-csi-node-qsfkp on node control-plane-fsn1-1-3946975372c0925c, will try to kill it
Warning FailedDaemonPod 66m daemonset-controller Found failed daemon pod kube-system/hcloud-csi-node-qhtzp on node control-plane-fsn1-2-32c8473e2e051454, will try to kill it
Normal SuccessfulDelete 66m daemonset-controller Deleted pod: hcloud-csi-node-qhtzp
Normal SuccessfulDelete 66m daemonset-controller Deleted pod: hcloud-csi-node-qsfkp
Normal SuccessfulCreate 66m daemonset-controller Created pod: hcloud-csi-node-vrmjt
Normal SuccessfulCreate 66m daemonset-controller Created pod: hcloud-csi-node-k86xn
Warning FailedDaemonPod 61m daemonset-controller Found failed daemon pod kube-system/hcloud-csi-node-6tr4h on node control-plane-fsn1-3-42f1b452575dc956, will try to kill it
Normal SuccessfulDelete 61m daemonset-controller Deleted pod: hcloud-csi-node-6tr4h
Normal SuccessfulCreate 61m daemonset-controller Created pod: hcloud-csi-node-8w9wg
Warning FailedDaemonPod 52m daemonset-controller Found failed daemon pod kube-system/hcloud-csi-node-vrmjt on node control-plane-fsn1-1-3946975372c0925c, will try to kill it
Normal SuccessfulDelete 52m daemonset-controller Deleted pod: hcloud-csi-node-vrmjt
Normal SuccessfulCreate 52m daemonset-controller Created pod: hcloud-csi-node-p6xkq
Warning FailedDaemonPod 49m daemonset-controller Found failed daemon pod kube-system/hcloud-csi-node-p6xkq on node control-plane-fsn1-1-3946975372c0925c, will try to kill it
Normal SuccessfulDelete 49m daemonset-controller Deleted pod: hcloud-csi-node-p6xkq
Normal SuccessfulCreate 49m daemonset-controller Created pod: hcloud-csi-node-2zj6j
Warning FailedDaemonPod 48m daemonset-controller Found failed daemon pod kube-system/hcloud-csi-node-2zj6j on node control-plane-fsn1-1-3946975372c0925c, will try to kill it
Normal SuccessfulDelete 48m daemonset-controller Deleted pod: hcloud-csi-node-2zj6j
Normal SuccessfulCreate 48m daemonset-controller Created pod: hcloud-csi-node-d6qxv
Warning FailedDaemonPod 48m daemonset-controller Found failed daemon pod kube-system/hcloud-csi-node-8w9wg on node control-plane-fsn1-3-42f1b452575dc956, will try to kill it
Normal SuccessfulDelete 48m daemonset-controller Deleted pod: hcloud-csi-node-8w9wg
Normal SuccessfulCreate 48m daemonset-controller Created pod: hcloud-csi-node-s9ttb
Warning FailedDaemonPod 45m daemonset-controller Found failed daemon pod kube-system/hcloud-csi-node-s9ttb on node control-plane-fsn1-3-42f1b452575dc956, will try to kill it
Normal SuccessfulDelete 45m daemonset-controller Deleted pod: hcloud-csi-node-s9ttb
Normal SuccessfulCreate 45m daemonset-controller Created pod: hcloud-csi-node-zkw2p
Warning FailedDaemonPod 41m daemonset-controller Found failed daemon pod kube-system/hcloud-csi-node-zkw2p on node control-plane-fsn1-3-42f1b452575dc956, will try to kill it
Normal SuccessfulDelete 41m daemonset-controller Deleted pod: hcloud-csi-node-zkw2p
Normal SuccessfulCreate 41m daemonset-controller Created pod: hcloud-csi-node-lrtv7
Normal SuccessfulCreate 32m daemonset-controller Created pod: hcloud-csi-node-ttbwt
Warning FailedDaemonPod 29m daemonset-controller Found failed daemon pod kube-system/hcloud-csi-node-d6qxv on node control-plane-fsn1-1-3946975372c0925c, will try to kill it
Normal SuccessfulDelete 29m daemonset-controller Deleted pod: hcloud-csi-node-d6qxv
Normal SuccessfulCreate 29m daemonset-controller Created pod: hcloud-csi-node-s6lgc
Warning FailedDaemonPod 27m daemonset-controller Found failed daemon pod kube-system/hcloud-csi-node-s6lgc on node control-plane-fsn1-1-3946975372c0925c, will try to kill it
Normal SuccessfulDelete 27m daemonset-controller Deleted pod: hcloud-csi-node-s6lgc
Normal SuccessfulCreate 27m daemonset-controller Created pod: hcloud-csi-node-lmv78
Warning FailedDaemonPod 25m daemonset-controller Found failed daemon pod kube-system/hcloud-csi-node-k86xn on node control-plane-fsn1-2-32c8473e2e051454, will try to kill it
Normal SuccessfulDelete 25m daemonset-controller Deleted pod: hcloud-csi-node-k86xn
Normal SuccessfulCreate 25m daemonset-controller Created pod: hcloud-csi-node-hz6mn
Warning FailedDaemonPod 24m daemonset-controller Found failed daemon pod kube-system/hcloud-csi-node-lmv78 on node control-plane-fsn1-1-3946975372c0925c, will try to kill it
Normal SuccessfulDelete 24m daemonset-controller Deleted pod: hcloud-csi-node-lmv78
Normal SuccessfulCreate 24m daemonset-controller Created pod: hcloud-csi-node-hnbs9
Normal SuccessfulCreate 8m55s daemonset-controller Created pod: hcloud-csi-node-kc2rs
Are all pods running and ready (get pods -A -o wide)?
Yes, except the new consul replica
kubectl get pods -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
common alertmanager-prometheus-kube-prometheus-alertmanager-0 2/2 Running 1 (22h ago) 22h 100.105.181.20 nodes-fsn1-729c4c76bc120662 <none> <none>
common consul-connect-injector-69b95b77bd-szfvx 1/1 Running 123 (42m ago) 56d 100.105.181.53 nodes-fsn1-729c4c76bc120662 <none> <none>
common consul-server-0 1/1 Running 0 58d 100.103.23.83 nodes-fsn1-361c1a49dec42261 <none> <none>
common consul-server-1 1/1 Running 0 19h 100.105.181.44 nodes-fsn1-729c4c76bc120662 <none> <none>
common consul-server-2 0/1 ContainerCreating 0 8m40s <none> nodes-fsn1-6e1adf53318e07f1 <none> <none>
common consul-webhook-cert-manager-6c85944667-wstmx 1/1 Running 0 22h 100.105.181.61 nodes-fsn1-729c4c76bc120662 <none> <none>
common mariadb-mariadb-galera-0 2/2 Running 0 19h 100.105.181.55 nodes-fsn1-729c4c76bc120662 <none> <none>
common mariadb-mariadb-galera-1 2/2 Running 0 19h 100.105.181.45 nodes-fsn1-729c4c76bc120662 <none> <none>
common prometheus-grafana-b5dccf59-bhcr2 3/3 Running 2 (9h ago) 22h 100.105.181.10 nodes-fsn1-729c4c76bc120662 <none> <none>
common prometheus-kube-prometheus-operator-64f776b465-q9xrw 1/1 Running 0 22h 100.105.181.9 nodes-fsn1-729c4c76bc120662 <none> <none>
common prometheus-kube-state-metrics-6bdd65d76-r4gqc 1/1 Running 0 58d 100.105.181.17 nodes-fsn1-729c4c76bc120662 <none> <none>
common prometheus-prometheus-kube-prometheus-prometheus-0 2/2 Running 0 56d 100.105.181.52 nodes-fsn1-729c4c76bc120662 <none> <none>
common prometheus-prometheus-node-exporter-75kfn 1/1 Running 1 (20h ago) 56d 10.10.0.7 nodes-fsn1-729c4c76bc120662 <none> <none>
common prometheus-prometheus-node-exporter-fn4hr 1/1 Running 4 (67m ago) 2d18h 10.10.0.5 control-plane-fsn1-2-32c8473e2e051454 <none> <none>
common prometheus-prometheus-node-exporter-qc8kk 1/1 Running 0 56d 10.10.0.3 nodes-fsn1-361c1a49dec42261 <none> <none>
common prometheus-prometheus-node-exporter-sc86v 1/1 Running 2 (26m ago) 49m 10.10.0.4 control-plane-fsn1-1-3946975372c0925c <none> <none>
common prometheus-prometheus-node-exporter-tfjqs 1/1 Running 0 10m 10.10.0.9 nodes-fsn1-6e1adf53318e07f1 <none> <none>
common prometheus-prometheus-node-exporter-v4zl2 1/1 Running 5 (43m ago) 15d 10.10.0.6 control-plane-fsn1-3-42f1b452575dc956 <none> <none>
common storage-proxy-77ddc56cf6-v6srl 1/1 Running 1 (20h ago) 45d 100.105.181.62 nodes-fsn1-729c4c76bc120662 <none> <none>
common volpod 1/1 Running 0 19h 100.105.181.24 nodes-fsn1-729c4c76bc120662 <none> <none>
ingress-controller cert-manager-5ff989dc45-6zfbf 1/1 Running 51 (51m ago) 56d 100.105.181.35 nodes-fsn1-729c4c76bc120662 <none> <none>
ingress-controller cert-manager-cainjector-d8c5dc896-pm682 1/1 Running 4 (30m ago) 22h 100.105.181.21 nodes-fsn1-729c4c76bc120662 <none> <none>
ingress-controller cert-manager-webhook-67bd96ff64-dmmk5 1/1 Running 1 (20h ago) 22h 100.105.181.38 nodes-fsn1-729c4c76bc120662 <none> <none>
ingress-controller ingress-nginx-controller-7fb5787978-7rzbc 1/1 Running 0 18h 100.105.181.33 nodes-fsn1-729c4c76bc120662 <none> <none>
ingress-controller ingress-nginx-controller-7fb5787978-ctk8f 1/1 Running 0 18h 100.105.181.7 nodes-fsn1-729c4c76bc120662 <none> <none>
kube-system calico-kube-controllers-66fc944d4b-xcj2p 1/1 Running 1 (29m ago) 88m 100.103.65.141 control-plane-fsn1-2-32c8473e2e051454 <none> <none>
kube-system calico-node-9cdwg 1/1 Running 0 88m 10.10.0.7 nodes-fsn1-729c4c76bc120662 <none> <none>
kube-system calico-node-fw228 1/1 Running 0 85m 10.10.0.4 control-plane-fsn1-1-3946975372c0925c <none> <none>
kube-system calico-node-qklxl 1/1 Running 0 10m 10.10.0.9 nodes-fsn1-6e1adf53318e07f1 <none> <none>
kube-system calico-node-wh8zl 1/1 Running 0 87m 10.10.0.5 control-plane-fsn1-2-32c8473e2e051454 <none> <none>
kube-system calico-node-wlhxt 1/1 Running 0 86m 10.10.0.3 nodes-fsn1-361c1a49dec42261 <none> <none>
kube-system calico-node-zwftj 1/1 Running 0 86m 10.10.0.6 control-plane-fsn1-3-42f1b452575dc956 <none> <none>
kube-system coredns-6d7f697665-5gmss 1/1 Running 0 88m 100.105.181.41 nodes-fsn1-729c4c76bc120662 <none> <none>
kube-system coredns-6d7f697665-7n7gd 1/1 Running 0 88m 100.105.181.13 nodes-fsn1-729c4c76bc120662 <none> <none>
kube-system coredns-autoscaler-6f7745894d-6dscd 1/1 Running 0 88m 100.105.181.60 nodes-fsn1-729c4c76bc120662 <none> <none>
kube-system etcd-manager-events-control-plane-fsn1-1-3946975372c0925c 1/1 Running 2 (22h ago) 22h 10.10.0.4 control-plane-fsn1-1-3946975372c0925c <none> <none>
kube-system etcd-manager-events-control-plane-fsn1-2-32c8473e2e051454 1/1 Running 2 (22h ago) 58d 10.10.0.5 control-plane-fsn1-2-32c8473e2e051454 <none> <none>
kube-system etcd-manager-events-control-plane-fsn1-3-42f1b452575dc956 1/1 Running 3 (22h ago) 58d 10.10.0.6 control-plane-fsn1-3-42f1b452575dc956 <none> <none>
kube-system etcd-manager-main-control-plane-fsn1-1-3946975372c0925c 1/1 Running 2 (22h ago) 22h 10.10.0.4 control-plane-fsn1-1-3946975372c0925c <none> <none>
kube-system etcd-manager-main-control-plane-fsn1-2-32c8473e2e051454 1/1 Running 2 (22h ago) 58d 10.10.0.5 control-plane-fsn1-2-32c8473e2e051454 <none> <none>
kube-system etcd-manager-main-control-plane-fsn1-3-42f1b452575dc956 1/1 Running 3 (22h ago) 58d 10.10.0.6 control-plane-fsn1-3-42f1b452575dc956 <none> <none>
kube-system hcloud-cloud-controller-manager-5fdb77d49b-cn2vm 1/1 Running 0 88m 10.10.0.4 control-plane-fsn1-1-3946975372c0925c <none> <none>
kube-system hcloud-csi-controller-7b6cf877f9-mcvnf 5/5 Running 0 88m 100.105.181.19 nodes-fsn1-729c4c76bc120662 <none> <none>
kube-system hcloud-csi-node-g4bmv 3/3 Running 0 87m 100.103.23.75 nodes-fsn1-361c1a49dec42261 <none> <none>
kube-system hcloud-csi-node-hnbs9 3/3 Running 0 26m 100.116.56.206 control-plane-fsn1-1-3946975372c0925c <none> <none>
kube-system hcloud-csi-node-hz6mn 3/3 Running 0 27m 100.103.65.144 control-plane-fsn1-2-32c8473e2e051454 <none> <none>
kube-system hcloud-csi-node-kc2rs 3/3 Running 0 10m 100.101.101.65 nodes-fsn1-6e1adf53318e07f1 <none> <none>
kube-system hcloud-csi-node-lrtv7 3/3 Running 0 43m 100.116.242.7 control-plane-fsn1-3-42f1b452575dc956 <none> <none>
kube-system hcloud-csi-node-nkglw 3/3 Running 0 88m 100.105.181.40 nodes-fsn1-729c4c76bc120662 <none> <none>
kube-system kops-controller-8qvl6 1/1 Running 46 (42m ago) 58d 10.10.0.5 control-plane-fsn1-2-32c8473e2e051454 <none> <none>
kube-system kops-controller-f8nsh 1/1 Running 43 (49m ago) 58d 10.10.0.6 control-plane-fsn1-3-42f1b452575dc956 <none> <none>
kube-system kops-controller-mwvnp 1/1 Running 5 (54m ago) 22h 10.10.0.4 control-plane-fsn1-1-3946975372c0925c <none> <none>
kube-system kube-apiserver-control-plane-fsn1-1-3946975372c0925c 2/2 Running 10 (26m ago) 22h 10.10.0.4 control-plane-fsn1-1-3946975372c0925c <none> <none>
kube-system kube-apiserver-control-plane-fsn1-2-32c8473e2e051454 2/2 Running 52 (67m ago) 58d 10.10.0.5 control-plane-fsn1-2-32c8473e2e051454 <none> <none>
kube-system kube-apiserver-control-plane-fsn1-3-42f1b452575dc956 2/2 Running 50 (43m ago) 58d 10.10.0.6 control-plane-fsn1-3-42f1b452575dc956 <none> <none>
kube-system kube-controller-manager-control-plane-fsn1-1-3946975372c0925c 1/1 Running 8 (54m ago) 22h 10.10.0.4 control-plane-fsn1-1-3946975372c0925c <none> <none>
kube-system kube-controller-manager-control-plane-fsn1-2-32c8473e2e051454 1/1 Running 78 (42m ago) 58d 10.10.0.5 control-plane-fsn1-2-32c8473e2e051454 <none> <none>
kube-system kube-controller-manager-control-plane-fsn1-3-42f1b452575dc956 1/1 Running 65 (63m ago) 58d 10.10.0.6 control-plane-fsn1-3-42f1b452575dc956 <none> <none>
kube-system kube-proxy-control-plane-fsn1-1-3946975372c0925c 1/1 Running 1 (22h ago) 22h 10.10.0.4 control-plane-fsn1-1-3946975372c0925c <none> <none>
kube-system kube-proxy-control-plane-fsn1-2-32c8473e2e051454 1/1 Running 1 (22h ago) 58d 10.10.0.5 control-plane-fsn1-2-32c8473e2e051454 <none> <none>
kube-system kube-proxy-control-plane-fsn1-3-42f1b452575dc956 1/1 Running 2 (22h ago) 58d 10.10.0.6 control-plane-fsn1-3-42f1b452575dc956 <none> <none>
kube-system kube-proxy-nodes-fsn1-361c1a49dec42261 1/1 Running 0 58d 10.10.0.3 nodes-fsn1-361c1a49dec42261 <none> <none>
kube-system kube-proxy-nodes-fsn1-6e1adf53318e07f1 1/1 Running 0 10m 10.10.0.9 nodes-fsn1-6e1adf53318e07f1 <none> <none>
kube-system kube-proxy-nodes-fsn1-729c4c76bc120662 1/1 Running 0 58d 10.10.0.7 nodes-fsn1-729c4c76bc120662 <none> <none>
kube-system kube-scheduler-control-plane-fsn1-1-3946975372c0925c 1/1 Running 4 (31m ago) 22h 10.10.0.4 control-plane-fsn1-1-3946975372c0925c <none> <none>
kube-system kube-scheduler-control-plane-fsn1-2-32c8473e2e051454 1/1 Running 48 (42m ago) 58d 10.10.0.5 control-plane-fsn1-2-32c8473e2e051454 <none> <none>
kube-system kube-scheduler-control-plane-fsn1-3-42f1b452575dc956 1/1 Running 45 (69m ago) 58d 10.10.0.6 control-plane-fsn1-3-42f1b452575dc956 <none> <none>
What is the status/events for the app pod (describe pod)?
Now I see the error has changed, but the pod still can't be scheduled.
kubectl -n common describe pod consul-server-2
Name: consul-server-2
Namespace: common
Priority: 0
Service Account: consul-server
Node: nodes-fsn1-6e1adf53318e07f1/10.10.0.9
Start Time: Sun, 02 Jul 2023 10:35:41 +0000
Labels: app=consul
chart=consul-helm
component=server
controller-revision-hash=consul-server-76b6576bff
hasDNS=true
release=consul
statefulset.kubernetes.io/pod-name=consul-server-2
Annotations: consul.hashicorp.com/config-checksum: 0b003a5539ab09175e389b7e89105615c0394b10c54fd6893c3a084f5ce99f2e
consul.hashicorp.com/connect-inject: false
Status: Pending
IP:
IPs: <none>
Controlled By: StatefulSet/consul-server
Containers:
consul:
Container ID:
Image: hashicorp/consul:1.14.2
Image ID:
Ports: 8500/TCP, 8502/TCP, 8301/TCP, 8301/UDP, 8302/TCP, 8302/UDP, 8300/TCP, 8600/TCP, 8600/UDP
Host Ports: 0/TCP, 0/TCP, 0/TCP, 0/UDP, 0/TCP, 0/UDP, 0/TCP, 0/TCP, 0/UDP
Command:
/bin/sh
-ec
cp /consul/config/extra-from-values.json /consul/extra-config/extra-from-values.json
[ -n "${HOST_IP}" ] && sed -Ei "s|HOST_IP|${HOST_IP?}|g" /consul/extra-config/extra-from-values.json
[ -n "${POD_IP}" ] && sed -Ei "s|POD_IP|${POD_IP?}|g" /consul/extra-config/extra-from-values.json
[ -n "${HOSTNAME}" ] && sed -Ei "s|HOSTNAME|${HOSTNAME?}|g" /consul/extra-config/extra-from-values.json
exec /usr/local/bin/docker-entrypoint.sh consul agent \
-advertise="${ADVERTISE_IP}" \
-config-dir=/consul/config \
-config-file=/consul/extra-config/extra-from-values.json
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 100m
memory: 100Mi
Requests:
cpu: 100m
memory: 100Mi
Readiness: exec [/bin/sh -ec curl http://127.0.0.1:8500/v1/status/leader \
2>/dev/null | grep -E '".+"'
] delay=5s timeout=5s period=3s #success=1 #failure=2
Environment:
ADVERTISE_IP: (v1:status.podIP)
HOST_IP: (v1:status.hostIP)
POD_IP: (v1:status.podIP)
CONSUL_DISABLE_PERM_MGMT: true
Mounts:
/consul/config from config (rw)
/consul/data from data-common (rw)
/consul/extra-config from extra-config (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-djbbh (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
data-common:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: data-common-consul-server-2
ReadOnly: false
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: consul-server-config
Optional: false
extra-config:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-djbbh:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 11m default-scheduler Successfully assigned common/consul-server-2 to nodes-fsn1-6e1adf53318e07f1
Warning FailedMount 9m16s kubelet Unable to attach or mount volumes: unmounted volumes=[data-common], unattached volumes=[extra-config kube-api-access-djbbh data-common config]: timed out waiting for the condition
Warning FailedMount 4m44s (x2 over 7m1s) kubelet Unable to attach or mount volumes: unmounted volumes=[data-common], unattached volumes=[data-common config extra-config kube-api-access-djbbh]: timed out waiting for the condition
Warning FailedMount 2m29s kubelet Unable to attach or mount volumes: unmounted volumes=[data-common], unattached volumes=[config extra-config kube-api-access-djbbh data-common]: timed out waiting for the condition
Warning FailedAttachVolume 56s (x13 over 11m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-60560158-d8b4-4908-aa6f-8df0c53d810e" : rpc error: code = NotFound desc = failed to publish volume: volume not found
Warning FailedMount 15s kubelet Unable to attach or mount volumes: unmounted volumes=[data-common], unattached volumes=[kube-api-access-djbbh data-common config extra-config]: timed out waiting for the condition
What is the status/events for the PVC (describe pvc)?
I see that the PVC was apparently created some days ago (I made many tests, so it seems that at some point it worked), but from the previous logs it seems the volume is not found.
kubectl -n common get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
data-common-consul-server-0 Bound pvc-c5cd056a-7551-45a0-b3fe-62086acb8dbb 10Gi RWO hcloud-volumes 58d
data-common-consul-server-1 Bound pvc-85745d55-213f-4bbe-af85-6604fe9f75ca 10Gi RWO hcloud-volumes 58d
data-common-consul-server-2 Bound pvc-60560158-d8b4-4908-aa6f-8df0c53d810e 10Gi RWO hcloud-volumes 6d22h
data-mariadb-mariadb-galera-0 Bound pvc-4954ff56-ea22-4a2f-9c18-c94a53b4711f 40Gi RWO hcloud-volumes 58d
data-mariadb-mariadb-galera-1 Bound pvc-a209bb87-746c-4d2e-82b6-9b901d0940f6 40Gi RWO hcloud-volumes 58d
prometheus-grafana Bound pvc-25606845-ee48-441b-b685-656da862d248 10Gi RWO hcloud-volumes 58d
kubectl -n common describe pvc data-common-consul-server-2
Name: data-common-consul-server-2
Namespace: common
StorageClass: hcloud-volumes
Status: Bound
Volume: pvc-60560158-d8b4-4908-aa6f-8df0c53d810e
Labels: app=consul
chart=consul-helm
component=server
hasDNS=true
release=consul
Annotations: pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: csi.hetzner.cloud
volume.kubernetes.io/selected-node: nodes-fsn1-722add04b36be89a
volume.kubernetes.io/storage-provisioner: csi.hetzner.cloud
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 10Gi
Access Modes: RWO
VolumeMode: Filesystem
Used By: consul-server-2
Events: <none>
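Given the "volume not found" attach error, and since the volume.kubernetes.io/selected-node annotation still points at nodes-fsn1-722add04b36be89a (a node that no longer appears in the pod listing above), it might be worth cross-checking whether the volume referenced by the PV still exists on the Hetzner side, for example:
kubectl get pv pvc-60560158-d8b4-4908-aa6f-8df0c53d810e -o jsonpath='{.spec.csi.volumeHandle}'
hcloud volume list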
This comment is also out of context. How does the LB fit here?
It is just something "strange" I noticed along with this issue: as I remember, when the cluster was created, nodes were automatically added to the LB, and this didn't happen now that I scaled the cluster. I'm not sure whether it is somehow related or intended; I just reported it in case it is helpful.
Steps which lead to the issue:
1. Deploy a new cluster with:
kops create cluster --name=cluster1.fsn1.hetzner.mywebsite.com --ssh-public-key=/home/key.pub --cloud=hetzner --zones=fsn1 --image=debian-11 --networking=calico --network-cidr=10.10.0.0/16 --master-size cpx11 --master-count 3 --node-count 2 --node-size cpx21
2. Install hashicorp/consul with 2 replicas.
3. Add a new node with kops edit ig my-node, setting minSize: 3 and maxSize: 3 (a non-interactive equivalent is sketched after this list).
4. Run kops update cluster --yes.
5. helm upgrade consul with server.replicas=3.
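A non-interactive version of step 3, in case a scripted reproduction helps (the instance group name nodes-fsn1 is an assumption based on the node names in this thread):
kops get ig nodes-fsn1 -o yaml > ig.yaml
# set minSize and maxSize to 3 in ig.yaml, then apply and update:
kops replace -f ig.yaml
kops update cluster --yes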
I deleted the consul pod and PVC, scaled the cluster to three nodes again, and redeployed consul; the pods are now correctly scheduled, so I'm closing the issue. Thank you for your time.
2GB of memory for the masters is insufficient. This is probably why you are seeing so many pod restarts, which points to cluster instability. Please switch to something with 4GB of memory or more and investigate pod crashes if they still happen.
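A rough sketch of that change with kOps (the control-plane instance group names are assumptions based on the node names above; cpx21 is Hetzner's 4 GB plan):
kops edit ig control-plane-fsn1-1   # change machineType from cpx11 to cpx21; repeat for the other control-plane instance groups
kops update cluster --yes
kops rolling-update cluster --yes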
/kind bug
1. What kops version are you running? The command kops version will display this information.
1.26.2
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
3. What cloud provider are you using?
Hetzner
4. What commands did you run? What is the simplest way to reproduce this issue?
Added a new node to a 2-node cluster with kops edit ig my-node, then minSize: 3 and maxSize: 3. Applied changes with kops update cluster --yes.
Then kops validate cluster returns Your cluster cluster1.fsn1.hetzner.mywebsite.com is ready.
5. What happened after the commands executed?
No pods which use a PVC can be scheduled. In describe pod I see
6. What did you expect to happen?
Have the ability to run pods using a PVC.
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else do we need to know?