Open shao77622 opened 1 year ago
Did you add tolerations in your deployment to keep the pod from being migrated when the node is offline?
```yaml
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
```
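For reference, pinning a non-DaemonSet pod to a down node works the same way: a `NoExecute` toleration with no `tolerationSeconds` is tolerated forever, so the node-lifecycle controller never evicts the pod. A minimal sketch of such a pod-spec fragment (values are illustrative):

```yaml
# Illustrative pod-spec fragment: without tolerationSeconds, the pod
# tolerates the taint indefinitely and is never evicted from the node.
tolerations:
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
```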
No tolerations, but it's a DaemonSet. I tested with kubeedge v1.12.0; the pod will not be recreated.

sample.yaml:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: redis-3.2.9
  namespace: edge
  labels:
    app: redis
spec:
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      nodeSelector:
        xxx/redis: 3.2.9
      hostNetwork: true
      containers:
      # ...
```
From your description, the pod is deleted and recreated by the k8s controller manager. I find that strange, because pods created by a DaemonSet automatically get tolerations added (https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#taints-and-tolerations), so the pod should not be evicted.
The pod will be terminated, and a pod with another name will be scheduled.
sample.yaml:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: redis-3.2.9
  namespace: edge
  labels:
    app: redis
spec:
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      nodeSelector:
        xxx/redis: 3.2.9
      hostNetwork: true
      containers:
      - args:
        - '--requirepass'
        - '888888'
        image: redis:3.2.9
        name: redis
        ports:
        - containerPort: 6379
          name: redis
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/localtime
          name: volume-localtime
      volumes:
      - hostPath:
          path: /etc/localtime
          type: ''
        name: volume-localtime
```
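To double-check what tolerations actually ended up on the running pod (the DaemonSet controller injects them at pod creation), something like this should work; the pod name here is illustrative:

```shell
# Print key/effect for every toleration on the daemonset pod.
kubectl get pod redis-3.2.9-bwhgw -n edge \
  -o jsonpath='{range .spec.tolerations[*]}{.key}{"\t"}{.effect}{"\n"}{end}'
```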
Could you paste the output of `kubectl get pod xxx` and the cloudcore log?
```
$ kubectl get pod -n edge
NAME              READY   STATUS    RESTARTS   AGE
mysql-arm-94wrz   1/1     Running   0          111s
```
before reconnect: after reconnect:
```
$ kubectl -n kubeedge logs cloudcore-69c88b5fdd-dr2fk
I1121 14:01:38.509835 1 log.go:184] http: TLS handshake error from 122.233.180.49:60889: tls: failed to verify client certificate: x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "KubeEdge")
W1121 14:01:38.980652 1 upstream.go:217] parse message: edd3df24-919c-4126-8237-bda3d79f39a7 resource type with error, message resource: node/arm5, err: resource type not found
I1121 14:01:38.980659 1 message_handler.go:122] edge node arm5 for project e632aba927ea4ac2b575ec1603d56f10 connected
I1121 14:01:38.980713 1 node_session.go:136] Start session for edge node arm5
I1121 14:01:38.991104 1 upstream.go:89] Dispatch message: b16b99fc-a986-4ab6-88b2-7d0d515a9d78
I1121 14:01:38.991114 1 upstream.go:96] Message: b16b99fc-a986-4ab6-88b2-7d0d515a9d78, resource type is: membership/detail
E1121 14:01:39.137759 1 upstream.go:838] create node arm5 error: nodes "arm5" already exists , register node failed
```
Please run `kubectl get pod xx -n xxx -o yaml`.
Before reconnect, is the pod Pending?
And I can confirm that the pod is recreated by the kube controller:
```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2022-11-21T09:23:14Z"
  generateName: mysql-arm-
  labels:
    app: mysql
    controller-revision-hash: 85b4c54547
    pod-template-generation: "3"
  name: mysql-arm-csdwv
  namespace: edge
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: mysql-arm
    uid: e52f4375-6b2d-4b4f-bb54-89538e5e6f97
  resourceVersion: "488616"
  uid: c434eb7d-b142-4e46-a0f0-e2c7b95a5eb4
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - arm5
  containers:
  - env:
    - name: MYSQL_ROOT_PASSWORD
      value: Minetec123!
    image: registry.mwpark.cn/thirdparty/library/mysql:8.0.31
    imagePullPolicy: IfNotPresent
    name: mysql
    ports:
    - containerPort: 3306
      hostPort: 3306
      name: mysql
      protocol: TCP
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/mysql
      name: mysql-persistent-storage
    - mountPath: /etc/localtime
      name: localtime
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-qkm2k
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  nodeName: arm5
  nodeSelector:
    mwpark.cn/mysql: arm
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    operator: Exists
  volumes:
  - hostPath:
      path: /var/lib/mysql
      type: ""
    name: mysql-persistent-storage
  - hostPath:
      path: /etc/localtime
      type: ""
    name: localtime
  - name: kube-api-access-qkm2k
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-11-21T09:23:14Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-11-21T09:24:37Z"
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-11-21T09:23:15Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-11-21T09:23:14Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://19d52085e775f72062de925c168864aa8b510fbd0e4b029e09918b65849f9d65
    image: mysql:8.0.31
    imageID: docker-pullable://mysql@sha256:96439dd0d8d085cd90c8001be2c9dde07b8a68b472bd20efcbe3df78cff66492
    lastState: {}
    name: mysql
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2022-11-21T09:23:15Z"
  hostIP: 10.211.55.9
  phase: Running
  podIP: 10.211.55.9
  podIPs:
  - ip: 10.211.55.9
  qosClass: BestEffort
  startTime: "2022-11-21T09:23:14Z"
```
I switched cloudcore and edgecore in this cluster to 1.12.0, and the problem still exists. But another cluster with cloudcore 1.12.0 is OK. What's the problem?
@wackxu I've tested in two clusters.
The k8s v1.22.10 cluster does not recreate the pod, with kubeedge:V1.21.1.
The k8s v1.22.12 cluster does recreate the pod, with kubeedge:V1.21.1.
Both clusters were created with KubeSphere. So what's the cause?
It is strange. Could you look at your kube-apiserver log to see which component deleted the old pod?
@wackxu Killed by the controller manager. Is that a new behavior in k8s v1.22.12?

cluster 1.22.12:

```
I1121 21:31:59.606167 1 event.go:291] "Event occurred" object="kubeedge1" kind="Node" apiVersion="v1" type="Normal" reason="NodeNotReady" message="Node kubeedge1 status is now: NodeNotReady"
I1121 21:31:59.615374 1 event.go:291] "Event occurred" object="edge/redis-3.2.9-bwhgw" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
I1121 21:33:37.703330 1 event.go:291] "Event occurred" object="edge/redis-3.2.9" kind="DaemonSet" apiVersion="apps/v1" type="Warning" reason="FailedDaemonPod" message="Found failed daemon pod edge/redis-3.2.9-bwhgw on node kubeedge1, will try to kill it"
I1121 21:33:37.708184 1 event.go:291] "Event occurred" object="edge/redis-3.2.9" kind="DaemonSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulDelete" message="Deleted pod: redis-3.2.9-bwhgw"
I1121 21:33:37.719145 1 event.go:291] "Event occurred" object="edge/redis-3.2.9" kind="DaemonSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: redis-3.2.9-kx4n7"
```

cluster 1.22.10:

```
I1121 21:39:22.965079 1 event.go:291] "Event occurred" object="kubeedge2" kind="Node" apiVersion="v1" type="Normal" reason="NodeNotReady" message="Node kubeedge2 status is now: NodeNotReady"
I1121 21:39:22.975212 1 event.go:291] "Event occurred" object="edge/redis-3.2.9-mdhzs" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
```

The latter does not kill and recreate the pod.
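The `FailedDaemonPod` event above is the interesting part: the DaemonSet controller deletes daemon pods that have entered the `Failed` phase so it can replace them, which matches the delete/create pair in the 1.22.12 log. If that is what happens here, the old pod should briefly show up as Failed; one way to check (namespace assumed):

```shell
# List daemon pods in the Failed phase; these are the ones the DaemonSet
# controller kills and recreates (standard kubectl field selector).
kubectl get pods -n edge --field-selector status.phase=Failed
```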
@wackxu I disconnected the cloudcore node from the 1.22.12 cluster, and the node-exporter daemonset pod on the cloudcore node is not killed. So it is weird: only the edge node has the killing problem.
controller manager log:

```
I1121 22:55:35.609053 1 event.go:291] "Event occurred" object="kubesphere-monitoring-system/node-exporter-h5g56" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
```
node-exporter.yaml:

```yaml
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: node-exporter
  namespace: kubesphere-monitoring-system
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: node-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 1.3.1
  annotations:
    deprecated.daemonset.template.generation: '1'
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: node-exporter
      app.kubernetes.io/part-of: kube-prometheus
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: node-exporter
        app.kubernetes.io/part-of: kube-prometheus
        app.kubernetes.io/version: 1.3.1
    spec:
      volumes:
      # ...
      containers:
      - args:
        # ...
        - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
        - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
        resources:
          limits:
            cpu: '1'
            memory: 500Mi
          requests:
            cpu: 102m
            memory: 180Mi
        volumeMounts:
        # ...
      - args:
        # ...
        - --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
```
I found a way to reproduce the problem: `systemctl stop edgecore`, wait until the node is NotReady, then `systemctl start edgecore`; the daemonset pod will then be killed and recreated. But if edgecore keeps running and I just use iptables to reject the cloudcore IP, wait until the node is NotReady, and then clean the iptables rules, the node reconnects to cloudcore and the daemonset pod is not killed.
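The two reproduction paths above, as shell steps (the cloudcore address below is an assumption; substitute your own):

```shell
# Path 1 -- pod gets killed and recreated: restart edgecore across a NotReady window.
systemctl stop edgecore
# ...wait until `kubectl get node` shows the edge node NotReady...
systemctl start edgecore

# Path 2 -- pod survives: keep edgecore running, block cloudcore at the network level.
CLOUDCORE_IP=192.0.2.10   # assumption: replace with the real cloudcore address
iptables -A OUTPUT -d "$CLOUDCORE_IP" -j REJECT
# ...wait until the node is NotReady, then restore connectivity...
iptables -D OUTPUT -d "$CLOUDCORE_IP" -j REJECT
```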
So something happens when edgecore restarts with this problem @wackxu
@shao77622 I will do some tests today to find out what happened.
What happened:
Make the edge node disconnected from cloudcore (by modifying the cloudcore IP address to a non-existent host in /etc/kubeedge/config/edgecore.yaml, then `systemctl restart edgecore`), then after a while reconnect. When the edge node connects to cloudcore, the pod is terminated and a pod with another name is scheduled.

What you expected to happen:
The pod is not terminated, as v1.12.0 does.

How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): 1.22.12
- KubeEdge version (e.g. `cloudcore --version` and `edgecore --version`): 1.12.1

Cloud nodes Environment:
- Hardware configuration (e.g. `lscpu`):
- OS (e.g. `cat /etc/os-release`):
- Kernel (e.g. `uname -a`):
- Go version (e.g. `go version`):
- Others:

Edge nodes Environment:
- edgecore version (e.g. `edgecore --version`):
- Hardware configuration (e.g. `lscpu`):
- OS (e.g. `cat /etc/os-release`):
- Kernel (e.g. `uname -a`):
- Go version (e.g. `go version`):
- Others: