kubeedge / kubeedge

Kubernetes Native Edge Computing Framework (project under CNCF)
https://kubeedge.io
Apache License 2.0

All the pods in an edge node are irregularly switching back and forth between CrashLoopBackOff and Running state. #5430

Open 120L020430 opened 6 months ago

120L020430 commented 6 months ago

What happened: There is a node called knode5 in my KubeEdge cluster. After restarting the edgecore service on the node, its pods run correctly for a short while (no exact duration). Then all of the pods start switching back and forth irregularly between the CrashLoopBackOff and Running states, as shown below.

$ kubectl get pods -A -owide | grep knode5
default       nginx-748c667d99-2x224                    1/1     Running            1879 (53s ago)     15d   10.244.152.231   knode5   <none>           <none>
kube-system   nvidia-device-plugin-daemonset-gg2h6      0/1     CrashLoopBackOff   6479 (48s ago)     27d   10.244.152.232   knode5   <none>           <none>
kubeedge      edgemesh-agent-cscj2                      0/1     CrashLoopBackOff   7183 (49s ago)     63d   192.168.31.206   knode5   <none>           <none>
openfaas-fn   shasum-5c9c4d9645-bcj7c                   1/1     Running            1710 (4m55s ago)   15d   10.244.152.229   knode5   <none>           <none>
sedna         lc-d9mg5                                  1/1     Running            7201 (89s ago)     44d   192.168.31.206   knode5   <none>           <none>

$ kubectl get pods -A -owide | grep knode5
default       nginx-748c667d99-2x224                    0/1     CrashLoopBackOff   1880 (71s ago)     15d   10.244.152.236   knode5   <none>           <none>
kube-system   nvidia-device-plugin-daemonset-gg2h6      1/1     Running            6481 (5m44s ago)   27d   10.244.152.234   knode5   <none>           <none>
kubeedge      edgemesh-agent-cscj2                      0/1     CrashLoopBackOff   7184 (2m6s ago)    63d   192.168.31.206   knode5   <none>           <none>
openfaas-fn   shasum-5c9c4d9645-bcj7c                   1/1     Running            1711 (5m43s ago)   15d   10.244.152.235   knode5   <none>           <none>
sedna         lc-d9mg5                                  1/1     Running            7202 (4m43s ago)   44d   192.168.31.206   knode5   <none>           <none>

What you expected to happen: All the pods on the node keep running correctly, at least for a long time.

How to reproduce it (as minimally and precisely as possible): Sorry, this problem first appeared about three weeks ago. Before then, there was nothing wrong with the pods on the node. The day after I fixed the nvidia device plugin, the node suddenly became stuck in this state. I tried deleting the nvidia device plugin, but the problem persisted, so I think it has nothing to do with the plugin, and I don't know how to reproduce it.

Anything else we need to know?: Some information about the node:

$ kubectl describe node knode5
Name:               knode5
Roles:              agent,edge
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=knode5
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/agent=
                    node-role.kubernetes.io/edge=
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 21 Dec 2023 11:09:38 +0800
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  knode5
  AcquireTime:     <unset>
  RenewTime:       Thu, 22 Feb 2024 21:05:27 +0800
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Thu, 22 Feb 2024 21:04:12 +0800   Thu, 22 Feb 2024 02:11:11 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Thu, 22 Feb 2024 21:04:12 +0800   Thu, 22 Feb 2024 02:11:11 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Thu, 22 Feb 2024 21:04:12 +0800   Thu, 22 Feb 2024 02:11:11 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Thu, 22 Feb 2024 21:04:12 +0800   Thu, 22 Feb 2024 02:11:11 +0800   EdgeReady                    edge is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  192.168.31.206
  Hostname:    knode5
Capacity:
  cpu:             6
  hugepages-1Gi:   0
  hugepages-2Mi:   0
  memory:          16299164Ki
  nvidia.com/gpu:  1
  pods:            110
Allocatable:
  cpu:             6
  hugepages-1Gi:   0
  hugepages-2Mi:   0
  memory:          16196764Ki
  nvidia.com/gpu:  0
  pods:            110
System Info:
  Machine ID:                 46e832bca62b4109a1efdbd59e8a1b53
  System UUID:                ca9d0dc1-6a1a-a98f-c091-244bfe7a503a
  Boot ID:                    1580e644-747e-4770-8747-ff2e3062521b
  Kernel Version:             6.5.0-15-generic
  OS Image:                   Ubuntu 22.04.3 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.26
  Kubelet Version:            v1.26.7-kubeedge-v1.15.0
  Kube-Proxy Version:         v0.0.0-master+$Format:%H$
PodCIDR:                      10.244.1.0/24
PodCIDRs:                     10.244.1.0/24
Non-terminated Pods:          (5 in total)
  Namespace                   Name                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                    ------------  ----------  ---------------  -------------  ---
  default                     nginx-748c667d99-2x224                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         15d
  kube-system                 nvidia-device-plugin-daemonset-gg2h6    0 (0%)        0 (0%)      0 (0%)           0 (0%)         27d
  kubeedge                    edgemesh-agent-cscj2                    100m (1%)     200m (3%)   128Mi (0%)       256Mi (1%)     63d
  openfaas-fn                 shasum-5c9c4d9645-bcj7c                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         15d
  sedna                       lc-d9mg5                                100m (1%)     0 (0%)      32Mi (0%)        128Mi (0%)     44d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                200m (3%)   200m (3%)
  memory             160Mi (1%)  384Mi (2%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
  nvidia.com/gpu     0           0
Events:
  Type    Reason          Age    From             Message
  ----    ------          ----   ----             -------
  Normal  RegisteredNode  49m    node-controller  Node knode5 event: Registered Node knode5 in Controller
  Normal  RegisteredNode  19m    node-controller  Node knode5 event: Registered Node knode5 in Controller
  Normal  RegisteredNode  9m42s  node-controller  Node knode5 event: Registered Node knode5 in Controller

pod nginx log:

$ kubectl logs nginx-748c667d99-2x224
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
/docker-entrypoint.sh: Sourcing /docker-entrypoint.d/15-local-resolvers.envsh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
2024/02/22 13:33:47 [notice] 1#1: using the "epoll" event method
2024/02/22 13:33:47 [notice] 1#1: nginx/1.25.4
2024/02/22 13:33:47 [notice] 1#1: built by gcc 12.2.0 (Debian 12.2.0-14) 
2024/02/22 13:33:47 [notice] 1#1: OS: Linux 6.5.0-15-generic
2024/02/22 13:33:47 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2024/02/22 13:33:47 [notice] 1#1: start worker processes
2024/02/22 13:33:47 [notice] 1#1: start worker process 29
2024/02/22 13:33:47 [notice] 1#1: start worker process 30
2024/02/22 13:33:47 [notice] 1#1: start worker process 31
2024/02/22 13:33:47 [notice] 1#1: start worker process 32
2024/02/22 13:33:47 [notice] 1#1: start worker process 33
2024/02/22 13:33:47 [notice] 1#1: start worker process 34
2024/02/22 13:33:48 [notice] 1#1: signal 3 (SIGQUIT) received, shutting down
2024/02/22 13:33:48 [notice] 29#29: gracefully shutting down
2024/02/22 13:33:48 [notice] 30#30: gracefully shutting down
2024/02/22 13:33:48 [notice] 31#31: gracefully shutting down
2024/02/22 13:33:48 [notice] 32#32: gracefully shutting down
2024/02/22 13:33:48 [notice] 30#30: exiting
2024/02/22 13:33:48 [notice] 31#31: exiting
2024/02/22 13:33:48 [notice] 32#32: exiting
2024/02/22 13:33:48 [notice] 29#29: exiting
2024/02/22 13:33:48 [notice] 31#31: exit
2024/02/22 13:33:48 [notice] 30#30: exit
2024/02/22 13:33:48 [notice] 32#32: exit
2024/02/22 13:33:48 [notice] 29#29: exit
2024/02/22 13:33:48 [notice] 33#33: gracefully shutting down
2024/02/22 13:33:48 [notice] 33#33: exiting
2024/02/22 13:33:48 [notice] 33#33: exit
2024/02/22 13:33:48 [notice] 34#34: gracefully shutting down
2024/02/22 13:33:48 [notice] 34#34: exiting
2024/02/22 13:33:48 [notice] 34#34: exit
2024/02/22 13:33:48 [notice] 1#1: signal 17 (SIGCHLD) received from 29
2024/02/22 13:33:48 [notice] 1#1: worker process 29 exited with code 0
2024/02/22 13:33:48 [notice] 1#1: signal 29 (SIGIO) received
2024/02/22 13:33:48 [notice] 1#1: signal 17 (SIGCHLD) received from 32
2024/02/22 13:33:48 [notice] 1#1: worker process 32 exited with code 0
2024/02/22 13:33:48 [notice] 1#1: signal 29 (SIGIO) received
2024/02/22 13:33:48 [notice] 1#1: signal 17 (SIGCHLD) received from 33
2024/02/22 13:33:48 [notice] 1#1: worker process 31 exited with code 0
2024/02/22 13:33:48 [notice] 1#1: worker process 33 exited with code 0
2024/02/22 13:33:48 [notice] 1#1: signal 29 (SIGIO) received
2024/02/22 13:33:48 [notice] 1#1: signal 17 (SIGCHLD) received from 30
2024/02/22 13:33:48 [notice] 1#1: worker process 30 exited with code 0
2024/02/22 13:33:48 [notice] 1#1: worker process 34 exited with code 0
2024/02/22 13:33:48 [notice] 1#1: exit

pod lc log:

$ kubectl logs lc-d9mg5 -n sedna
I0222 13:35:24.937058       1 server.go:120] manager dataset is started
I0222 13:35:24.937087       1 server.go:120] manager model is started
I0222 13:35:24.937089       1 websocket.go:176] client starts to connect global manager(address: gm.sedna:9000)
I0222 13:35:24.937092       1 server.go:120] manager jointinferenceservice is started
I0222 13:35:24.937106       1 server.go:120] manager federatedlearningjob is started
I0222 13:35:24.937113       1 server.go:120] manager incrementallearningjob is started
I0222 13:35:24.937117       1 server.go:120] manager lifelonglearningjob is started
I0222 13:35:24.937157       1 server.go:140] server binds port 9100 successfully
E0222 13:35:24.937916       1 websocket.go:192] client tries to connect global manager(address: gm.sedna:9000) failed, error: dial tcp: lookup gm.sedna on 169.254.96.16:53: read udp 169.254.96.16:59934->169.254.96.16:53: read: connection refused
E0222 13:35:29.939387       1 websocket.go:192] client tries to connect global manager(address: gm.sedna:9000) failed, error: dial tcp: lookup gm.sedna on 169.254.96.16:53: read udp 169.254.96.16:33571->169.254.96.16:53: read: connection refused
E0222 13:35:34.940659       1 websocket.go:192] client tries to connect global manager(address: gm.sedna:9000) failed, error: dial tcp: lookup gm.sedna on 169.254.96.16:53: read udp 169.254.96.16:41056->169.254.96.16:53: read: connection refused
E0222 13:35:39.941808       1 websocket.go:192] client tries to connect global manager(address: gm.sedna:9000) failed, error: dial tcp: lookup gm.sedna on 169.254.96.16:53: read udp 169.254.96.16:45266->169.254.96.16:53: read: connection refused
E0222 13:35:44.943036       1 websocket.go:192] client tries to connect global manager(address: gm.sedna:9000) failed, error: dial tcp: lookup gm.sedna on 169.254.96.16:53: read udp 169.254.96.16:54967->169.254.96.16:53: read: connection refused
E0222 13:35:49.944201       1 websocket.go:200] max retry count reached when connecting global manager(address: gm.sedna:9000)
I0222 13:35:49.944211       1 websocket.go:176] client starts to connect global manager(address: gm.sedna:9000)
E0222 13:35:49.945005       1 websocket.go:192] client tries to connect global manager(address: gm.sedna:9000) failed, error: dial tcp: lookup gm.sedna on 169.254.96.16:53: read udp 169.254.96.16:44880->169.254.96.16:53: read: connection refused
E0222 13:35:54.945761       1 websocket.go:192] client tries to connect global manager(address: gm.sedna:9000) failed, error: dial tcp: lookup gm.sedna on 169.254.96.16:53: read udp 169.254.96.16:43091->169.254.96.16:53: read: connection refused
E0222 13:35:59.946929       1 websocket.go:192] client tries to connect global manager(address: gm.sedna:9000) failed, error: dial tcp: lookup gm.sedna on 169.254.96.16:53: read udp 169.254.96.16:37669->169.254.96.16:53: read: connection refused
E0222 13:36:04.948019       1 websocket.go:192] client tries to connect global manager(address: gm.sedna:9000) failed, error: dial tcp: lookup gm.sedna on 169.254.96.16:53: read udp 169.254.96.16:50839->169.254.96.16:53: read: connection refused
E0222 13:36:09.949725       1 websocket.go:192] client tries to connect global manager(address: gm.sedna:9000) failed, error: dial tcp: lookup gm.sedna on 169.254.96.16:53: read udp 169.254.96.16:47208->169.254.96.16:53: read: connection refused
E0222 13:36:14.950261       1 websocket.go:200] max retry count reached when connecting global manager(address: gm.sedna:9000)
I0222 13:36:14.950273       1 websocket.go:176] client starts to connect global manager(address: gm.sedna:9000)
E0222 13:36:14.951057       1 websocket.go:192] client tries to connect global manager(address: gm.sedna:9000) failed, error: dial tcp: lookup gm.sedna on 169.254.96.16:53: read udp 169.254.96.16:51773->169.254.96.16:53: read: connection refused
E0222 13:36:19.951801       1 websocket.go:192] client tries to connect global manager(address: gm.sedna:9000) failed, error: dial tcp: lookup gm.sedna on 169.254.96.16:53: read udp 169.254.96.16:39107->169.254.96.16:53: read: connection refused
E0222 13:36:24.953120       1 websocket.go:192] client tries to connect global manager(address: gm.sedna:9000) failed, error: dial tcp: lookup gm.sedna on 169.254.96.16:53: read udp 169.254.96.16:37646->169.254.96.16:53: read: connection refused
E0222 13:36:29.954057       1 websocket.go:192] client tries to connect global manager(address: gm.sedna:9000) failed, error: dial tcp: lookup gm.sedna on 169.254.96.16:53: read udp 169.254.96.16:42441->169.254.96.16:53: read: connection refused

pod edgemesh log:

$ kubectl logs edgemesh-agent-cscj2 -n kubeedge
I0222 21:38:49.947664       1 server.go:56] Version: v1.14.0-dirty
I0222 21:38:49.947708       1 server.go:92] [1] Prepare agent to run
I0222 21:38:49.948091       1 netif.go:96] bridge device edgemesh0 already exists
I0222 21:38:49.948144       1 server.go:96] edgemesh-agent running on EdgeMode
I0222 21:38:49.948154       1 server.go:99] [2] New clients
I0222 21:38:49.948722       1 server.go:106] [3] Register beehive modules
I0222 21:38:49.949052       1 module.go:34] Module EdgeDNS registered successfully
I0222 21:38:49.949261       1 server.go:68] Using userspace Proxier.
I0222 21:38:50.350620       1 module.go:34] Module EdgeProxy registered successfully
I0222 21:38:50.351925       1 util.go:273] Listening to docker0 is meaningless, skip it.
I0222 21:38:50.351930       1 util.go:273] Listening to cni0 is meaningless, skip it.
I0222 21:38:50.351933       1 util.go:273] Listening to edgemesh0 is meaningless, skip it.
I0222 21:38:50.352131       1 module.go:126] Configure libp2p.ForceReachabilityPrivate()
I0222 21:38:50.445409       1 module.go:181] I'm {12D3KooWRWKXP1pskdJEBoCuZqUvYebJ58KvWuvQ4aPf2FpEp3Bs: [/ip4/127.0.0.1/tcp/20006 /ip4/192.168.31.206/tcp/20006]}
I0222 21:38:50.543584       1 module.go:203] Bootstrapping the DHT
I0222 21:38:50.543643       1 tunnel.go:435] [Bootstrap] bootstrapping to 12D3KooWNRVRy1v8Lqb5nGYsVZnyDj5x6q8dsPA8eLofzaGPH9Yd
I0222 21:38:50.672664       1 tunnel.go:445] [Bootstrap] success bootstrapped with {12D3KooWNRVRy1v8Lqb5nGYsVZnyDj5x6q8dsPA8eLofzaGPH9Yd: [/ip4/192.168.31.46/tcp/20006]}
I0222 21:38:50.673129       1 tunnel.go:80] Starting MDNS discovery service
I0222 21:38:50.673136       1 tunnel.go:93] Starting DHT discovery service
I0222 21:38:50.673205       1 module.go:34] Module EdgeTunnel registered successfully
I0222 21:38:50.673211       1 server.go:112] [4] Cache beehive modules
I0222 21:38:50.673222       1 server.go:119] [5] Start all modules
I0222 21:38:50.673254       1 core.go:24] Starting module EdgeDNS
I0222 21:38:50.673269       1 core.go:24] Starting module EdgeProxy
I0222 21:38:50.673291       1 core.go:24] Starting module EdgeTunnel
I0222 21:38:50.674578       1 tunnel.go:493] Starting relay finder
I0222 21:38:50.674704       1 dns.go:38] Runs CoreDNS v1.8.0 as a local dns
I0222 21:38:50.675081       1 config.go:317] "Starting service config controller"
I0222 21:38:50.675093       1 shared_informer.go:240] Waiting for caches to sync for service config
I0222 21:38:50.675104       1 config.go:135] "Starting endpoints config controller"
I0222 21:38:50.675109       1 shared_informer.go:240] Waiting for caches to sync for endpoints config
I0222 21:38:50.675575       1 loadbalancer.go:239] "Starting loadBalancer destinationRule controller"
I0222 21:38:50.675582       1 shared_informer.go:240] Waiting for caches to sync for loadBalancer destinationRule
I0222 21:38:50.744096       1 tunnel.go:126] [MDNS] Discovery found peer: {12D3KooWNRVRy1v8Lqb5nGYsVZnyDj5x6q8dsPA8eLofzaGPH9Yd: [/ip4/127.0.0.1/tcp/20006 /ip4/192.168.31.46/tcp/20006]}
I0222 21:38:50.745303       1 tunnel.go:205] Discovery service got a new stream from {12D3KooWNRVRy1v8Lqb5nGYsVZnyDj5x6q8dsPA8eLofzaGPH9Yd: [/ip4/192.168.31.46/tcp/20006]}
I0222 21:38:50.752587       1 tunnel.go:126] [DHT] Discovery found peer: {12D3KooWNRVRy1v8Lqb5nGYsVZnyDj5x6q8dsPA8eLofzaGPH9Yd: [/ip4/192.168.31.46/tcp/20006 /ip4/127.0.0.1/tcp/20006]}
I0222 21:38:50.752636       1 tunnel.go:156] [DHT] New stream between peer {12D3KooWNRVRy1v8Lqb5nGYsVZnyDj5x6q8dsPA8eLofzaGPH9Yd: [/ip4/192.168.31.46/tcp/20006 /ip4/127.0.0.1/tcp/20006]} success
I0222 21:38:50.752826       1 tunnel.go:234] [MDNS] Discovery from master : {12D3KooWNRVRy1v8Lqb5nGYsVZnyDj5x6q8dsPA8eLofzaGPH9Yd: [/ip4/192.168.31.46/tcp/20006]}
I0222 21:38:50.754545       1 tunnel.go:192] [DHT] Discovery to master : {12D3KooWNRVRy1v8Lqb5nGYsVZnyDj5x6q8dsPA8eLofzaGPH9Yd: [/ip4/192.168.31.46/tcp/20006 /ip4/127.0.0.1/tcp/20006]}
I0222 21:38:50.775152       1 shared_informer.go:247] Caches are synced for endpoints config 
I0222 21:38:50.775183       1 shared_informer.go:247] Caches are synced for service config 
I0222 21:38:50.843578       1 tunnel.go:156] [MDNS] New stream between peer {12D3KooWNRVRy1v8Lqb5nGYsVZnyDj5x6q8dsPA8eLofzaGPH9Yd: [/ip4/127.0.0.1/tcp/20006 /ip4/192.168.31.46/tcp/20006]} success
.:53 on 169.254.96.16
I0222 21:38:50.843801       1 log.go:184] [INFO] plugin/reload: Running configuration MD5 = 4ba233d14e39c5d57b4b7aa950136711
I0222 21:38:50.843804       1 shared_informer.go:247] Caches are synced for loadBalancer destinationRule 
I0222 21:38:50.854910       1 tunnel.go:192] [MDNS] Discovery to master : {12D3KooWNRVRy1v8Lqb5nGYsVZnyDj5x6q8dsPA8eLofzaGPH9Yd: [/ip4/127.0.0.1/tcp/20006 /ip4/192.168.31.46/tcp/20006]}
I0222 21:38:50.906164       1 log.go:184] [INFO] 169.254.96.16:49362 - 48306 "HINFO IN 7830683134293684718.6093613845066595465. udp 57 false 512" NXDOMAIN qr,rd,ra 132 0.06222017s
I0222 21:38:52.354059       1 proxier.go:895] "Opened iptables from-containers public port for service" servicePortName="openfaas/gateway-external:http" protocol=TCP nodePort=31112
I0222 21:38:52.357419       1 proxier.go:906] "Opened iptables from-host public port for service" servicePortName="openfaas/gateway-external:http" protocol=TCP nodePort=31112
I0222 21:38:52.444091       1 proxier.go:916] "Opened iptables from-non-local public port for service" servicePortName="openfaas/gateway-external:http" protocol=TCP nodePort=31112
I0222 21:38:52.454531       1 proxier.go:895] "Opened iptables from-containers public port for service" servicePortName="default/nginx" protocol=TCP nodePort=30367
I0222 21:38:52.457919       1 proxier.go:906] "Opened iptables from-host public port for service" servicePortName="default/nginx" protocol=TCP nodePort=30367
I0222 21:38:52.461636       1 proxier.go:916] "Opened iptables from-non-local public port for service" servicePortName="default/nginx" protocol=TCP nodePort=30367
I0222 21:39:50.674900       1 tunnel.go:502] [Finder] send relayMap peer: {12D3KooWNRVRy1v8Lqb5nGYsVZnyDj5x6q8dsPA8eLofzaGPH9Yd: [/ip4/192.168.31.46/tcp/20006]}
I0222 21:39:55.843576       1 core.go:35] Get os signal terminated
I0222 21:39:55.843591       1 core.go:40] Cleanup module EdgeDNS
I0222 21:39:55.864520       1 core.go:40] Cleanup module EdgeProxy
I0222 21:39:55.884927       1 core.go:40] Cleanup module EdgeTunnel
I0222 21:39:55.905303       1 module.go:36] Shutdown module EdgeTunnel
E0222 21:39:55.905315       1 tunnel.go:534] [Finder] causes an error timed out waiting for the condition
E0222 21:39:55.905323       1 tunnel.go:564] [Heartbeat] causes an error timed out waiting for the condition
I0222 21:39:55.905328       1 module.go:36] Shutdown module EdgeProxy
I0222 21:39:55.905331       1 module.go:36] Shutdown module EdgeDNS
I0222 21:39:56.276838       1 server.go:122] edgemesh-agent exited

latest 40 lines of the edgecore log:

$ tail -n 40 edgecorelog 
2月 22 21:47:35 knode5 edgecore[2686785]: I0222 21:47:35.850431 2686785 scope.go:115] "RemoveContainer" containerID="bb6cf36490e2733192d0464d60fbb24b823f0aeca81c654f5c2b3c3607b65882"
2月 22 21:47:35 knode5 edgecore[2686785]: E0222 21:47:35.850581 2686785 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"edgemesh-agent\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=edgemesh-agent pod=edgemesh-agent-cscj2_kubeedge(e0858a46-1a0d-48e0-aae2-e50c6fe72db8)\"" pod="kubeedge/edgemesh-agent-cscj2" podUID=e0858a46-1a0d-48e0-aae2-e50c6fe72db8
2月 22 21:47:35 knode5 edgecore[2686785]: I0222 21:47:35.928452 2686785 scope.go:115] "RemoveContainer" containerID="f8124104967f1949397e23ef5e5f7c1ca35e3552821fa66250b7863f15c84970"
2月 22 21:47:35 knode5 edgecore[2686785]: E0222 21:47:35.928613 2686785 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"lc\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=lc pod=lc-d9mg5_sedna(4d411b0e-6885-4189-9def-c270938e084b)\"" pod="sedna/lc-d9mg5" podUID=4d411b0e-6885-4189-9def-c270938e084b
2月 22 21:47:36 knode5 edgecore[2686785]: I0222 21:47:36.814898 2686785 scope.go:115] "RemoveContainer" containerID="583d6e17dd5fd3f597593caa2f4a095c43441c913a5971ff04d941f4f720d246"
2月 22 21:47:36 knode5 edgecore[2686785]: E0222 21:47:36.815019 2686785 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"nginx\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=nginx pod=nginx-748c667d99-2x224_default(94a4216f-f1e5-4f68-858d-004d754d29f9)\"" pod="default/nginx-748c667d99-2x224" podUID=94a4216f-f1e5-4f68-858d-004d754d29f9
2月 22 21:47:37 knode5 edgecore[2686785]: I0222 21:47:37.796514 2686785 scope.go:115] "RemoveContainer" containerID="3233264ff85a2d9cd610a0df7f31bd0512552d7bccad022099c3f13d4a0bf0d2"
2月 22 21:47:37 knode5 edgecore[2686785]: E0222 21:47:37.796697 2686785 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"shasum\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=shasum pod=shasum-5c9c4d9645-bcj7c_openfaas-fn(b5d9cb95-034b-4eb5-ad19-5f972be07257)\"" pod="openfaas-fn/shasum-5c9c4d9645-bcj7c" podUID=b5d9cb95-034b-4eb5-ad19-5f972be07257
2月 22 21:47:41 knode5 edgecore[2686785]: E0222 21:47:41.900942 2686785 cri_stats_provider.go:455] "Failed to get the info of the filesystem with mountpoint" err="cannot find filesystem info for device \"/dev/nvme0n1p5\"" mountpoint="/etc/containerd/containerd/io.containerd.snapshotter.v1.overlayfs"
2月 22 21:47:47 knode5 edgecore[2686785]: I0222 21:47:47.794922 2686785 scope.go:115] "RemoveContainer" containerID="1fe7a897bb508861205ce1cd3fd3fe0fae5f4ac3e9e579b61ed0bcac80028fbd"
2月 22 21:47:47 knode5 edgecore[2686785]: E0222 21:47:47.795044 2686785 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"nvidia-device-plugin-ctr\" with CrashLoopBackOff: \"back-off 1m20s restarting failed container=nvidia-device-plugin-ctr pod=nvidia-device-plugin-daemonset-7z8lw_kube-system(9e3763fb-f4ef-4fd8-9079-4f37067fee6b)\"" pod="kube-system/nvidia-device-plugin-daemonset-7z8lw" podUID=9e3763fb-f4ef-4fd8-9079-4f37067fee6b
2月 22 21:47:49 knode5 edgecore[2686785]: I0222 21:47:49.814475 2686785 scope.go:115] "RemoveContainer" containerID="f8124104967f1949397e23ef5e5f7c1ca35e3552821fa66250b7863f15c84970"
2月 22 21:47:49 knode5 edgecore[2686785]: E0222 21:47:49.814608 2686785 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"lc\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=lc pod=lc-d9mg5_sedna(4d411b0e-6885-4189-9def-c270938e084b)\"" pod="sedna/lc-d9mg5" podUID=4d411b0e-6885-4189-9def-c270938e084b
2月 22 21:47:51 knode5 edgecore[2686785]: I0222 21:47:51.811161 2686785 scope.go:115] "RemoveContainer" containerID="583d6e17dd5fd3f597593caa2f4a095c43441c913a5971ff04d941f4f720d246"
2月 22 21:47:51 knode5 edgecore[2686785]: E0222 21:47:51.811266 2686785 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"nginx\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=nginx pod=nginx-748c667d99-2x224_default(94a4216f-f1e5-4f68-858d-004d754d29f9)\"" pod="default/nginx-748c667d99-2x224" podUID=94a4216f-f1e5-4f68-858d-004d754d29f9
2月 22 21:47:51 knode5 edgecore[2686785]: E0222 21:47:51.902179 2686785 cri_stats_provider.go:455] "Failed to get the info of the filesystem with mountpoint" err="cannot find filesystem info for device \"/dev/nvme0n1p5\"" mountpoint="/etc/containerd/containerd/io.containerd.snapshotter.v1.overlayfs"
2月 22 21:47:52 knode5 edgecore[2686785]: I0222 21:47:52.154656 2686785 scope.go:115] "RemoveContainer" containerID="bb6cf36490e2733192d0464d60fbb24b823f0aeca81c654f5c2b3c3607b65882"
2月 22 21:47:52 knode5 edgecore[2686785]: E0222 21:47:52.154809 2686785 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"edgemesh-agent\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=edgemesh-agent pod=edgemesh-agent-cscj2_kubeedge(e0858a46-1a0d-48e0-aae2-e50c6fe72db8)\"" pod="kubeedge/edgemesh-agent-cscj2" podUID=e0858a46-1a0d-48e0-aae2-e50c6fe72db8
2月 22 21:47:52 knode5 edgecore[2686785]: I0222 21:47:52.814945 2686785 scope.go:115] "RemoveContainer" containerID="3233264ff85a2d9cd610a0df7f31bd0512552d7bccad022099c3f13d4a0bf0d2"
2月 22 21:47:52 knode5 edgecore[2686785]: E0222 21:47:52.815095 2686785 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"shasum\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=shasum pod=shasum-5c9c4d9645-bcj7c_openfaas-fn(b5d9cb95-034b-4eb5-ad19-5f972be07257)\"" pod="openfaas-fn/shasum-5c9c4d9645-bcj7c" podUID=b5d9cb95-034b-4eb5-ad19-5f972be07257
2月 22 21:48:01 knode5 edgecore[2686785]: I0222 21:48:01.282474 2686785 scope.go:115] "RemoveContainer" containerID="f8124104967f1949397e23ef5e5f7c1ca35e3552821fa66250b7863f15c84970"
2月 22 21:48:01 knode5 edgecore[2686785]: E0222 21:48:01.282608 2686785 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"lc\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=lc pod=lc-d9mg5_sedna(4d411b0e-6885-4189-9def-c270938e084b)\"" pod="sedna/lc-d9mg5" podUID=4d411b0e-6885-4189-9def-c270938e084b
2月 22 21:48:01 knode5 edgecore[2686785]: E0222 21:48:01.903025 2686785 cri_stats_provider.go:455] "Failed to get the info of the filesystem with mountpoint" err="cannot find filesystem info for device \"/dev/nvme0n1p5\"" mountpoint="/etc/containerd/containerd/io.containerd.snapshotter.v1.overlayfs"
2月 22 21:48:02 knode5 edgecore[2686785]: I0222 21:48:02.789774 2686785 scope.go:115] "RemoveContainer" containerID="bb6cf36490e2733192d0464d60fbb24b823f0aeca81c654f5c2b3c3607b65882"
2月 22 21:48:02 knode5 edgecore[2686785]: E0222 21:48:02.789933 2686785 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"edgemesh-agent\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=edgemesh-agent pod=edgemesh-agent-cscj2_kubeedge(e0858a46-1a0d-48e0-aae2-e50c6fe72db8)\"" pod="kubeedge/edgemesh-agent-cscj2" podUID=e0858a46-1a0d-48e0-aae2-e50c6fe72db8
2月 22 21:48:02 knode5 edgecore[2686785]: I0222 21:48:02.887019 2686785 scope.go:115] "RemoveContainer" containerID="1fe7a897bb508861205ce1cd3fd3fe0fae5f4ac3e9e579b61ed0bcac80028fbd"
2月 22 21:48:03 knode5 edgecore[2686785]: I0222 21:48:03.036496 2686785 server.go:144] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
2月 22 21:48:07 knode5 edgecore[2686785]: I0222 21:48:07.380859 2686785 scope.go:115] "RemoveContainer" containerID="583d6e17dd5fd3f597593caa2f4a095c43441c913a5971ff04d941f4f720d246"
2月 22 21:48:07 knode5 edgecore[2686785]: E0222 21:48:07.380962 2686785 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"nginx\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=nginx pod=nginx-748c667d99-2x224_default(94a4216f-f1e5-4f68-858d-004d754d29f9)\"" pod="default/nginx-748c667d99-2x224" podUID=94a4216f-f1e5-4f68-858d-004d754d29f9
2月 22 21:48:07 knode5 edgecore[2686785]: I0222 21:48:07.794251 2686785 scope.go:115] "RemoveContainer" containerID="3233264ff85a2d9cd610a0df7f31bd0512552d7bccad022099c3f13d4a0bf0d2"
2月 22 21:48:07 knode5 edgecore[2686785]: E0222 21:48:07.794390 2686785 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"shasum\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=shasum pod=shasum-5c9c4d9645-bcj7c_openfaas-fn(b5d9cb95-034b-4eb5-ad19-5f972be07257)\"" pod="openfaas-fn/shasum-5c9c4d9645-bcj7c" podUID=b5d9cb95-034b-4eb5-ad19-5f972be07257
2月 22 21:48:11 knode5 edgecore[2686785]: E0222 21:48:11.904250 2686785 cri_stats_provider.go:455] "Failed to get the info of the filesystem with mountpoint" err="cannot find filesystem info for device \"/dev/nvme0n1p5\"" mountpoint="/etc/containerd/containerd/io.containerd.snapshotter.v1.overlayfs"
2月 22 21:48:14 knode5 edgecore[2686785]: I0222 21:48:14.674498 2686785 scope.go:115] "RemoveContainer" containerID="f8124104967f1949397e23ef5e5f7c1ca35e3552821fa66250b7863f15c84970"
2月 22 21:48:16 knode5 edgecore[2686785]: I0222 21:48:16.803563 2686785 scope.go:115] "RemoveContainer" containerID="bb6cf36490e2733192d0464d60fbb24b823f0aeca81c654f5c2b3c3607b65882"
2月 22 21:48:16 knode5 edgecore[2686785]: E0222 21:48:16.803763 2686785 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"edgemesh-agent\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=edgemesh-agent pod=edgemesh-agent-cscj2_kubeedge(e0858a46-1a0d-48e0-aae2-e50c6fe72db8)\"" pod="kubeedge/edgemesh-agent-cscj2" podUID=e0858a46-1a0d-48e0-aae2-e50c6fe72db8
2月 22 21:48:21 knode5 edgecore[2686785]: I0222 21:48:21.806974 2686785 scope.go:115] "RemoveContainer" containerID="3233264ff85a2d9cd610a0df7f31bd0512552d7bccad022099c3f13d4a0bf0d2"
2月 22 21:48:21 knode5 edgecore[2686785]: E0222 21:48:21.807134 2686785 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"shasum\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=shasum pod=shasum-5c9c4d9645-bcj7c_openfaas-fn(b5d9cb95-034b-4eb5-ad19-5f972be07257)\"" pod="openfaas-fn/shasum-5c9c4d9645-bcj7c" podUID=b5d9cb95-034b-4eb5-ad19-5f972be07257
2月 22 21:48:21 knode5 edgecore[2686785]: E0222 21:48:21.905422 2686785 cri_stats_provider.go:455] "Failed to get the info of the filesystem with mountpoint" err="cannot find filesystem info for device \"/dev/nvme0n1p5\"" mountpoint="/etc/containerd/containerd/io.containerd.snapshotter.v1.overlayfs"
2月 22 21:48:22 knode5 edgecore[2686785]: I0222 21:48:22.803388 2686785 scope.go:115] "RemoveContainer" containerID="583d6e17dd5fd3f597593caa2f4a095c43441c913a5971ff04d941f4f720d246"
2月 22 21:48:22 knode5 edgecore[2686785]: E0222 21:48:22.803503 2686785 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"nginx\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=nginx pod=nginx-748c667d99-2x224_default(94a4216f-f1e5-4f68-858d-004d754d29f9)\"" pod="default/nginx-748c667d99-2x224" podUID=94a4216f-f1e5-4f68-858d-004d754d29f9

I tried to fix the error "Failed to get the info of the filesystem with mountpoint" err="cannot find filesystem info for device \"/dev/nvme0n1p5\"" mountpoint="/etc/containerd/containerd/io.containerd.snapshotter.v1.overlayfs". I changed the storage location of containerd, but the error is still reported no matter whether that location is on a different partition of the solid-state drive or on the mechanical hard drive. I don't know whether this error matters.
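
For reference, the mountpoint in that error message is the overlayfs snapshotter directory under containerd's root path, which is configured by the top-level root key in /etc/containerd/config.toml (this assumes a default-style containerd setup; paths may differ). A quick way to check where it points and which device backs it:

$ grep -E '^(root|state)' /etc/containerd/config.toml
# root  = directory holding persistent containerd data, including io.containerd.snapshotter.v1.overlayfs
# state = directory holding runtime state
$ df -h /etc/containerd/containerd/io.containerd.snapshotter.v1.overlayfs
# shows which device/filesystem actually backs the snapshotter directory

Environment: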

WillardHu commented 6 months ago

It may be that edgemesh's iptables rules are in a confused state. If you can, try reinstalling edgemesh.
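
Before reinstalling, it may also help to snapshot the current rules so you can compare before and after (a generic sketch, not an edgemesh-specific procedure):

$ sudo iptables-save > /tmp/iptables-before-reinstall.txt
# keep a copy of all current rules for comparison
$ sudo iptables-save | grep -iE 'edge|portal|nodeport'
# look for leftover proxy-related chains or rules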

120L020430 commented 6 months ago

Ok. I will try it. Thanks.

120L020430 commented 6 months ago

> It may be that edgemesh's iptables rules are in a confused state. If you can, try reinstalling edgemesh.

Thank you again. I uninstalled edgemesh, cleared the iptables rules, and reinstalled edgemesh, but the problem happened again. One edgecore log line caught my eye: "nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown". I changed the default runtime of containerd from nvidia to runc, and luckily that worked.
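
For anyone who wants to make the same change, this is roughly where containerd 1.6 selects its default runtime (an assumption about the CRI plugin config layout; check it against your own /etc/containerd/config.toml):

$ grep -n 'default_runtime_name' /etc/containerd/config.toml
# under [plugins."io.containerd.grpc.v1.cri".containerd], change
#   default_runtime_name = "nvidia"
# to
#   default_runtime_name = "runc"
$ sudo systemctl restart containerd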

Later, I ran nvidia-smi and got "Failed to initialize NVML: Driver/library version mismatch", so my GPU driver is broken. But I remember that several days ago nvidia-smi still produced correct results while this issue was already happening, so I still don't know whether the crashes are caused by the broken driver or by a wrong containerd runtime configuration.
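
In case it helps others who hit the same mismatch, the loaded kernel driver and the installed userspace libraries can be compared like this (package names assume an Ubuntu/Debian install):

$ cat /proc/driver/nvidia/version
# version of the NVIDIA kernel driver currently loaded
$ dpkg -l | grep -i nvidia
# versions of the installed NVIDIA userspace packages
# A reboot (or reloading the nvidia kernel modules) after a driver upgrade usually clears the mismatch.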

I will keep investigating.

federicocagnola commented 6 months ago

I seem to have encountered the same problem, but I have a feeling it is not related to edgemesh as my deployment does not include it.

Instead, I noticed this issue today while trying to bump Ubuntu on the edge node from 20.04 to 22.04. I never had any such issue with Ubuntu 20.04 LTS, but started seeing the exact same behavior when I tried the newer Ubuntu on the edge node. I see @120L020430 also has Ubuntu 22.04 on the edge node.

I have tried deploying several pods (a redis instance, an mqtt instance, and a custom python image) and they all experience the same problem. The code works fine and the logs look great, then the container receives a shutdown signal and stops its processes. At the same time, edgecore logs that it failed to \"StartContainer\" for \"redis\" with CrashLoopBackOff, as if the process inside the container had never been started successfully. I wonder if this could be related to the OS, or to the way kubeedge determines whether a container was started successfully.
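
One way to see how the runtime thinks the container died (exit code, and whether it terminated on a signal) is to check its last state, for example:

$ kubectl describe pod redis-edgeNode-65dd65d6c7-lkjdh -n edge-namespace | grep -A 7 'Last State'
# shows the previous container's Reason, Exit Code and timestamps
$ crictl ps -a
# on the edge node: recent containers with their exit states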

Mar 08 14:42:07 edgeNode edgecore[53223]: E0308 14:42:07.273208   53223 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"core\" with CrashLoopBackOff: \"back-off 1m20s restarting failed container=core pod=core-edgeNode-685fb7b7fc-mjscv_edge-namespace(ac734606-ea10-4c75-bf4a-b7b92674c62d)\"" pod="edge-namespace/core-edgeNode-685fb7b7fc-mjscv" podUID="ac734606-ea10-4c75-bf4a-b7b92674c62d"
Mar 08 14:42:21 edgeNode edgecore[53223]: I0308 14:42:21.259117   53223 scope.go:115] "RemoveContainer" containerID="eb7c16e8b4df87b4f2e068b2358b41e651648e7e4174194fea46c58812f1c7db"
Mar 08 14:42:50 edgeNode edgecore[53223]: I0308 14:42:50.656521   53223 scope.go:115] "RemoveContainer" containerID="ed43fab20a36292da72be8ff08dbb56ba9f095344dd63d382e625356fe39cbcd"
Mar 08 14:42:50 edgeNode edgecore[53223]: I0308 14:42:50.656655   53223 pod_container_deletor.go:80] "Container not found in pod's containers" containerID="e7c28857cfa8e86ec4c4bfc28cf0329618a23924f8eac7e649480dd1def8cf0b"
Mar 08 14:42:51 edgeNode edgecore[53223]: E0308 14:42:51.054033   53223 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"redis\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=redis pod=redis-edgeNode-65dd65d6c7-lkjdh_edge-namespace(525c9cf6-ed98-464a-b4c5-bda0dee8a1e6)\"" pod="edge-namespace/redis-edgeNode-65dd65d6c7-lkjdh" podUID="525c9cf6-ed98-464a-b4c5-bda0dee8a1e6"
Mar 08 14:42:51 edgeNode edgecore[53223]: I0308 14:42:51.802070   53223 scope.go:115] "RemoveContainer" containerID="eb23b5df38ccf5f4dced0454f79c2bfb3bc3f39bdd7a43166d395c687ae54c3f"
Mar 08 14:42:51 edgeNode edgecore[53223]: E0308 14:42:51.802629   53223 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"redis\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=redis pod=redis-edgeNode-65dd65d6c7-lkjdh_edge-namespace(525c9cf6-ed98-464a-b4c5-bda0dee8a1e6)\"" pod="edge-namespace/redis-edgeNode-65dd65d6c7-lkjdh" podUID="525c9cf6-ed98-464a-b4c5-bda0dee8a1e6"
Mar 08 14:43:05 edgeNode edgecore[53223]: I0308 14:43:05.265078   53223 scope.go:115] "RemoveContainer" containerID="eb23b5df38ccf5f4dced0454f79c2bfb3bc3f39bdd7a43166d395c687ae54c3f"
Mar 08 14:43:05 edgeNode edgecore[53223]: E0308 14:43:05.265541   53223 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"redis\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=redis pod=redis-edgeNode-65dd65d6c7-lkjdh_edge-namespace(525c9cf6-ed98-464a-b4c5-bda0dee8a1e6)\"" pod="edge-namespace/redis-edgeNode-65dd65d6c7-lkjdh" podUID="525c9cf6-ed98-464a-b4c5-bda0dee8a1e6"
Mar 08 14:43:17 edgeNode edgecore[53223]: I0308 14:43:17.277374   53223 scope.go:115] "RemoveContainer" containerID="eb23b5df38ccf5f4dced0454f79c2bfb3bc3f39bdd7a43166d395c687ae54c3f"
Mar 08 14:43:17 edgeNode edgecore[53223]: E0308 14:43:17.277870   53223 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"redis\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=redis pod=redis-edgeNode-65dd65d6c7-lkjdh_edge-namespace(525c9cf6-ed98-464a-b4c5-bda0dee8a1e6)\"" pod="edge-namespace/redis-edgeNode-65dd65d6c7-lkjdh" podUID="525c9cf6-ed98-464a-b4c5-bda0dee8a1e6"
Mar 08 14:43:32 edgeNode edgecore[53223]: I0308 14:43:32.407199   53223 scope.go:115] "RemoveContainer" containerID="eb23b5df38ccf5f4dced0454f79c2bfb3bc3f39bdd7a43166d395c687ae54c3f"
Mar 08 14:43:32 edgeNode edgecore[53223]: E0308 14:43:32.407680   53223 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"redis\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=redis pod=redis-edgeNode-65dd65d6c7-lkjdh_edge-namespace(525c9cf6-ed98-464a-b4c5-bda0dee8a1e6)\"" pod="edge-namespace/redis-edgeNode-65dd65d6c7-lkjdh" podUID="525c9cf6-ed98-464a-b4c5-bda0dee8a1e6"
Mar 08 14:43:44 edgeNode edgecore[53223]: I0308 14:43:44.728456   53223 scope.go:115] "RemoveContainer" containerID="eb23b5df38ccf5f4dced0454f79c2bfb3bc3f39bdd7a43166d395c687ae54c3f"
Mar 08 14:43:44 edgeNode edgecore[53223]: E0308 14:43:44.728938   53223 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"redis\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=redis pod=redis-edgeNode-65dd65d6c7-lkjdh_edge-namespace(525c9cf6-ed98-464a-b4c5-bda0dee8a1e6)\"" pod="edge-namespace/redis-edgeNode-65dd65d6c7-lkjdh" podUID="525c9cf6-ed98-464a-b4c5-bda0dee8a1e6"
Mar 08 14:43:56 edgeNode edgecore[53223]: I0308 14:43:56.272156   53223 scope.go:115] "RemoveContainer" containerID="eb23b5df38ccf5f4dced0454f79c2bfb3bc3f39bdd7a43166d395c687ae54c3f"
Mar 08 14:43:56 edgeNode edgecore[53223]: E0308 14:43:56.272574   53223 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"redis\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=redis pod=redis-edgeNode-65dd65d6c7-lkjdh_edge-namespace(525c9cf6-ed98-464a-b4c5-bda0dee8a1e6)\"" pod="edge-namespace/redis-edgeNode-65dd65d6c7-lkjdh" podUID="525c9cf6-ed98-464a-b4c5-bda0dee8a1e6"

kubectl view of the pods on the node:

$ kubectl get pods -A -owide | grep edgeNode
edge-namespace         core-edgeNode-685fb7b7fc-mjscv                              1/1     Running            6 (2m46s ago)      32m     10.1.8.7         edgeNode                                         <none>           <none>
edge-namespace         mqtt-local-edgeNode-5cd99f69d-zs7jj                         1/1     Running            8 (109s ago)       33m     10.1.8.7         edgeNode                                         <none>           <none>
edge-namespace         redis-edgeNode-65dd65d6c7-lkjdh                             0/1     CrashLoopBackOff   7 (19s ago)        32m     10.1.8.7         edgeNode                                         <none>           <none>

federicocagnola commented 2 months ago

I managed to get kubeedge working on Ubuntu 22.04 @120L020430 @WillardHu. After thorough investigation, the culprit looked like cgroups v2 and container compatibility. I hope that following my steps will help other developers solve this issue.

References:

  1. Blog article referencing the wider issue: link
  2. Overview on containers/cgroups v2: link

To make cgroup management work with kubeedge:

sudo nano /etc/default/grub

I edited the following line:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"

to this:

GRUB_CMDLINE_LINUX_DEFAULT="systemd.unified_cgroup_hierarchy=0 quiet splash"

then

sudo update-grub

and reboot
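
After the reboot, you can confirm which cgroup hierarchy the node is actually running (a quick check on systemd-based distros):

$ stat -fc %T /sys/fs/cgroup/
# "cgroup2fs" means the unified cgroups v2 hierarchy is still active; "tmpfs" means cgroups v1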