kubeedge / kubeedge

Kubernetes Native Edge Computing Framework (project under CNCF)
https://kubeedge.io
Apache License 2.0

cloudcore stops suddenly #4145

Open ljjzhka opened 2 years ago

ljjzhka commented 2 years ago

The detailed log is as follows:

I0829 16:40:18.216591 4346 upstream.go:89] Dispatch message: a0a31574-3922-4820-b2fc-7dc65a3dd3e1
I0829 16:40:18.216596 4346 upstream.go:96] Message: a0a31574-3922-4820-b2fc-7dc65a3dd3e1, resource type is: membership/detail
I0829 16:40:18.221807 4346 upstream.go:89] Dispatch message: 39be648a-9cb0-4df4-adbd-5cc92809f2bf
I0829 16:40:18.221815 4346 upstream.go:96] Message: 39be648a-9cb0-4df4-adbd-5cc92809f2bf, resource type is: membership/detail
W0829 16:40:18.993887 4346 upstream.go:468] message: 63c59d88-ac53-4862-9f10-bd98eae50780 process failure, node dev15 not found
I0829 16:40:27.200015 4346 upstream.go:376] message: 76ec0bfc-f525-4023-9086-90690e432f06, pod delete successfully, namespace: aiot-test, name: remote-connect-21-59f9548d8c-mqvcb
W0829 16:40:27.255840 4346 messagehandler.go:394] node c1 is deleted, data for node will be cleaned up
E0829 16:40:27.255941 4346 ws.go:122] failed to read message, error: read tcp 172.18.185.67:10000->121.35.47.182:51954: use of closed network connection
W0829 16:40:27.256016 4346 upstream.go:187] parse message: 8ed7b6b9-0b92-4b55-9de3-6dab0723d8e6 resource type with error: resource type not found
I0829 16:40:27.256158 4346 synccontroller.go:148] ObjectSync c1.0566eec0-6352-44d5-a8eb-4ee105f18f34 will be deleted since node c1 has been deleted
W0829 16:40:27.256359 4346 messagehandler.go:201] Stop keepalive check for node: c1
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x17fb062]

goroutine 976557 [running]:
github.com/kubeedge/kubeedge/cloud/pkg/cloudhub/handler.(*MessageHandle).MessageWriteLoop(0xc000a3edc0, 0xc014447f80, 0xc000a9c100)
	/root/kubeedge/cloud/pkg/cloudhub/handler/messagehandler.go:441 +0xe2
created by github.com/kubeedge/kubeedge/cloud/pkg/cloudhub/handler.(*MessageHandle).ServeConn
	/root/kubeedge/cloud/pkg/cloudhub/handler/messagehandler.go:308 +0x208

ljjzhka commented 2 years ago

Edge nodes that join the cluster afterwards cannot be found. The log is as follows:

[root@master ~]# journalctl -xef | grep cloudcore | grep twowin
Aug 29 17:31:06 master cloudcore[4390]: I0829 17:31:06.390911 4390 upstream.go:428] node: twowin already exists, do nothing
Aug 29 17:33:20 master cloudcore[4390]: E0829 17:33:20.672483 4390 messagehandler.go:471] Failed to send event to node: twowin, affected event: id: 932c5a11-6aea-4c9d-abcb-f05a95c33094, parent_id: , group: resource, source: edgecontroller, resource: aiot-test/pod/remote-connect-25-86d9dfb85-8tk45, operation: update, err: use of closed network connection
Aug 29 17:34:20 master cloudcore[4390]: I0829 17:34:20.994217 4390 synccontroller.go:148] ObjectSync twowin.cffc2a7d-c421-4b1b-8b4b-b021cb7e4d7a will be deleted since node twowin has been deleted
Aug 29 17:34:20 master cloudcore[4390]: W0829 17:34:20.999169 4390 messagehandler.go:394] node twowin is deleted, data for node will be cleaned up
Aug 29 17:34:21 master cloudcore[4390]: E0829 17:34:20.999320 4390 messagehandler.go:447] nodeQueue for node twowin has shutdown
Aug 29 17:34:21 master cloudcore[4390]: W0829 17:34:20.999348 4390 messagehandler.go:201] Stop keepalive check for node: twowin
Aug 29 17:34:21 master cloudcore[4390]: I0829 17:34:21.000749 4390 synccontroller.go:148] ObjectSync twowin.548b37fc-592b-49ce-be2d-99bcc6a4bbe4 will be deleted since node twowin has been deleted
Aug 29 17:34:21 master cloudcore[4390]: I0829 17:34:21.003794 4390 synccontroller.go:148] ObjectSync twowin.52ac182c-af69-4f4d-8ade-b3faf6a598c9 will be deleted since node twowin has been deleted
Aug 29 17:34:21 master cloudcore[4390]: I0829 17:34:21.007568 4390 synccontroller.go:148] ObjectSync twowin.7534f791-2a25-43e8-bdd7-77e89345b6a0 will be deleted since node twowin has been deleted
Aug 29 17:34:27 master cloudcore[4390]: W0829 17:34:27.482498 4390 upstream.go:468] message: fdd3b922-5174-4883-94ca-47cd46595d10 process failure, node twowin not found
Aug 29 17:34:32 master cloudcore[4390]: E0829 17:34:32.088579 4390 messagehandler.go:126] Failed to load node : twowin
Aug 29 17:34:37 master cloudcore[4390]: W0829 17:34:37.525128 4390 upstream.go:468] message: 266c4007-ab39-40ee-ab2b-72d0d44b14ac process failure, node twowin not found
Aug 29 17:34:47 master cloudcore[4390]: E0829 17:34:47.089535 4390 messagehandler.go:126] Failed to load node : twowin
Aug 29 17:34:47 master cloudcore[4390]: W0829 17:34:47.565485 4390 upstream.go:468] message: a15d3b43-2898-4bf6-a418-f67f17ee955a process failure, node twowin not found
Aug 29 17:34:57 master cloudcore[4390]: I0829 17:34:57.593978 4390 messagehandler.go:304] edge node twowin for project e632aba927ea4ac2b575ec1603d56f10 connected
Aug 29 17:34:57 master cloudcore[4390]: W0829 17:34:57.603331 4390 upstream.go:468] message: 3f156101-4430-4b0f-adde-b569fc9c9063 process failure, node twowin not found
Aug 29 17:34:57 master cloudcore[4390]: W0829 17:34:57.604671 4390 upstream.go:468] message: f19819a8-cfa2-4ad4-b8fe-923fd8ffaafe process failure, node twowin not found
Aug 29 17:34:57 master cloudcore[4390]: W0829 17:34:57.612764 4390 upstream.go:468] message: 06bf976a-31ac-4ad5-9ac6-5e4218c502d0 process failure, node twowin not found
Aug 29 17:34:57 master cloudcore[4390]: E0829 17:34:57.625866 4390 messagehandler.go:602] Delete Success Point failed with error: objectsyncs.reliablesyncs.kubeedge.io "twowin." not found
Aug 29 17:34:57 master cloudcore[4390]: W0829 17:34:57.721322 4390 upstream.go:468] message: 832ec386-20ef-48bc-b1a7-8bdcbf873709 process failure, node twowin not found
Aug 29 17:35:07 master cloudcore[4390]: W0829 17:35:07.674800 4390 upstream.go:468] message: 78563fb9-6bde-497f-bc3b-dbc1c0118f1c process failure, node twowin not found
Aug 29 17:35:07 master cloudcore[4390]: W0829 17:35:07.778154 4390 upstream.go:468] message: 3d70bc05-76e7-4ef3-90d3-bf5670de3136 process failure, node twowin not found
Aug 29 17:35:17 master cloudcore[4390]: W0829 17:35:17.702964 4390 upstream.go:468] message: f5f782d3-bbca-40d5-af0b-70650e64771d process failure, node twowin not found
Aug 29 17:35:17 master cloudcore[4390]: W0829 17:35:17.840325 4390 upstream.go:468] message: e3c41f21-3403-4808-8e47-a78e830f1935 process failure, node twowin not found
Aug 29 17:35:27 master cloudcore[4390]: W0829 17:35:27.749316 4390 upstream.go:468] message: dd8db745-f1b9-4896-8a17-cc563e9e7003 process failure, node twowin not found
Aug 29 17:35:27 master cloudcore[4390]: W0829 17:35:27.899450 4390 upstream.go:468] message: aa421d63-b292-48db-a1be-822e44c94cce process failure, node twowin not found
Aug 29 17:35:37 master cloudcore[4390]: W0829 17:35:37.813122 4390 upstream.go:468] message: e90da3ef-a03b-45f1-8dc1-2329c7452a60 process failure, node twowin not found
Aug 29 17:35:37 master cloudcore[4390]: W0829 17:35:37.958092 4390 upstream.go:468] message: d23b339b-f012-44bc-b65b-41e604a89cbd process failure, node twowin not found
Aug 29 17:35:48 master cloudcore[4390]: W0829 17:35:48.017624 4390 upstream.go:468] message: d00a2d7b-629d-49ac-98e0-805de91047c5 process failure, node twowin not found
Aug 29 17:35:58 master cloudcore[4390]: W0829 17:35:58.077751 4390 upstream.go:468] message: 3c25320b-d254-46a8-8af9-6ecd8fd236ca process failure, node twowin not found
Aug 29 17:36:04 master cloudcore[4390]: W0829 17:36:04.596660 4390 upstream.go:764] message: f88a0c46-eb5c-4cd3-84e1-ee6e9db6a8e0 process failure, node twowin not found
Aug 29 17:36:11 master cloudcore[4390]: E0829 17:36:11.795240 4390 messagehandler.go:471] Failed to send event to node: twowin, affected event: id: 0d3b8d01-9a02-4276-8d6a-8f6fcf4fd87e, parent_id: , group: resource, source: edgecontroller, resource: aiot-test/pod/video-26-7dd8b74ddd-c4d6z, operation: update, err: use of closed network connection

vincentgoat commented 2 years ago

Hi @LeiJJ8, could you provide the versions of cloudcore and edgecore?

ljjzhka commented 2 years ago

Hi @vincentgoat, the version I am using is 1.10.0.

vincentgoat commented 2 years ago

I think it is caused by a panic from a type assertion; more details are shown in the screenshot below.

[screenshot]

Fortunately, this issue should be fixed once PR https://github.com/kubeedge/kubeedge/pull/4105 is merged.

ljjzhka commented 2 years ago

@vincentgoat Excuse me, has it been merged into the latest versions, for example 1.10.3 or 1.11.2?

vincentgoat commented 2 years ago

This PR https://github.com/kubeedge/kubeedge/pull/4105 was just merged into the master branch and released in v1.12.0. Does this problem reproduce frequently? We would highly appreciate it if you could work on it as well.

ljjzhka commented 2 years ago

@vincentgoat It reproduces often in 1.10.1. I will get ready to test the latest version.