taosdata / TDengine

High-performance, scalable time-series database designed for Industrial IoT (IIoT) scenarios
https://tdengine.com
GNU Affero General Public License v3.0

After upgrading from 2.0.4.0 to 2.0.5.0, the service stops after running for a while #3805

Closed quchunhui closed 3 years ago

quchunhui commented 4 years ago

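Bug Description My cluster has 3 nodes, and I tried upgrading it from 2.0.4.0 to 2.0.5.0. After the upgrade, the taosd process on the second and third nodes goes down after running for a while. This problem did not occur before the upgrade.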

To Reproduce

Expected Behavior After the upgrade, the service should run normally and stably.


Environment (please complete the following information):

Additional Context Partial log excerpts from each node follow.

First node:
10/12 14:19:37.157064 0x7f882dff0700 DND msg:status will be processed in mpeer queue
10/12 14:19:37.157067 0x7f882dff0700 MND (nil), msg:status in mpeer queue, will be redireced, numOfEps:2 inUse:0
10/12 14:19:37.157069 0x7f882dff0700 MND mnode index:0 ep:rexel-ids001:6035
10/12 14:19:37.157071 0x7f882dff0700 MND mnode index:1 ep:rexel-ids002:6035
10/12 14:19:37.157091 0x7f882dff0700 DND msg:status will be processed in mpeer queue
10/12 14:19:37.157094 0x7f882dff0700 MND (nil), msg:status in mpeer queue, will be redireced, numOfEps:2 inUse:0
10/12 14:19:37.157095 0x7f882dff0700 MND mnode index:0 ep:rexel-ids001:6035
10/12 14:19:37.157097 0x7f882dff0700 MND mnode index:1 ep:rexel-ids002:6035
10/12 14:19:37.157117 0x7f882dff0700 DND msg:status will be processed in mpeer queue
10/12 14:19:37.157120 0x7f882dff0700 MND (nil), msg:status in mpeer queue, will be redireced, numOfEps:2 inUse:0
10/12 14:19:37.157121 0x7f882dff0700 MND mnode index:0 ep:rexel-ids001:6035
10/12 14:19:37.157122 0x7f882dff0700 MND mnode index:1 ep:rexel-ids002:6035
10/12 14:19:37.157131 0x7f8857fff700 RPC WARN DND-C 0x7f885f554478 (nil), too many redirects, quit
10/12 14:19:37.157137 0x7f8857fff700 DND mnode EP list for peer is changed, numOfEps:2 inUse:0
10/12 14:19:37.157140 0x7f8857fff700 DND mnode index:0 rexel-ids001:6030
10/12 14:19:37.157142 0x7f8857fff700 DND mnode index:1 rexel-ids002:6030
10/12 14:19:37.157145 0x7f8857fff700 DND ERROR status rsp is received, error:Unable to establish connection
10/12 14:19:37.177218 0x7f882dff0700 DND msg:status will be processed in mpeer queue
10/12 14:19:37.177230 0x7f882dff0700 MND (nil), msg:status in mpeer queue, will be redireced, numOfEps:2 inUse:0
10/12 14:19:37.177232 0x7f882dff0700 MND mnode index:0 ep:rexel-ids001:6035
10/12 14:19:37.177234 0x7f882dff0700 MND mnode index:1 ep:rexel-ids002:6035
10/12 14:19:37.178488 0x7f882dff0700 DND msg:status will be processed in mpeer queue
10/12 14:19:37.178495 0x7f882dff0700 MND (nil), msg:status in mpeer queue, will be redireced, numOfEps:2 inUse:0
10/12 14:19:37.178497 0x7f882dff0700 MND mnode index:0 ep:rexel-ids001:6035
10/12 14:19:37.178499 0x7f882dff0700 MND mnode index:1 ep:rexel-ids002:6035
10/12 14:19:37.179715 0x7f882dff0700 DND msg:status will be processed in mpeer queue
10/12 14:19:37.179720 0x7f882dff0700 MND (nil), msg:status in mpeer queue, will be redireced, numOfEps:2 inUse:0
10/12 14:19:37.179721 0x7f882dff0700 MND mnode index:0 ep:rexel-ids001:6035
10/12 14:19:37.179723 0x7f882dff0700 MND mnode index:1 ep:rexel-ids002:6035
10/12 14:19:37.180957 0x7f882dff0700 DND msg:status will be processed in mpeer queue
10/12 14:19:37.180961 0x7f882dff0700 MND (nil), msg:status in mpeer queue, will be redireced, numOfEps:2 inUse:0
10/12 14:19:37.180963 0x7f882dff0700 MND mnode index:0 ep:rexel-ids001:6035
10/12 14:19:37.180964 0x7f882dff0700 MND mnode index:1 ep:rexel-ids002:6035
10/12 14:19:37.182175 0x7f882dff0700 DND msg:status will be processed in mpeer queue
10/12 14:19:37.182179 0x7f882dff0700 MND (nil), msg:status in mpeer queue, will be redireced, numOfEps:2 inUse:0
10/12 14:19:37.182181 0x7f882dff0700 MND mnode index:0 ep:rexel-ids001:6035
10/12 14:19:37.182182 0x7f882dff0700 MND mnode index:1 ep:rexel-ids002:6035
10/12 14:19:37.183404 0x7f882dff0700 DND msg:status will be processed in mpeer queue
10/12 14:19:37.183408 0x7f882dff0700 MND (nil), msg:status in mpeer queue, will be redireced, numOfEps:2 inUse:0
10/12 14:19:37.183410 0x7f882dff0700 MND mnode index:0 ep:rexel-ids001:6035
10/12 14:19:37.183411 0x7f882dff0700 MND mnode index:1 ep:rexel-ids002:6035

Second node:
10/12 14:12:18.988099 0x7f2428fd3700 SYN vgId:1 peer:rexel-ids001:6040, status msg received, self:slave ver:10062 peer:master ver:10062, ack:1
10/12 14:12:18.988104 0x7f2428fd3700 SYN vgId:1 peer:rexel-ids001:6040, own role:slave, new peer role:master
10/12 14:12:18.988105 0x7f2428fd3700 SYN vgId:1 peer:rexel-ids001:6040, it is the master, ver:10062
10/12 14:12:18.988111 0x7f2428fd3700 SYN vgId:1 peer:rexel-ids001:6040, status msg is sent
10/12 14:12:18.988117 0x7f2428fd3700 SYN vgId:1 peer:rexel-ids001:6040, status msg received, self:slave ver:10062 peer:master ver:10062, ack:0
10/12 14:12:18.988119 0x7f2428fd3700 SYN vgId:1 peer:rexel-ids001:6040, own role:slave, new peer role:master
10/12 14:12:18.988121 0x7f2428fd3700 SYN vgId:1 peer:rexel-ids001:6040, it is the master, ver:10062
10/12 14:12:18.988582 0x7f2423fff700 SYN vgId:13 peer:rexel-ids001:6040, status msg received, self:slave ver:14 peer:master ver:14, ack:1
10/12 14:12:18.988586 0x7f2423fff700 SYN vgId:13 peer:rexel-ids001:6040, own role:slave, new peer role:master
10/12 14:12:18.988589 0x7f2423fff700 SYN vgId:13 peer:rexel-ids001:6040, it is the master, ver:14
10/12 14:12:18.988596 0x7f2423fff700 SYN vgId:13 peer:rexel-ids001:6040, status msg is sent
10/12 14:12:18.988602 0x7f2423fff700 SYN vgId:13 peer:rexel-ids001:6040, status msg received, self:slave ver:14 peer:master ver:14, ack:0
10/12 14:12:18.988605 0x7f2423fff700 SYN vgId:13 peer:rexel-ids001:6040, own role:slave, new peer role:master
10/12 14:12:18.988606 0x7f2423fff700 SYN vgId:13 peer:rexel-ids001:6040, it is the master, ver:14
10/12 14:12:18.993066 0x7f245bfff700 DND module status:1 is set, start mnode module
10/12 14:12:18.993073 0x7f245bfff700 MND mnode module already started...
10/12 14:12:19.131688 0x7f2461c7f700 CQ vgId:13, try connect to TDengine
10/12 14:12:19.135628 0x7f2422ffd700 CQ vgId:13, id:1 CQ:select avg(velocity) from st_znsllj interval(1m) sliding(30s); is openned
10/12 14:12:20.261680 0x7f2461c7f700 SYN vgId:13 peer:rexel-ids003:6040, check peer connection
10/12 14:12:20.263423 0x7f2461c7f700 SYN vgId:13 peer:rexel-ids003:6040, connection to peer server is setup
10/12 14:12:20.265015 0x7f2428fd3700 SYN vgId:13 peer:rexel-ids003:6040, status msg received, self:slave ver:14 peer:slave ver:14, ack:1
10/12 14:12:20.265021 0x7f2428fd3700 SYN vgId:13 peer:rexel-ids003:6040, own role:slave, new peer role:slave
10/12 14:12:20.265023 0x7f2428fd3700 SYN vgId:13 peer:rexel-ids001:6040, it is the master, ver:14
10/12 14:12:20.265030 0x7f2428fd3700 SYN vgId:13 peer:rexel-ids001:6040, status msg is sent
10/12 14:12:20.265035 0x7f2428fd3700 SYN vgId:13 peer:rexel-ids003:6040, status msg is sent
10/12 14:12:20.265041 0x7f2428fd3700 SYN vgId:13 peer:rexel-ids003:6040, status msg is sent
10/12 14:12:20.266208 0x7f2423fff700 SYN vgId:13 peer:rexel-ids001:6040, status msg received, self:slave ver:14 peer:master ver:14, ack:0
10/12 14:12:20.266220 0x7f2423fff700 SYN vgId:13 peer:rexel-ids001:6040, own role:slave, new peer role:master
10/12 14:12:20.266223 0x7f2423fff700 SYN vgId:13 peer:rexel-ids001:6040, it is the master, ver:14
10/12 14:12:20.266607 0x7f2428fd3700 SYN vgId:13 peer:rexel-ids003:6040, status msg received, self:slave ver:14 peer:slave ver:14, ack:1
10/12 14:12:20.266611 0x7f2428fd3700 SYN vgId:13 peer:rexel-ids003:6040, own role:slave, new peer role:slave
10/12 14:12:20.266613 0x7f2428fd3700 SYN vgId:13 peer:rexel-ids001:6040, it is the master, ver:14
10/12 14:12:20.266618 0x7f2428fd3700 SYN vgId:13 peer:rexel-ids003:6040, status msg is sent
10/12 14:12:20.266647 0x7f2428fd3700 SYN vgId:13 peer:rexel-ids003:6040, status msg received, self:slave ver:14 peer:slave ver:14, ack:0
10/12 14:12:20.266650 0x7f2428fd3700 SYN vgId:13 peer:rexel-ids003:6040, own role:slave, new peer role:slave
10/12 14:12:20.266652 0x7f2428fd3700 SYN vgId:13 peer:rexel-ids001:6040, it is the master, ver:14
10/12 14:12:21.350116 0x7f24307e2700 MND user:_root, failed to auth user, mnode is not master
10/12 14:12:21.350142 0x7f24307e2700 DND user:_root, send auth msg to mnodes
10/12 14:12:21.351500 0x7f24307e2700 DND user:_root, auth msg received from mnodes
10/12 14:12:21.351534 0x7f24567f4700 DND 0x7f87dc000b20, msg:query will be processed in vread queue, qtype:0, msg:0x7f23fc001490
10/12 14:12:21.351773 0x7f2455ff3700 DND (nil), msg:query will be processed in vread queue, qtype:4, msg:0x7f24000f7530

Third node:
10/12 14:20:05.464450 0x7f3fdcce7700 RPC WARN DND-C 0x7f3fdfd12188 (nil), too many redirects, quit
10/12 14:20:05.464465 0x7f3fdcce7700 DND mnode EP list for peer is changed, numOfEps:2 inUse:0
10/12 14:20:05.464469 0x7f3fdcce7700 DND mnode index:0 rexel-ids001:6030
10/12 14:20:05.464472 0x7f3fdcce7700 DND mnode index:1 rexel-ids002:6030
10/12 14:20:05.464476 0x7f3fdcce7700 DND ERROR status rsp is received, error:Unable to establish connection
10/12 14:20:06.474431 0x7f3fdcce7700 RPC WARN DND-C 0x7f3fdfd12188 (nil), too many redirects, quit
10/12 14:20:06.474449 0x7f3fdcce7700 DND mnode EP list for peer is changed, numOfEps:2 inUse:0
10/12 14:20:06.474452 0x7f3fdcce7700 DND mnode index:0 rexel-ids001:6030
10/12 14:20:06.474455 0x7f3fdcce7700 DND mnode index:1 rexel-ids002:6030
10/12 14:20:06.474458 0x7f3fdcce7700 DND ERROR status rsp is received, error:Unable to establish connection
10/12 14:20:07.282326 0x7f3fde53f700 RPC ERROR failed to connect to:0x635b13ac:6030
10/12 14:20:07.282345 0x7f3fde53f700 RPC ERROR TSC 0x7f3f54000fa0, failed to set up connection(Unable to establish connection)
10/12 14:20:07.282356 0x7f3fde53f700 TSC ERROR 0x7f3f7c007b00 get tableMeta failed, code:Unable to establish connection
10/12 14:20:07.282359 0x7f3fde53f700 TSC ERROR 0x7f3f7c007b00 add into queued async res, code:Unable to establish connection
10/12 14:20:07.282369 0x7f3f857c9700 TSC ERROR 0x7f3f7c007b00 stream:0x7f3f7c00c250, query data failed, code:0x8000000b, retry in 11000ms
10/12 14:20:07.282382 0x7f3f857c9700 UTL ERROR cache:tableMeta, NULL data to release
10/12 14:20:07.484494 0x7f3fdcce7700 RPC WARN DND-C 0x7f3fdfd12188 (nil), too many redirects, quit
10/12 14:20:07.484511 0x7f3fdcce7700 DND mnode EP list for peer is changed, numOfEps:2 inUse:0
10/12 14:20:07.484514 0x7f3fdcce7700 DND mnode index:0 rexel-ids001:6030
10/12 14:20:07.484516 0x7f3fdcce7700 DND mnode index:1 rexel-ids002:6030
10/12 14:20:07.484520 0x7f3fdcce7700 DND ERROR status rsp is received, error:Unable to establish connection
10/12 14:20:08.494436 0x7f3fdcce7700 RPC WARN DND-C 0x7f3fdfd12188 (nil), too many redirects, quit
10/12 14:20:08.494452 0x7f3fdcce7700 DND mnode EP list for peer is changed, numOfEps:2 inUse:0
10/12 14:20:08.494455 0x7f3fdcce7700 DND mnode index:0 rexel-ids001:6030
10/12 14:20:08.494458 0x7f3fdcce7700 DND mnode index:1 rexel-ids002:6030
10/12 14:20:08.494461 0x7f3fdcce7700 DND ERROR status rsp is received, error:Unable to establish connection
10/12 14:20:09.504594 0x7f3fdcce7700 RPC WARN DND-C 0x7f3fdfd12188 (nil), too many redirects, quit
10/12 14:20:09.504608 0x7f3fdcce7700 DND mnode EP list for peer is changed, numOfEps:2 inUse:0
10/12 14:20:09.504612 0x7f3fdcce7700 DND mnode index:0 rexel-ids001:6030
10/12 14:20:09.504615 0x7f3fdcce7700 DND mnode index:1 rexel-ids002:6030
10/12 14:20:09.504618 0x7f3fdcce7700 DND ERROR status rsp is received, error:Unable to establish connection
10/12 14:20:10.514488 0x7f3fdcce7700 RPC WARN DND-C 0x7f3fdfd12188 (nil), too many redirects, quit
10/12 14:20:10.514502 0x7f3fdcce7700 DND mnode EP list for peer is changed, numOfEps:2 inUse:0
10/12 14:20:10.514505 0x7f3fdcce7700 DND mnode index:0 rexel-ids001:6030
10/12 14:20:10.514509 0x7f3fdcce7700 DND mnode index:1 rexel-ids002:6030
10/12 14:20:10.514512 0x7f3fdcce7700 DND ERROR status rsp is received, error:Unable to establish connection
10/12 14:20:11.524413 0x7f3fdcce7700 RPC WARN DND-C 0x7f3fdfd12188 (nil), too many redirects, quit
10/12 14:20:11.524429 0x7f3fdcce7700 DND mnode EP list for peer is changed, numOfEps:2 inUse:0
10/12 14:20:11.524433 0x7f3fdcce7700 DND mnode index:0 rexel-ids001:6030
10/12 14:20:11.524436 0x7f3fdcce7700 DND mnode index:1 rexel-ids002:6030
10/12 14:20:11.524439 0x7f3fdcce7700 DND ERROR status rsp is received, error:Unable to establish connection
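
For anyone triaging similar excerpts, here is a rough Python sketch that tallies log entries by module and severity, which makes a repeating pattern such as the RPC WARN / DND ERROR cycle easier to spot. The line layout (timestamp, thread id, module tag, optional WARN or ERROR marker, message) is inferred only from the excerpts in this issue and is an assumption, not a documented taosd log format. The `summarize` helper and the `UNMARKED` label are hypothetical names chosen for this sketch.

```python
import re
from collections import Counter

# Assumed layout, inferred from the excerpts in this issue (not an official spec):
#   MM/DD HH:MM:SS.uuuuuu 0xTHREAD MODULE [WARN|ERROR] message
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) "
    r"(?P<thread>0x[0-9a-f]+) "
    r"(?P<module>[A-Z]+) "
    r"(?:(?P<level>WARN|ERROR) )?"
    r"(?P<msg>.*)$"
)


def summarize(lines):
    """Count entries per (module, level); lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        # Entries without a WARN/ERROR marker get a label made up for this sketch.
        level = match.group("level") or "UNMARKED"
        counts[(match.group("module"), level)] += 1
    return counts


# Three entries copied from the third node's excerpt.
sample = [
    "10/12 14:20:05.464450 0x7f3fdcce7700 RPC WARN DND-C 0x7f3fdfd12188 (nil), too many redirects, quit",
    "10/12 14:20:05.464465 0x7f3fdcce7700 DND mnode EP list for peer is changed, numOfEps:2 inUse:0",
    "10/12 14:20:05.464476 0x7f3fdcce7700 DND ERROR status rsp is received, error:Unable to establish connection",
]

for (module, level), count in sorted(summarize(sample).items()):
    print(module, level, count)
```

To run it over a whole file, pass an open file object to `summarize` instead of the sample list; the file must have one entry per line.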

guanshengliang commented 4 years ago

This looks very similar to a known issue we are currently working on. Once I have fixed it, I will send you a separate beta build to verify.

Aries-Lee1991 commented 4 years ago

@quchunhui Hello, has this problem been resolved on your side?

guanshengliang commented 4 years ago

The 2052 build has been sent to you by private message. Please verify whether the problem still occurs.

quchunhui commented 3 years ago

Tested; the problem is resolved.