taosdata / TDengine

High-performance, scalable time-series database designed for Industrial IoT (IIoT) scenarios
https://tdengine.com
GNU Affero General Public License v3.0

REDISTRIBUTE VGROUP operation never completes #28109

Open everccc opened 3 weeks ago

everccc commented 3 weeks ago

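Bug Description The database is configured with a single replica. After running REDISTRIBUTE VGROUP vgroup_no DNODE dnode_id1; the operation had still not completed more than 6 hours later. Querying a supertable returns: DB error: Sync leader is restoring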

To Reproduce Steps to reproduce the behavior
1. Run the statement REDISTRIBUTE VGROUP 554 DNODE 2;
2. The vnode status stays stuck in the state shown below:

vgroup_id  |            db_name             |   tables    | v1_dnode |  v1_status  | v2_dnode |  v2_status  | v3_dnode |  v3_status  | v4_dnode |  v4_status  |  cacheload  | cacheelements | tsma |
======================================================================================================================================================================================================
         554 | dfcv                           |        7060 |        7 | leader**    |        2 | follower    | NULL     | NULL        | NULL     | NULL        |           0 |             0 |    0 |

The log keeps rolling with entries like the following:

taosdlog log

09/25 16:34:50.888542 03634967 SYN WARN vgId:554, out of buffer range. index:471749156, term:3. log buffer: [-1 -1 -1, 0)
09/25 16:34:50.956298 03625585 MND stb:1.log.keeper_monitor, start to retrieve meta
09/25 16:34:50.956315 03625585 MND stb:1.log.taosadapter_restful_http_request_in_flight, start to retrieve meta
09/25 16:34:51.040574 03625588 MND stb:1.log.cluster_info, start to retrieve meta
09/25 16:34:51.040590 03625588 MND stb:1.log.log_dir, start to retrieve meta
09/25 16:34:51.040599 03625588 MND stb:1.log.vgroups_info, start to retrieve meta
09/25 16:34:51.040608 03625588 MND stb:1.log.taosadapter_restful_http_request_fail, start to retrieve meta
09/25 16:34:51.040617 03625588 MND stb:1.log.temp_dir, start to retrieve meta
09/25 16:34:51.241496 03625561 RPC WARN DND-C conn 0x7f010aa32e00 send cost:5276344us, send exception
09/25 16:34:51.241529 03634966 VND ERROR vgId:554, msg:0x7efe2ac04628 failed to process since restore not finished, type:alter-confirm, gtid:0xa370d3205ec70022:0x31f6228518c2b0ac
09/25 16:34:51.241549 03634966 SYN vgId:554, sync get retry epset numOfEps:2 inUse:0
09/25 16:34:51.357171 03625586 MND stb:1.log.cluster_info, start to retrieve meta
09/25 16:34:51.357199 03625586 MND stb:1.log.log_dir, start to retrieve meta
09/25 16:34:51.357208 03625586 MND stb:1.log.temp_dir, start to retrieve meta
09/25 16:34:51.357217 03625586 MND stb:1.log.vgroups_info, start to retrieve meta
09/25 16:34:51.357225 03625586 MND stb:1.log.taosadapter_restful_http_request_fail, start to retrieve meta
09/25 16:34:51.357233 03625586 MND stb:1.dfcv.td_d00b, start to retrieve meta
09/25 16:34:51.640422 03625589 MND trans:46700, continue to execute, stage:redoAction createTime:1727233285832 topHalf:1
09/25 16:34:51.640445 03625589 MND trans:46700, execute 8 actions serial, current redoAction:3
09/25 16:34:51.640450 03625589 MND trans:46700, redoAction:3 is in progress and wait it finish
09/25 16:34:51.640460 03625589 MND trans:46700, stage keep on redoAction since Action in progress
09/25 16:34:52.105982 03625585 MND stb:1.log.vnodes_role, start to retrieve meta
09/25 16:34:52.106008 03625585 MND stb:1.dfcv.td_0200, start to retrieve meta

Environment (please complete the following information):
OS: Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-181-generic x86_64)
Memory, CPU, current Disk Space: mem: 256G, Disk: 20T
TDengine Version: 3.1.0.2, 6-node cluster, 256 vgroups

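Additional Context Is there a way to manually terminate the REDISTRIBUTE operation? The current state of this vnode means queries that touch it still cannot return to normal.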

yu285 commented 1 week ago

Did the leader** status appear only after you ran the command, or was it already there before?

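My guess is that something stopped abnormally partway through. Run show transactions\G to check whether a transaction is still executing; if so, it can be killed.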
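A minimal sketch of that check in the taos CLI. It assumes the stuck transaction is the one that appears as trans:46700 in the taosd log above; confirm the id in the SHOW TRANSACTIONS output before killing anything.

```sql
-- List the transactions the mnode is tracking; \G prints each row vertically.
SHOW TRANSACTIONS\G;

-- Assumption: 46700 is taken from the "trans:46700" lines in the log above.
-- Replace it with the id actually reported by SHOW TRANSACTIONS.
KILL TRANSACTION 46700;

-- Re-check the replica states of vgroup 554 in database dfcv afterwards.
SHOW dfcv.VGROUPS;
```

Whether killing the transaction leaves vgroup 554 in a clean state is not confirmed in this thread, so treat the last statement as a verification step rather than a guarantee.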

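Then upgrade to the latest version, 3.3.3.0, and observe again. If you still have questions, contact a15652223354 on WeChat for detailed troubleshooting.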