housepower / ckman

This is a tool used to manage and monitor ClickHouse databases
Apache License 2.0

Adding a replicated node fails after simulating a node outage in replica mode #281

Closed XuankuF closed 8 months ago

XuankuF commented 1 year ago

[ckman version] 2.3.6
[OS / architecture] CentOS 7.2
[ClickHouse version] 23.3.7.5
[Problem description]

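  1. Create a replica-mode cluster from two hosts, then create a test table and import test data.
  2. Power off one node, then delete the powered-off node in ckman. The metrika.xml file is rewritten to contain only the remaining node.
  3. Add a new node. The metrika.xml files on both nodes are updated successfully and the services on both nodes are in the running state, but ckman reports that the operation failed. [image]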

ckman error log:

2023-08-07T18:15:27.294+0800    ERROR   runner/runner.go:59     clickhouse.addnode failed:[ConfigExt]: : code: 519, message: All attempts to get table structure failed. Log:
github.com/housepower/ckman/service/runner.(*RunnerService).CheckTaskEvent.func1
        /root/chenyc/build/ckman/service/runner/runner.go:59
github.com/housepower/ckman/common.runFunc
        /root/chenyc/build/ckman/common/workerpool.go:91
github.com/housepower/ckman/common.(*WorkerPool).wokerFunc
        /root/chenyc/build/ckman/common/workerpool.go:68

Error log on the node being added to the cluster:

2023.08.07 18:15:27.286317 [ 2330 ] {b34e74a4-6cac-4128-98b0-470415b43b81} <Error> executeQuery: Code: 519. DB::NetException: All attempts to get table structure failed. Log: 

. (NO_REMOTE_SHARD_AVAILABLE) (version 23.3.7.5 (official build)) (from [::ffff:192.168.33.34]:57318) (in query: SELECT DISTINCT database AS name, concat('CREATE DATABASE IF NOT EXISTS "', name, '"') AS create_db_query FROM remote('192.168.33.72', system, tables, 'default', '[HIDDEN]') WHERE database NOT IN ('system', 'information_schema', 'INFORMATION_SCHEMA') SETTINGS skip_unavailable_shards = 1), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0xe1dc395 in /usr/bin/clickhouse
1. ? @ 0x1420ec64 in /usr/bin/clickhouse
2. DB::getStructureOfRemoteTable(DB::Cluster const&, DB::StorageID const&, std::shared_ptr<DB::Context const>, std::shared_ptr<DB::IAST> const&) @ 0x1420e9f8 in /usr/bin/clickhouse
3. DB::TableFunctionRemote::getActualTableStructure(std::shared_ptr<DB::Context const>) const @ 0x121be822 in /usr/bin/clickhouse
4. DB::TableFunctionRemote::executeImpl(std::shared_ptr<DB::IAST> const&, std::shared_ptr<DB::Context const>, String const&, DB::ColumnsDescription) const @ 0x121bda0a in /usr/bin/clickhouse
5. DB::ITableFunction::execute(std::shared_ptr<DB::IAST> const&, std::shared_ptr<DB::Context const>, String const&, DB::ColumnsDescription, bool, bool) const @ 0x12513c0f in /usr/bin/clickhouse
6. DB::Context::executeTableFunction(std::shared_ptr<DB::IAST> const&, DB::ASTSelectQuery const*) @ 0x12d7b7fe in /usr/bin/clickhouse
7. DB::JoinedTables::getLeftTableStorage() @ 0x139107b7 in /usr/bin/clickhouse
8. DB::InterpreterSelectQuery::InterpreterSelectQuery(std::shared_ptr<DB::IAST> const&, std::shared_ptr<DB::Context> const&, std::optional<DB::Pipe>, std::shared_ptr<DB::IStorage> const&, DB::SelectQueryOptions const&, std::vector<String, std::allocator<String>> const&, std::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::shared_ptr<DB::PreparedSets>) @ 0x1382d016 in /usr/bin/clickhouse
9. DB::InterpreterSelectWithUnionQuery::buildCurrentChildInterpreter(std::shared_ptr<DB::IAST> const&, std::vector<String, std::allocator<String>> const&) @ 0x138cbf22 in /usr/bin/clickhouse
10. DB::InterpreterSelectWithUnionQuery::InterpreterSelectWithUnionQuery(std::shared_ptr<DB::IAST> const&, std::shared_ptr<DB::Context>, DB::SelectQueryOptions const&, std::vector<String, std::allocator<String>> const&) @ 0x138c9b93 in /usr/bin/clickhouse
11. DB::InterpreterFactory::get(std::shared_ptr<DB::IAST>&, std::shared_ptr<DB::Context>, DB::SelectQueryOptions const&) @ 0x137e8092 in /usr/bin/clickhouse
12. ? @ 0x13bf3ae2 in /usr/bin/clickhouse
13. DB::executeQuery(String const&, std::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum) @ 0x13bf18ad in /usr/bin/clickhouse
14. DB::TCPHandler::runImpl() @ 0x149c554c in /usr/bin/clickhouse
15. DB::TCPHandler::run() @ 0x149dad59 in /usr/bin/clickhouse
16. Poco::Net::TCPServerConnection::start() @ 0x179217d4 in /usr/bin/clickhouse
17. Poco::Net::TCPServerDispatcher::run() @ 0x179229fb in /usr/bin/clickhouse
18. Poco::PooledThread::run() @ 0x17aaa287 in /usr/bin/clickhouse
19. Poco::ThreadImpl::runnableEntry(void*) @ 0x17aa7cbd in /usr/bin/clickhouse
20. start_thread @ 0x7ea5 in /usr/lib64/libpthread-2.17.so
21. clone @ 0xfe9fd in /usr/lib64/libc-2.17.so
YenchangChan commented 11 months ago

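It looks like the password check for remote() fails intermittently. I have run into this a few times as well and have not found the cause yet, but retrying a few times usually works.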

YenchangChan commented 8 months ago

fixed by #287