Closed: codingknees closed this issue 1 year ago
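The INFO logs on two of the nodes look like this:
The log on one node looks like this:
@codingknees Please paste the metad-2 INFO log here.
In the INFO log screenshots above, the first screenshot shows the log from two nodes, which should be the two followers. The second screenshot is from the leader node. I have restarted the service several times since then, so the original environment is no longer intact.
I redeployed, and the logs are as follows: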
INFO log from nebula-metad-0:
Log file created at: 2022/10/13 06:12:39
Running on machine: nebula-metad-0
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20221013 06:12:39.106065 1 MetaDaemon.cpp:135] localhost = "nebula-metad-0.nebula-metad-headless.nebula.svc.cluster.local":9559
I20221013 06:12:39.111194 1 NebulaStore.cpp:51] Start the raft service...
I20221013 06:12:39.111518 1 NebulaSnapshotManager.cpp:25] Send snapshot is rate limited to 10485760 for each part by default
I20221013 06:12:39.137447 1 RaftexService.cpp:46] Start raft service on 9560
I20221013 06:12:39.137612 1 NebulaStore.cpp:85] Scan the local path, and init the spaces_
E20221013 06:12:39.137632 1 FileUtils.cpp:377] Failed to read the directory "/usr/local/nebula/data/meta/nebula" (2): No such file or directory
I20221013 06:12:39.137771 1 NebulaStore.cpp:271] Init data from partManager for "nebula-metad-0.nebula-metad-headless.nebula.svc.cluster.local":9559
I20221013 06:12:39.137785 1 NebulaStore.cpp:387] Create data space 0
I20221013 06:12:39.177439 1 RocksEngine.cpp:97] open rocksdb on /usr/local/nebula/data/meta/nebula/0/data
I20221013 06:12:39.187948 1 NebulaStore.cpp:459] Space 0, part 0 has been added, asLearner 0
I20221013 06:12:39.187981 1 NebulaStore.cpp:78] Register handler...
I20221013 06:12:39.187991 1 MetaDaemonInit.cpp:101] Waiting for the leader elected...
I20221013 06:12:39.188000 1 MetaDaemonInit.cpp:113] Leader has not been elected, sleep 1s
E20221013 06:12:39.435500 62 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-metad-1.nebula-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
E20221013 06:12:39.438771 62 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-metad-2.nebula-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
I20221013 06:12:40.188081 1 MetaDaemonInit.cpp:113] Leader has not been elected, sleep 1s
E20221013 06:12:41.005990 63 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-metad-1.nebula-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
E20221013 06:12:41.007997 63 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-metad-2.nebula-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
I20221013 06:12:41.188194 1 MetaDaemonInit.cpp:113] Leader has not been elected, sleep 1s
I20221013 06:12:42.188305 1 MetaDaemonInit.cpp:113] Leader has not been elected, sleep 1s
E20221013 06:12:42.704993 64 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-metad-1.nebula-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
INFO log from nebula-metad-1:
Log file created at: 2022/10/13 06:12:53
Running on machine: nebula-metad-1
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20221013 06:12:53.455024 1 MetaDaemon.cpp:135] localhost = "nebula-metad-1.nebula-metad-headless.nebula.svc.cluster.local":9559
I20221013 06:12:53.458636 1 NebulaStore.cpp:51] Start the raft service...
I20221013 06:12:53.459043 1 NebulaSnapshotManager.cpp:25] Send snapshot is rate limited to 10485760 for each part by default
I20221013 06:12:53.493963 1 RaftexService.cpp:46] Start raft service on 9560
I20221013 06:12:53.494081 1 NebulaStore.cpp:85] Scan the local path, and init the spaces_
I20221013 06:12:53.494138 1 NebulaStore.cpp:92] Scan path "/usr/local/nebula/data/meta/nebula/0"
I20221013 06:12:53.494146 1 NebulaStore.cpp:271] Init data from partManager for "nebula-metad-1.nebula-metad-headless.nebula.svc.cluster.local":9559
I20221013 06:12:53.494163 1 NebulaStore.cpp:387] Create data space 0
I20221013 06:12:53.523059 1 RocksEngine.cpp:97] open rocksdb on /usr/local/nebula/data/meta/nebula/0/data
I20221013 06:12:53.533579 1 NebulaStore.cpp:459] Space 0, part 0 has been added, asLearner 0
I20221013 06:12:53.533618 1 NebulaStore.cpp:78] Register handler...
I20221013 06:12:53.533627 1 MetaDaemonInit.cpp:101] Waiting for the leader elected...
I20221013 06:12:53.533634 1 MetaDaemonInit.cpp:113] Leader has not been elected, sleep 1s
I20221013 06:12:53.842078 62 ThriftClientManager-inl.h:67] resolve "nebula-metad-0.nebula-metad-headless.nebula.svc.cluster.local":9560 as "10.214.2.127":9560
E20221013 06:12:53.850257 62 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-metad-2.nebula-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
I20221013 06:12:54.533761 1 KVBasedClusterIdMan.h:109] There is no clusterId existed in kvstore!
I20221013 06:12:54.533792 1 MetaDaemonInit.cpp:129] I am follower, wait for the leader's clusterId
I20221013 06:12:54.533797 1 MetaDaemonInit.cpp:131] Waiting for the leader's clusterId
I20221013 06:12:55.533919 1 KVBasedClusterIdMan.h:109] There is no clusterId existed in kvstore!
I20221013 06:12:55.533950 1 MetaDaemonInit.cpp:131] Waiting for the leader's clusterId
INFO log from nebula-metad-2:
Log file created at: 2022/10/13 06:13:09
Running on machine: nebula-metad-2
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20221013 06:13:09.043592 1 MetaDaemon.cpp:135] localhost = "nebula-metad-2.nebula-metad-headless.nebula.svc.cluster.local":9559
I20221013 06:13:09.047709 1 NebulaStore.cpp:51] Start the raft service...
I20221013 06:13:09.048156 1 NebulaSnapshotManager.cpp:25] Send snapshot is rate limited to 10485760 for each part by default
I20221013 06:13:09.069625 1 RaftexService.cpp:46] Start raft service on 9560
I20221013 06:13:09.069696 1 NebulaStore.cpp:85] Scan the local path, and init the spaces_
I20221013 06:13:09.069732 1 NebulaStore.cpp:92] Scan path "/usr/local/nebula/data/meta/nebula/0"
I20221013 06:13:09.069741 1 NebulaStore.cpp:271] Init data from partManager for "nebula-metad-2.nebula-metad-headless.nebula.svc.cluster.local":9559
I20221013 06:13:09.069756 1 NebulaStore.cpp:387] Create data space 0
I20221013 06:13:09.117659 1 RocksEngine.cpp:97] open rocksdb on /usr/local/nebula/data/meta/nebula/0/data
I20221013 06:13:09.128047 1 NebulaStore.cpp:459] Space 0, part 0 has been added, asLearner 0
I20221013 06:13:09.128072 1 NebulaStore.cpp:78] Register handler...
I20221013 06:13:09.128080 1 MetaDaemonInit.cpp:101] Waiting for the leader elected...
I20221013 06:13:09.128087 1 MetaDaemonInit.cpp:113] Leader has not been elected, sleep 1s
I20221013 06:13:10.128196 1 KVBasedClusterIdMan.h:109] There is no clusterId existed in kvstore!
I20221013 06:13:10.128226 1 MetaDaemonInit.cpp:129] I am follower, wait for the leader's clusterId
I20221013 06:13:10.128232 1 MetaDaemonInit.cpp:131] Waiting for the leader's clusterId
I20221013 06:13:11.128335 1 KVBasedClusterIdMan.h:109] There is no clusterId existed in kvstore!
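When comparing the three logs above, it can help to tally entries per severity and source file rather than reading them line by line. The following is a minimal sketch in Python, assuming only the header layout that the logs themselves document ("[IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg"). The function names `parse_line` and `summarize` are hypothetical helpers, not part of NebulaGraph. Lines that do not start with a log header, such as the second half of a wrapped entry, are skipped.

```python
import re
from collections import Counter
from datetime import datetime

# Header layout taken from the "Log line format" line in the logs:
# [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
LINE_RE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{8}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+)\s+(?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def parse_line(text):
    """Return a dict for one log entry, or None if the line has no header."""
    match = LINE_RE.match(text)
    if match is None:
        return None
    record = match.groupdict()
    record["timestamp"] = datetime.strptime(
        record["date"] + " " + record["time"], "%Y%m%d %H:%M:%S.%f"
    )
    record["thread"] = int(record["thread"])
    record["line"] = int(record["line"])
    return record


def summarize(lines):
    """Count entries per (severity, source file), ignoring unparsable lines."""
    counts = Counter()
    for text in lines:
        record = parse_line(text)
        if record is not None:
            counts[(record["level"], record["file"])] += 1
    return counts


# A few lines copied from the nebula-metad-0 log, including one wrapped entry.
SAMPLE = [
    "I20221013 06:12:39.188000 1 MetaDaemonInit.cpp:113] Leader has not been elected, sleep 1s",
    "E20221013 06:12:39.435500 62 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-metad-1.nebula-metad-headless.nebula.",
    "svc.cluster.local': Name or service not known (error=-2): Unknown error -2",
    "I20221013 06:12:40.188081 1 MetaDaemonInit.cpp:113] Leader has not been elected, sleep 1s",
]

if __name__ == "__main__":
    for (level, source), count in sorted(summarize(SAMPLE).items()):
        print(level, source, count)
```

Feeding each pod's full INFO file through `summarize` gives a quick view of which source files emit the E-level entries on each node.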
@codingknees OK, we'll try to run a test on Huawei Cloud.
@codingknees I tested on Huawei Cloud and ran into the same error: putting the clusterId failed. We are analyzing it.
@codingknees We've found the cause of the problem, and I will submit a PR to fix it. Also, in the chart's nebula-cluster/values.yaml you can set metad.dataStorage to 10Gi and metad.logStorage to 2Gi.
The problem was fixed in operator 1.3.0.
General Question
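Following the documentation, I deployed with nebula-operator. Environment: Huawei Cloud, Kubernetes 1.21. nebula-metad fails to start:
The PV and PVC status looks normal:
The pod's describe output shows that the process is not ready:
Checking the log files inside the pod: instance 0 fails to connect to instance 2
Instance 2 has not started; the cause is a persistence error:
Inside instance 2, the mount path is fine, and read/write permissions also look normal:
Restarting the pod does not solve the problem. How can I investigate this further and fix it?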
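One low-cost check for the "Failed to resolve address" entries is to resolve each metad peer name from inside a pod. In the logs, nebula-metad-0's file was created at 06:12:39 and nebula-metad-1's at 06:12:53, so the failures logged between 06:12:39 and 06:12:42 may simply mean the peers had not started yet. The sketch below is a hedged illustration in Python, not an official tool. The host name pattern and raft port 9560 are copied from the logs. `peer_hosts` and `check_peers` are hypothetical helpers, and it is assumed that the script runs inside the cluster's network.

```python
import socket


def peer_hosts(replicas, namespace="nebula"):
    """Build metad peer names using the pattern seen in the logs."""
    return [
        f"nebula-metad-{i}.nebula-metad-headless.{namespace}.svc.cluster.local"
        for i in range(replicas)
    ]


def check_peers(hosts, port, resolve=socket.getaddrinfo):
    """Map each host to its sorted resolved addresses, or None on DNS failure.

    `resolve` is injectable so the function can be exercised without a cluster.
    """
    results = {}
    for host in hosts:
        try:
            infos = resolve(host, port)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the address string.
            results[host] = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            results[host] = None
    return results


def report(results):
    """Format one line per peer for printing."""
    return [
        f"{host}: {'UNRESOLVED' if addrs is None else ', '.join(addrs)}"
        for host, addrs in results.items()
    ]
```

Inside a pod, `print("\n".join(report(check_peers(peer_hosts(3), 9560))))` lists which peers resolve. A peer that stays unresolved after its pod is running points at the headless service or DNS rather than at metad itself. In this thread the maintainers traced the failure to a different cause, so a clean result here only rules DNS out.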