vesoft-inc / nebula-operator

Operation utilities for Nebula Graph
https://vesoft-inc.github.io/nebula-operator
Apache License 2.0

Failed to deploy k8s in huawei cloud #164

Closed codingknees closed 1 year ago

codingknees commented 2 years ago

General Question

Following the documentation, I deployed with nebula-operator. The environment is Huawei Cloud, k8s version 1.21. nebula-metad fails to start:

image

The PV and PVC status look normal:

image

The pod's describe output shows that the process is not ready: image

Checking the log file inside the pod: instance 0 fails to connect to instance 2

image

Instance 2 has not started; the cause is a persistence error:

image

Inside instance 2, the directory mount path is fine, and the read/write permissions also look normal:

image

Restarting the pod does not solve the problem. How can I investigate further and fix it?

codingknees commented 2 years ago

The INFO log on two of the nodes looks like this: image The log on the remaining node looks like this: image

MegaByte875 commented 2 years ago

@codingknees Please paste metad-2 INFO log here

codingknees commented 1 year ago

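In the INFO log screenshots above, the first image shows the log from two nodes, which should be the two followers. The second image is from the leader. I have since restarted the service many times, so that environment is no longer intact.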

codingknees commented 1 year ago

I redeployed; the logs are as follows:

INFO log from nebula-metad-0:

Log file created at: 2022/10/13 06:12:39
Running on machine: nebula-metad-0
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20221013 06:12:39.106065     1 MetaDaemon.cpp:135] localhost = "nebula-metad-0.nebula-metad-headless.nebula.svc.cluster.local":9559
I20221013 06:12:39.111194     1 NebulaStore.cpp:51] Start the raft service...
I20221013 06:12:39.111518     1 NebulaSnapshotManager.cpp:25] Send snapshot is rate limited to 10485760 for each part by default
I20221013 06:12:39.137447     1 RaftexService.cpp:46] Start raft service on 9560
I20221013 06:12:39.137612     1 NebulaStore.cpp:85] Scan the local path, and init the spaces_
E20221013 06:12:39.137632     1 FileUtils.cpp:377] Failed to read the directory "/usr/local/nebula/data/meta/nebula" (2): No such file or directory
I20221013 06:12:39.137771     1 NebulaStore.cpp:271] Init data from partManager for "nebula-metad-0.nebula-metad-headless.nebula.svc.cluster.local":9559
I20221013 06:12:39.137785     1 NebulaStore.cpp:387] Create data space 0
I20221013 06:12:39.177439     1 RocksEngine.cpp:97] open rocksdb on /usr/local/nebula/data/meta/nebula/0/data
I20221013 06:12:39.187948     1 NebulaStore.cpp:459] Space 0, part 0 has been added, asLearner 0
I20221013 06:12:39.187981     1 NebulaStore.cpp:78] Register handler...
I20221013 06:12:39.187991     1 MetaDaemonInit.cpp:101] Waiting for the leader elected...
I20221013 06:12:39.188000     1 MetaDaemonInit.cpp:113] Leader has not been elected, sleep 1s
E20221013 06:12:39.435500    62 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-metad-1.nebula-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
E20221013 06:12:39.438771    62 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-metad-2.nebula-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
I20221013 06:12:40.188081     1 MetaDaemonInit.cpp:113] Leader has not been elected, sleep 1s
E20221013 06:12:41.005990    63 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-metad-1.nebula-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
E20221013 06:12:41.007997    63 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-metad-2.nebula-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
I20221013 06:12:41.188194     1 MetaDaemonInit.cpp:113] Leader has not been elected, sleep 1s
I20221013 06:12:42.188305     1 MetaDaemonInit.cpp:113] Leader has not been elected, sleep 1s
E20221013 06:12:42.704993    64 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-metad-1.nebula-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2

INFO log from nebula-metad-1:

Log file created at: 2022/10/13 06:12:53
Running on machine: nebula-metad-1
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20221013 06:12:53.455024     1 MetaDaemon.cpp:135] localhost = "nebula-metad-1.nebula-metad-headless.nebula.svc.cluster.local":9559
I20221013 06:12:53.458636     1 NebulaStore.cpp:51] Start the raft service...
I20221013 06:12:53.459043     1 NebulaSnapshotManager.cpp:25] Send snapshot is rate limited to 10485760 for each part by default
I20221013 06:12:53.493963     1 RaftexService.cpp:46] Start raft service on 9560
I20221013 06:12:53.494081     1 NebulaStore.cpp:85] Scan the local path, and init the spaces_
I20221013 06:12:53.494138     1 NebulaStore.cpp:92] Scan path "/usr/local/nebula/data/meta/nebula/0"
I20221013 06:12:53.494146     1 NebulaStore.cpp:271] Init data from partManager for "nebula-metad-1.nebula-metad-headless.nebula.svc.cluster.local":9559
I20221013 06:12:53.494163     1 NebulaStore.cpp:387] Create data space 0
I20221013 06:12:53.523059     1 RocksEngine.cpp:97] open rocksdb on /usr/local/nebula/data/meta/nebula/0/data
I20221013 06:12:53.533579     1 NebulaStore.cpp:459] Space 0, part 0 has been added, asLearner 0
I20221013 06:12:53.533618     1 NebulaStore.cpp:78] Register handler...
I20221013 06:12:53.533627     1 MetaDaemonInit.cpp:101] Waiting for the leader elected...
I20221013 06:12:53.533634     1 MetaDaemonInit.cpp:113] Leader has not been elected, sleep 1s
I20221013 06:12:53.842078    62 ThriftClientManager-inl.h:67] resolve "nebula-metad-0.nebula-metad-headless.nebula.svc.cluster.local":9560 as "10.214.2.127":9560
E20221013 06:12:53.850257    62 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-metad-2.nebula-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
I20221013 06:12:54.533761     1 KVBasedClusterIdMan.h:109] There is no clusterId existed in kvstore!
I20221013 06:12:54.533792     1 MetaDaemonInit.cpp:129] I am follower, wait for the leader's clusterId
I20221013 06:12:54.533797     1 MetaDaemonInit.cpp:131] Waiting for the leader's clusterId
I20221013 06:12:55.533919     1 KVBasedClusterIdMan.h:109] There is no clusterId existed in kvstore!
I20221013 06:12:55.533950     1 MetaDaemonInit.cpp:131] Waiting for the leader's clusterId

INFO log from nebula-metad-2:

Log file created at: 2022/10/13 06:13:09
Running on machine: nebula-metad-2
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20221013 06:13:09.043592     1 MetaDaemon.cpp:135] localhost = "nebula-metad-2.nebula-metad-headless.nebula.svc.cluster.local":9559
I20221013 06:13:09.047709     1 NebulaStore.cpp:51] Start the raft service...
I20221013 06:13:09.048156     1 NebulaSnapshotManager.cpp:25] Send snapshot is rate limited to 10485760 for each part by default
I20221013 06:13:09.069625     1 RaftexService.cpp:46] Start raft service on 9560
I20221013 06:13:09.069696     1 NebulaStore.cpp:85] Scan the local path, and init the spaces_
I20221013 06:13:09.069732     1 NebulaStore.cpp:92] Scan path "/usr/local/nebula/data/meta/nebula/0"
I20221013 06:13:09.069741     1 NebulaStore.cpp:271] Init data from partManager for "nebula-metad-2.nebula-metad-headless.nebula.svc.cluster.local":9559
I20221013 06:13:09.069756     1 NebulaStore.cpp:387] Create data space 0
I20221013 06:13:09.117659     1 RocksEngine.cpp:97] open rocksdb on /usr/local/nebula/data/meta/nebula/0/data
I20221013 06:13:09.128047     1 NebulaStore.cpp:459] Space 0, part 0 has been added, asLearner 0
I20221013 06:13:09.128072     1 NebulaStore.cpp:78] Register handler...
I20221013 06:13:09.128080     1 MetaDaemonInit.cpp:101] Waiting for the leader elected...
I20221013 06:13:09.128087     1 MetaDaemonInit.cpp:113] Leader has not been elected, sleep 1s
I20221013 06:13:10.128196     1 KVBasedClusterIdMan.h:109] There is no clusterId existed in kvstore!
I20221013 06:13:10.128226     1 MetaDaemonInit.cpp:129] I am follower, wait for the leader's clusterId
I20221013 06:13:10.128232     1 MetaDaemonInit.cpp:131] Waiting for the leader's clusterId
I20221013 06:13:11.128335     1 KVBasedClusterIdMan.h:109] There is no clusterId existed in kvstore!
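The log lines above follow the layout printed in each file header, `[IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg`, and some long lines were wrapped when pasted. As a rough aid for reading such pastes, here is a minimal Python sketch that re-joins wrapped lines and filters the error entries. The names `ENTRY`, `Entry` and `parse_glog` are hypothetical helpers invented for this illustration, not part of Nebula Graph or nebula-operator, and the sketch assumes every line that does not match the prefix is a continuation of the previous entry.

```python
import re
from typing import Iterable, List, NamedTuple

# Prefix taken from the log header:
# [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
ENTRY = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{8}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})"
    r"\s+(?P<thread>\d+) (?P<source>[^\]\s]+:\d+)\] (?P<msg>.*)$"
)


class Entry(NamedTuple):
    level: str   # I, W, E or F
    date: str    # yyyymmdd
    time: str    # hh:mm:ss.uuuuuu
    thread: int
    source: str  # file:line
    msg: str


def parse_glog(lines: Iterable[str]) -> List[Entry]:
    """Parse log lines, gluing wrapped continuation lines back together."""
    entries: List[Entry] = []
    for raw in lines:
        line = raw.rstrip("\n")
        m = ENTRY.match(line)
        if m:
            entries.append(Entry(m["level"], m["date"], m["time"],
                                 int(m["thread"]), m["source"], m["msg"]))
        elif entries and line:
            # Assumption: a non-matching line is the wrapped tail of the
            # previous entry. Leading whitespace is kept so that
            # "...file or" + " directory" joins correctly.
            last = entries[-1]
            entries[-1] = last._replace(msg=last.msg + line)
        # Header lines before the first entry are skipped.
    return entries


# Three entries copied from the nebula-metad-0 log, two of them wrapped.
sample = [
    'E20221013 06:12:39.137632     1 FileUtils.cpp:377] Failed to read the '
    'directory "/usr/local/nebula/data/meta/nebula" (2): No such file or',
    ' directory',
    'I20221013 06:12:39.188000     1 MetaDaemonInit.cpp:113] Leader has not '
    'been elected, sleep 1s',
    "E20221013 06:12:39.435500    62 ThriftClientManager-inl.h:70] Failed to "
    "resolve address for 'nebula-metad-1.nebula-metad-headless.nebula.",
    "svc.cluster.local': Name or service not known (error=-2): Unknown error -2",
]

for entry in parse_glog(sample):
    if entry.level == "E":
        print(entry.time, entry.source, entry.msg)
```

Filtering on `level == "E"` is only a convenience for scanning. In these logs the informational lines such as "Waiting for the leader's clusterId" matter just as much for seeing where startup stalls.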

MegaByte875 commented 1 year ago

@codingknees OK, we'll try to run a test on Huawei Cloud.

MegaByte875 commented 1 year ago

@codingknees I've tested on Huawei Cloud and ran into the same error, "put clusterid failed". We are analyzing it.

MegaByte875 commented 1 year ago

@codingknees We've found the cause of the problem, and I will submit a PR to fix it. In the meantime, you can update the chart's nebula-cluster/values.yaml: set metad.dataStorage to 10Gi and metad.logStorage to 2Gi.

MegaByte875 commented 1 year ago

The problem was fixed in operator 1.3.0.