gluster / gluster-kubernetes

GlusterFS Native Storage Service for Kubernetes

stat: cannot stat '/var/lib/heketi/heketi.db': No such file or directory #516

Open shootkin opened 6 years ago

shootkin commented 6 years ago

Hello! I have a heketi pod on my Kubernetes cluster. Today the node it was running on was rebooted, so the heketi pod was re-assigned to another node. After that, if I run kubectl logs heketi-7947d8f8b-vxszl, I see the following output:

stat: cannot stat '/var/lib/heketi/heketi.db': No such file or directory
Heketi v7.0.0-5-gc10cbd1-release-7
[heketi] INFO 2018/09/06 10:15:54 Loaded kubernetes executor
[heketi] INFO 2018/09/06 10:15:54 GlusterFS Application Loaded
Listening on port 8080
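
For reference, a quick way to confirm whether the db file actually exists inside the new pod is to list heketi's state directory directly (pod name taken from the logs command above):

# Hypothetical diagnostic: list the contents of heketi's state directory
kubectl exec -ti heketi-7947d8f8b-vxszl -- ls -l /var/lib/heketi/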

If I run kubectl get pods | grep gluster and kubectl exec -ti glusterfs-c7tpd gluster peer status, I see that all glusterfs pods are running and all nodes are live:

glusterfs-c7tpd 1/1 Running 2 49d
glusterfs-gs69n 1/1 Running 1 28d
glusterfs-lvnkt 1/1 Running 2 40d
glusterfs-mzb84 1/1 Running 0 49d

Number of Peers: 3

Hostname: hostname1
Uuid: d35ffe48-cca9-4c78-8a67-e4ac34afb9ff
State: Peer in Cluster (Connected)

Hostname: hostname2
Uuid: be609485-84e1-4488-8f26-01db681f4f35
State: Peer in Cluster (Connected)

Hostname: hostname3
Uuid: df45df31-fb69-4a0a-8ea8-95542d72ae8f
State: Peer in Cluster (Connected)

All volumes are in live state too.
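
For reference, brick state can be confirmed the same way from inside one of the glusterfs pods (pod name from the listing above); a healthy volume shows Y in the Online column for every brick:

# Show per-brick status for all volumes
kubectl exec -ti glusterfs-c7tpd -- gluster volume status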

So the question is: how do I fix this bug?

shootkin commented 6 years ago

Hm... very strange. I checked the status of the heketidbstorage volume, and it turned out that 1 of its 3 bricks was offline. So I stopped the volume, started it again, and then recreated the heketi pod. Now everything works fine, but this is a very annoying bug.
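
For anyone hitting the same thing, a minimal sketch of the recovery steps described above (volume and pod names are the ones from this thread; the gluster commands run inside a glusterfs pod):

# Restart the volume backing heketi's db; script mode skips the y/n prompt
kubectl exec -ti glusterfs-c7tpd -- gluster --mode=script volume stop heketidbstorage
kubectl exec -ti glusterfs-c7tpd -- gluster volume start heketidbstorage

# Delete the heketi pod; its Deployment schedules a fresh replacement
kubectl delete pod heketi-7947d8f8b-vxszl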

phlogistonjohn commented 6 years ago

Having 1 out of 3 bricks offline is not great, but it would not lead to that particular error message from stat, IMO. Even if the volume were read-only, stat should not report "No such file or directory" unless the path didn't exist.

Did the content of the heketi db survive pod migration from node to node? If not, that's a bigger issue that we should look into.

Otherwise, if you are just seeing that message the first time the heketi pod starts, it's not a big issue, but we could fix it. I just want to make sure we are not focusing on something small (the error output) instead of something bigger, like heketidbstorage not replicating correctly on your setup.
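
A minimal way to check exactly that, assuming the gluster CLI is run from inside one of the glusterfs pods:

# Confirm the db volume is replica 3 and every brick is online
gluster volume info heketidbstorage
gluster volume status heketidbstorage

# List entries still pending heal; a healthy volume reports none
gluster volume heal heketidbstorage info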

shootkin commented 6 years ago

"Having 1 out 3 bricks offline is not great but it would not lead to that particular error message from stat IMO" - I'm sure that it is an issue with the 1 out of 3 bricks offline, cause all my persistent volumes became unvisible to pods. When I restarted ALL volumes manually by command "for i in $(gluster volume list); do echo 'y' | gluster volume stop $i && gluster volume start $i;" and deleted old pods, new pods started succesfully interacting with persistence volumes. I don't know why I see such behavior. I have 4 nodes in my gluster cluster, with Replica 3, so 1 brick offline mustn't cause an error, but it cause.