gluster / gluster-kubernetes

GlusterFS Native Storage Service for Kubernetes
Apache License 2.0

bricks not mounting after node reboot #499

Closed timkock closed 5 years ago

timkock commented 6 years ago

Excellent work in this repository, really awesome. It was a little tricky to get working with acs-engine and premium managed disks in Azure, but it has delivered a lot of joy to an autoscaling machine learning environment built on the excellent work from dask and kubernetes.

I have encountered a problem where I simulated a crash of the storage nodes (3 VMs in a VMSS scale set, each with a 256 GB SSD attached) by deallocating them and bringing them back online via the portal.

Status of volume: heketidbstorage
Gluster process                             TCP Port  RDMA Port  Online  Pid

Brick 10.240.0.34:/var/lib/heketi/mounts/vg_51efbe5a164aea2fa33494e18a9ffbf9/brick_29d842fea9638032b7f56a9e0535d3fe/brick    N/A  N/A  N  N/A
Brick 10.240.0.65:/var/lib/heketi/mounts/vg_d3637b69dbc9f7f16c2ec57970e91ec6/brick_05f7605bb3a9cdfa23688bef3d437bc4/brick    N/A  N/A  N  N/A
Brick 10.240.0.96:/var/lib/heketi/mounts/vg_ff83b346d75d6f006058760a9ea7612e/brick_12dc5a9894b5d1d35fc758bf93a3c192/brick    N/A  N/A  N  N/A
Self-heal Daemon on localhost    N/A  N/A  Y  1463
Self-heal Daemon on 10.240.0.96  N/A  N/A  Y  1008
Self-heal Daemon on 10.240.0.34  N/A  N/A  Y  1100

Task Status of Volume heketidbstorage

There are no active volume tasks

Status of volume: vol_0914949f0473811c7c52cd063485778e
Gluster process                             TCP Port  RDMA Port  Online  Pid

Brick 10.240.0.96:/var/lib/heketi/mounts/vg_ff83b346d75d6f006058760a9ea7612e/brick_c53b8198f3bd0d24b157b1e3f20c462b/brick    N/A  N/A  N  N/A
Brick 10.240.0.34:/var/lib/heketi/mounts/vg_bce1174abeceb199416b126276164dda/brick_42ad33ec09f0ec78378a32ca55929056/brick    N/A  N/A  N  N/A
Brick 10.240.0.65:/var/lib/heketi/mounts/vg_ca08d758b917b9b59d7a118701109e19/brick_59a845a8bec3ec44fe48535f69cf8749/brick    N/A  N/A  N  N/A
Brick 10.240.0.96:/var/lib/heketi/mounts/vg_c64eb42a12eccd3a2b38d09e8239b395/brick_27cc9ae0695607a1412551f4b19c7795/brick    N/A  N/A  N  N/A
Brick 10.240.0.34:/var/lib/heketi/mounts/vg_731de882b06e8688547a776b17f2b121/brick_8ddce90c41efa54aca645d486acc6363/brick    N/A  N/A  N  N/A
Brick 10.240.0.65:/var/lib/heketi/mounts/vg_9818c228513387be593ae0f664f52637/brick_bf61579ff96f10041bf64b0ef4880245/brick    N/A  N/A  N  N/A
Self-heal Daemon on localhost    N/A  N/A  Y  1463
Self-heal Daemon on 10.240.0.96  N/A  N/A  Y  1008
Self-heal Daemon on 10.240.0.34  N/A  N/A  Y  1100

Task Status of Volume vol_0914949f0473811c7c52cd063485778e

There are no active volume tasks



* There is nothing in `/etc/fstab` in the pod, and when I try to mount what is listed in `/var/lib/heketi/fstab` it doesn't work.

My question is whether you can point me in the right direction for re-initializing the gluster cluster properly. The closest existing issue is https://github.com/gluster/gluster-kubernetes/issues/366, but it doesn't provide a lead to solving this.

I would like the persistent volume claims to keep working when the storage nodes (gluster pods and heketi pod) are deallocated and then boot up again (storage remains attached and the disks still contain data).
timkock commented 6 years ago

What can I do to make it easier for someone to help me with this? I already feel quite bad, haha, that I am bothering you with this, as the work on this repo is already awesome.

shalakhin commented 6 years ago

It would be great to have an update on how to resolve this issue.

phlogistonjohn commented 6 years ago

I missed this issue when it was originally raised.

Since you mention gluster pods I will assume you're using containerized gluster as opposed to external gluster. Can you please update with your gluster container image and version?

As you discovered, the gluster pods don't use the typical /etc/fstab. The file /var/lib/heketi/fstab serves the same purpose: a custom startup script in the container uses it to mount the brick file systems. You mention that it doesn't work; could you please provide more details about what errors occurred when you tried to mount the content of the file yourself?

You mention Azure. I have heard that Azure does not keep device names like /dev/sdX stable. Is it possible that, in your test, devices that were referenced by one name came back under another name after the simulated crash?
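A rough diagnostic sketch of how one might compare what the heketi fstab expects against what the kernel currently exposes (run inside a gluster pod; the exact set of tools available depends on the container image):

```sh
# What the container startup script tries to mount at boot
cat /var/lib/heketi/fstab

# What LVM / device-mapper actually expose right now
lsblk -o NAME,KNAME,SIZE,TYPE,MOUNTPOINT
ls -l /dev/mapper/
pvs; vgs; lvs

# On Azure, compare against stable identifiers instead of /dev/sdX names
ls -l /dev/disk/by-id/ 2>/dev/null
```

If pvs reports missing physical volumes or /dev/mapper shows no vg_* entries, that would suggest the LVM stack never came back up after the reboot, which would match the offline bricks above.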

OlivierMary commented 6 years ago

Hi there, I have the same problem.

Before node reboot:

[root@k8s-01 /]# lsblk
NAME                                                                              MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda                                                                                 8:0    0  40G  0 disk
└─sda1                                                                              8:1    0  40G  0 part /var/lib/glusterd
sdb                                                                                 8:16   0  40G  0 disk
├─vg_09203dbe2b452ba01d8289ee5a1489b1-tp_8a586f6fdc926a5711ff0ebda31a93e8_tmeta   253:0    0  12M  0 lvm
│ └─vg_09203dbe2b452ba01d8289ee5a1489b1-tp_8a586f6fdc926a5711ff0ebda31a93e8-tpool 253:2    0   2G  0 lvm
│   ├─vg_09203dbe2b452ba01d8289ee5a1489b1-tp_8a586f6fdc926a5711ff0ebda31a93e8     253:3    0   2G  0 lvm
│   └─vg_09203dbe2b452ba01d8289ee5a1489b1-brick_8a586f6fdc926a5711ff0ebda31a93e8  253:4    0   2G  0 lvm  /var/lib/heketi/mounts/vg_09203dbe2b452ba01d8289ee5a1489b1/brick_8a586f6fdc926a5711ff0ebda31a93e8
└─vg_09203dbe2b452ba01d8289ee5a1489b1-tp_8a586f6fdc926a5711ff0ebda31a93e8_tdata   253:1    0   2G  0 lvm
  └─vg_09203dbe2b452ba01d8289ee5a1489b1-tp_8a586f6fdc926a5711ff0ebda31a93e8-tpool 253:2    0   2G  0 lvm
    ├─vg_09203dbe2b452ba01d8289ee5a1489b1-tp_8a586f6fdc926a5711ff0ebda31a93e8     253:3    0   2G  0 lvm
    └─vg_09203dbe2b452ba01d8289ee5a1489b1-brick_8a586f6fdc926a5711ff0ebda31a93e8  253:4    0   2G  0 lvm  /var/lib/heketi/mounts/vg_09203dbe2b452ba01d8289ee5a1489b1/brick_8a586f6fdc926a5711ff0ebda31a93e8

After node reboot + daemon restart:

[root@k8s-01 /]# lsblk
NAME                                                                              MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda                                                                                 8:0    0  40G  0 disk
└─sda1                                                                              8:1    0  40G  0 part /var/lib/glusterd
sdb                                                                                 8:16   0  40G  0 disk

How do I mount everything in /var/lib/heketi/fstab?

I tried:

[root@k8s-01 /]# mount -a --fstab /var/lib/heketi/fstab
mount: special device /dev/mapper/vg_67043fcaa37dcb3b5560a4d70cdde6e8-brick_bbdf160150776be346a650c278cb101b does not exist
mount: special device /dev/mapper/vg_89aa25beca04a1ed8b26f1d3d916abd2-brick_a42ca5903123ba1a9204af946aa32d55 does not exist
mount: special device /dev/mapper/vg_09203dbe2b452ba01d8289ee5a1489b1-brick_8a586f6fdc926a5711ff0ebda31a93e8 does not exist

...

[root@k8s-01 /]# gluster-setup.sh
mkdir: cannot create directory ‘/var/log/glusterfs/container’: File exists
/etc/glusterfs is not empty
/var/log/glusterfs is not empty
/var/lib/glusterd is not empty
mount: special device /dev/mapper/vg_67043fcaa37dcb3b5560a4d70cdde6e8-brick_bbdf160150776be346a650c278cb101b does not exist
mount: special device /dev/mapper/vg_89aa25beca04a1ed8b26f1d3d916abd2-brick_a42ca5903123ba1a9204af946aa32d55 does not exist
mount: special device /dev/mapper/vg_09203dbe2b452ba01d8289ee5a1489b1-brick_8a586f6fdc926a5711ff0ebda31a93e8 does not exist

/usr/sbin/gluster-setup.sh: line 83: [: 4 /var/log/glusterfs/container/failed_bricks: integer expression expected
Script Ran Successfully

...

[root@k8s-01 /]# lsblk
NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda      8:0    0  40G  0 disk
└─sda1   8:1    0  40G  0 part /var/lib/heketi
sdb      8:16   0  40G  0 disk
OlivierMary commented 6 years ago

Re,

I found the cause in my case: the device-mapper kernel modules were not persisted across the reboot.

phlogistonjohn commented 6 years ago

@OlivierMary depending on your OS/distro, you want to make sure the device-mapper modules are loaded. We try to let the pod auto-load the kernel modules (see https://github.com/gluster/gluster-kubernetes/blob/master/deploy/kube-templates/glusterfs-daemonset.yaml#L63), but if those modules are not loaded even after your pod starts, it may be that either that line or the mount point does not exist in the version you used.
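For anyone checking this on the host, a minimal sketch of verifying, loading, and persisting the device-mapper modules that gluster-kubernetes expects (dm_snapshot, dm_mirror, dm_thin_pool); adjust the list and file name if your setup differs:

```sh
# See which device-mapper modules are currently loaded
lsmod | grep -E '^dm_(snapshot|mirror|thin_pool)'

# Load them for the current boot
modprobe dm_snapshot
modprobe dm_mirror
modprobe dm_thin_pool

# Persist across reboots on systemd-based distros (file name is an assumption)
cat > /etc/modules-load.d/glusterfs.conf <<'EOF'
dm_snapshot
dm_mirror
dm_thin_pool
EOF
```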

daskanu commented 5 years ago

We recently faced exactly the same issue with our GlusterFS cluster running in Kubernetes on Azure. I found out that the mappings in /dev/mapper were missing compared to the underlying host system when I executed "blkid" in the GlusterFS pods.

By adding the host path "/dev" to the DaemonSet and restarting all GlusterFS pods, I got our cluster back online.
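For reference, a rough sketch of how that change might look as a patch; the namespace, DaemonSet name, and volume name below are placeholders and will differ between deployments:

```sh
# Append a hostPath volume for /dev and mount it into the gluster container.
# Adjust namespace ("glusterfs") and DaemonSet name ("glusterfs") to your setup.
kubectl -n glusterfs patch daemonset glusterfs --type=json -p '[
  {"op": "add", "path": "/spec/template/spec/volumes/-",
   "value": {"name": "glusterfs-dev", "hostPath": {"path": "/dev"}}},
  {"op": "add", "path": "/spec/template/spec/containers/0/volumeMounts/-",
   "value": {"name": "glusterfs-dev", "mountPath": "/dev"}}
]'
```

The GlusterFS pods then need to be restarted (as noted above) so the containers see the full /dev from the host.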

I hope this information is helpful.

OS: CentOS Linux release 7.5.1804
Distribution (Kubernetes): OpenShift OKD 3.11

phlogistonjohn commented 5 years ago

@nixpanic do you think the comment by @daskanu here could be related to what you did in 2f1114a, similar to #542 ?

nixpanic commented 5 years ago

Hmm, yes, that seems possible. I guess we'll need to add /dev/mapper as well. Users may have configured multipath (or other device-mapper targets) and in that case they would want to pass /dev/mapper/... device names.

sigh

lasselj commented 5 years ago

I found that when udev has created the entries in /dev/mapper as symlinks, they are not available to the gluster pod. If I manually delete them (`rm -f /dev/mapper/vg_*` #CarefulHere) and then recreate them by running `vgscan --mknodes`, which then falls back to "direct link creation", the problem is resolved. It is of course necessary to restart the gluster pods on the individual nodes afterwards.

I.e. this does not work for me:

[root@srv04 mapper]# ls -al
total 0
drwxr-xr-x  2 root root      200 Jun 27 15:22 .
drwxr-xr-x 21 root root     4400 Jul  1 12:04 ..
crw-------  1 root root  10, 236 Jun 27 15:22 control
lrwxrwxrwx  1 root root        7 Jun 27 15:22 fedora-root -> ../dm-0
lrwxrwxrwx  1 root root        7 Jun 27 15:22 fedora-swap -> ../dm-1
lrwxrwxrwx  1 root root        7 Jun 27 15:22 vg_4cca2fc8cc3671bcf8c482f156f0438f-brick_5b1459e816925bc320e5a0ef7d284fd9 -> ../dm-6
lrwxrwxrwx  1 root root        7 Jun 27 15:22 vg_4cca2fc8cc3671bcf8c482f156f0438f-tp_5b1459e816925bc320e5a0ef7d284fd9 -> ../dm-5
lrwxrwxrwx  1 root root        7 Jun 27 15:22 vg_4cca2fc8cc3671bcf8c482f156f0438f-tp_5b1459e816925bc320e5a0ef7d284fd9_tdata -> ../dm-3
lrwxrwxrwx  1 root root        7 Jun 27 15:22 vg_4cca2fc8cc3671bcf8c482f156f0438f-tp_5b1459e816925bc320e5a0ef7d284fd9_tmeta -> ../dm-2
lrwxrwxrwx  1 root root        7 Jun 27 15:22 vg_4cca2fc8cc3671bcf8c482f156f0438f-tp_5b1459e816925bc320e5a0ef7d284fd9-tpool -> ../dm-4

but this works a treat:

[root@srv04 mapper]# ls -al
total 0
drwxr-xr-x  2 root root      300 Jul  1 18:01 .
drwxr-xr-x 21 root root     4400 Jul  1 12:04 ..
crw-------  1 root root  10, 236 Jun 27 15:22 control
lrwxrwxrwx  1 root root        7 Jun 27 15:22 fedora-root -> ../dm-0
lrwxrwxrwx  1 root root        7 Jun 27 15:22 fedora-swap -> ../dm-1
brw-rw----  1 root disk 253,   6 Jul  1 18:01 vg_4cca2fc8cc3671bcf8c482f156f0438f-brick_5b1459e816925bc320e5a0ef7d284fd9
brw-rw----  1 root disk 253,  11 Jul  1 18:01 vg_4cca2fc8cc3671bcf8c482f156f0438f-brick_9825c39fb4efe83c80bb8aef85eef8ee
brw-rw----  1 root disk 253,   5 Jul  1 18:01 vg_4cca2fc8cc3671bcf8c482f156f0438f-tp_5b1459e816925bc320e5a0ef7d284fd9
brw-rw----  1 root disk 253,   3 Jul  1 18:01 vg_4cca2fc8cc3671bcf8c482f156f0438f-tp_5b1459e816925bc320e5a0ef7d284fd9_tdata
brw-rw----  1 root disk 253,   2 Jul  1 18:01 vg_4cca2fc8cc3671bcf8c482f156f0438f-tp_5b1459e816925bc320e5a0ef7d284fd9_tmeta
brw-rw----  1 root disk 253,   4 Jul  1 18:01 vg_4cca2fc8cc3671bcf8c482f156f0438f-tp_5b1459e816925bc320e5a0ef7d284fd9-tpool
brw-rw----  1 root disk 253,  10 Jul  1 18:01 vg_4cca2fc8cc3671bcf8c482f156f0438f-tp_9825c39fb4efe83c80bb8aef85eef8ee
brw-rw----  1 root disk 253,   8 Jul  1 18:01 vg_4cca2fc8cc3671bcf8c482f156f0438f-tp_9825c39fb4efe83c80bb8aef85eef8ee_tdata
brw-rw----  1 root disk 253,   7 Jul  1 18:01 vg_4cca2fc8cc3671bcf8c482f156f0438f-tp_9825c39fb4efe83c80bb8aef85eef8ee_tmeta
brw-rw----  1 root disk 253,   9 Jul  1 18:01 vg_4cca2fc8cc3671bcf8c482f156f0438f-tp_9825c39fb4efe83c80bb8aef85eef8ee-tpool

and the way to get from the first state to the second is simply:

[root@srv04 mapper]# rm -f vg_4cca2fc8cc3671bcf8c482f156f0438f-*
[root@srv04 mapper]# vgscan --mknodes
  Reading volume groups from cache.
  Found volume group "vg_4cca2fc8cc3671bcf8c482f156f0438f" using metadata type lvm2
  Found volume group "fedora" using metadata type lvm2
  The link /dev/vg_4cca2fc8cc3671bcf8c482f156f0438f/brick_9825c39fb4efe83c80bb8aef85eef8ee should have been created by udev but it was not found. Falling back to direct link creation.
  Command failed with status code 5.

Hope that helps.
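A quick way to spot the problematic entries before deleting anything is to look for /dev/mapper entries for the heketi volume groups that are not real block device nodes; a small sketch (run on the host):

```sh
# Lists any vg_* entries under /dev/mapper that are symlinks (or anything
# other than a block device node); these are the ones the gluster pod misses.
find /dev/mapper -maxdepth 1 -name 'vg_*' ! -type b -exec ls -l {} +
```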

ook commented 5 years ago

Kudos @lasselj, thank you for your instructions: you helped me bring back my heketidbstorage volume, which had gone read-only due to missing bricks.

pavelzamyatin commented 5 years ago

Hey, @lasselj! Thanks a lot, mate. You saved my day!