gluster / gluster-kubernetes

GlusterFS Native Storage Service for Kubernetes
Apache License 2.0

Glusterd Failed to initialize IB Device #559

Closed. Collin-Moore closed this issue 5 years ago.

Collin-Moore commented 5 years ago

So I'm trying to set up GlusterFS within our 4-node Kubernetes cluster (2 masters and 2 workers). My team is trying to automate the deployment and teardown of our cluster, and we are now at the step of automating GlusterFS setup and teardown. Before this I had a working cluster that could dynamically provision volumes through heketi, but we need to be able to do this over and over, so I attempted to clean GlusterFS off the VMs and rebuild. In our case this involves recreating our block device as well.

(As an aside, our team does not have full control of our infrastructure, which is why we want an alternative to reimaging the VMs or restoring from a snapshot: we cannot start either of those processes whenever we want. This is for an academic project, so keep that in mind as you read; I have a feeling these aren't best practices.)

Before I get into details here is environment info:

# kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.3", GitCommit:"435f92c719f279a3a67808c80521ea17d5715c66", GitTreeState:"clean", BuildDate:"2018-11-26T12:46:57Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.3", GitCommit:"435f92c719f279a3a67808c80521ea17d5715c66", GitTreeState:"clean", BuildDate:"2018-11-26T12:46:57Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
# docker -v
Docker version 18.06.1-ce, build e68fc7a

I also modified the glusterfs-daemonset.yaml file to add the NoSchedule toleration to the PodSpec so that we can run GlusterFS on the master nodes (not a great idea, but it's our current strategy).

The current problem: after running gk-deploy to completion, with all the pods running, I check the block devices using lsblk and see that every node except one of the two workers (which worker it is flips between worker1 and worker2 each time I rebuild) looks like this:

loop0                                                                               7:0    0  9.8G  0 loop
├─vg_f53b0be1ddf4397844e5154a994733a5-tp_41dba2cc5a8c30e6dcd850f0bba031dd_tmeta   253:0    0   12M  0 lvm
│ └─vg_f53b0be1ddf4397844e5154a994733a5-tp_41dba2cc5a8c30e6dcd850f0bba031dd-tpool 253:2    0    2G  0 lvm
│   ├─vg_f53b0be1ddf4397844e5154a994733a5-tp_41dba2cc5a8c30e6dcd850f0bba031dd     253:3    0    2G  0 lvm
│   └─vg_f53b0be1ddf4397844e5154a994733a5-brick_41dba2cc5a8c30e6dcd850f0bba031dd  253:4    0    2G  0 lvm
└─vg_f53b0be1ddf4397844e5154a994733a5-tp_41dba2cc5a8c30e6dcd850f0bba031dd_tdata   253:1    0    2G  0 lvm
  └─vg_f53b0be1ddf4397844e5154a994733a5-tp_41dba2cc5a8c30e6dcd850f0bba031dd-tpool 253:2    0    2G  0 lvm
    ├─vg_f53b0be1ddf4397844e5154a994733a5-tp_41dba2cc5a8c30e6dcd850f0bba031dd     253:3    0    2G  0 lvm
    └─vg_f53b0be1ddf4397844e5154a994733a5-brick_41dba2cc5a8c30e6dcd850f0bba031dd  253:4    0    2G  0 lvm

Whereas on that one worker node I get this output from lsblk:

loop0    7:0    0  9.8G  0 loop

Below is the output of vgdisplay on the good nodes:

VG Name               vg_f53b0be1ddf4397844e5154a994733a5
System ID
Format                lvm2
Metadata Areas        1
Metadata Sequence No  6
VG Access             read/write
VG Status             resizable
MAX LV                0
Cur LV                2
Open LV               1
Max PV                0
Cur PV                1
Act PV                1
VG Size               <9.64 GiB
PE Size               4.00 MiB
Total PE              2467
Alloc PE / Size       518 / 2.02 GiB
Free  PE / Size       1949 / 7.61 GiB
VG UUID               MiGXGI-blMS-1rg1-xj0D-unl6-3FMs-5K58uN

This is what the bad node looks like:

  --- Volume group ---
VG Name               vg_cc2f1031f1ee6dff93e2f211bb77fcba
System ID
Format                lvm2
Metadata Areas        1
Metadata Sequence No  1
VG Access             read/write
VG Status             resizable
MAX LV                0
Cur LV                0
Open LV               0
Max PV                0
Cur PV                1
Act PV                1
VG Size               <9.64 GiB
PE Size               4.00 MiB
Total PE              2467
Alloc PE / Size       0 / 0
Free  PE / Size       2467 / <9.64 Gi
VG UUID               MdRb6z-hcMq-nh30-jaCh-cikV-T1nf-XJzn4Y

The issue appears to be that glusterd cannot mount my block device. Here are the first few glusterd log lines on the problem node:

[2019-01-18 05:31:28.115094] I [MSGID: 100030] [glusterfsd.c:2741:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 4.1.5 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
[2019-01-18 05:31:28.147838] I [MSGID: 106478] [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors set to 65536
[2019-01-18 05:31:28.147884] I [MSGID: 106479] [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working directory
[2019-01-18 05:31:28.147896] I [MSGID: 106479] [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file working directory
[2019-01-18 05:31:28.169239] W [MSGID: 103071] [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
[2019-01-18 05:31:28.169268] W [MSGID: 103055] [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device
[2019-01-18 05:31:28.169279] W [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2019-01-18 05:31:28.169372] W [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2019-01-18 05:31:28.169385] E [MSGID: 106244] [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2019-01-18 05:31:29.542647] E [MSGID: 101032] [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
[2019-01-18 05:31:29.542709] E [MSGID: 101032] [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
[2019-01-18 05:31:29.542716] I [MSGID: 106514] [glusterd-store.c:2262:glusterd_restore_op_version] 0-management: Detected new install. Setting op-version to maximum : 40100
[2019-01-18 05:31:29.551575] I [MSGID: 106194] [glusterd-store.c:3849:glusterd_store_retrieve_missed_snaps_list] 0-management: No missed snaps list.
Final graph:
+------------------------------------------------------------------------------+
  1: volume management
  2:     type mgmt/glusterd
  3:     option rpc-auth.auth-glusterfs on
  4:     option rpc-auth.auth-unix on
  5:     option rpc-auth.auth-null on
  6:     option rpc-auth-allow-insecure on
  7:     option transport.listen-backlog 10
  8:     option event-threads 1
  9:     option ping-timeout 0
 10:     option transport.socket.read-fail-log off
 11:     option transport.socket.keepalive-interval 2
 12:     option transport.socket.keepalive-time 10
 13:     option transport-type rdma
 14:     option working-directory /var/lib/glusterd
 15: end-volume
 16:
+------------------------------------------------------------------------------+
[2019-01-18 05:31:29.552171] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-01-18 05:32:41.124071] I [MSGID: 106163] [glusterd-handshake.c:1356:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 40100
[2019-01-18 05:32:41.124132] E [MSGID: 101032] [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
[2019-01-18 05:32:41.124189] I [MSGID: 106477] [glusterd.c:190:glusterd_uuid_generate_save] 0-management: generated UUID: a3b3b5f6-1a02-4ed4-bdc1-8fb3549741ed
[2019-01-18 05:32:41.142917] I [MSGID: 106490] [glusterd-handler.c:2899:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: be2324ce-17db-465c-bac2-5c9e877cec81
[2019-01-18 05:32:41.168959] I [MSGID: 106128] [glusterd-handler.c:2934:__glusterd_handle_probe_query] 0-glusterd: Unable to find peerinfo for host: master2 (24007)
[2019-01-18 05:32:41.178601] W [MSGID: 106061] [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2019-01-18 05:32:41.178644] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-01-18 05:32:41.181826] I [MSGID: 106498] [glusterd-handler.c:3561:glusterd_friend_add] 0-management: connect returned 0
[2019-01-18 05:32:41.181884] I [MSGID: 106493] [glusterd-handler.c:2962:__glusterd_handle_probe_query] 0-glusterd: Responded to master2, op_ret: 0, op_errno: 0, ret: 0
[2019-01-18 05:32:41.182552] I [MSGID: 106490] [glusterd-handler.c:2548:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: be2324ce-17db-465c-bac2-5c9e877cec81
[2019-01-18 05:32:41.202085] I [MSGID: 106493] [glusterd-handler.c:3811:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to master2 (0), ret: 0, op_ret: 0
[2019-01-18 05:32:41.236220] I [MSGID: 106511] [glusterd-rpc-ops.c:262:__glusterd_probe_cbk] 0-management: Received probe resp from uuid: be2324ce-17db-465c-bac2-5c9e877cec81, host: master2
[2019-01-18 05:32:41.236247] I [MSGID: 106511] [glusterd-rpc-ops.c:422:__glusterd_probe_cbk] 0-glusterd: Received resp to probe req

The next thing I should probably explain is what my cleanup process looks like, because there may be something I'm leaving out (a rough consolidation of these steps into one script follows the list).

  1. Run gk-deploy --abort
  2. Remove the namespace from kubernetes
  3. Run the following command on all 4 nodes:
    sudo wipefs -a -f /dev/loop0 \
    ; sudo vgremove -ff $(sudo pvs | grep -o "vg_\w*") \
    ; sudo pvremove -ff /dev/loop0 \
    ; sudo rm -rf /var/lib/glusterd /var/lib/heketi /var/lib/misc/glusterfsd /etc/glusterfs /var/log/glusterfs

    These seem to get rid of all the files, volume groups, and physical volumes

  4. Reboot the nodes
  5. Remove the glusterfs.conf file I created in the /etc/modules-load.d/ directory to load dm_thin_pool, dm_mirror, and dm_snapshot at reboot.
  6. Then I delete the loop0 device using sudo losetup -d /dev/loop0
  7. Then delete our 10GiB glusterfs.img file we were using as our "block device" (yes this is hacky, we'll try to get some virtual block devices soon)
  8. Then I disable and remove the loop0.service which looks like:

    [Unit]
    Description=Create loop devices for glusterfs
    Before=kubelet.service
    
    [Service]
    ExecStart=/sbin/losetup /dev/loop0 /home/csse/glusterfs.img
    Type=oneshot
    
    [Install]
    WantedBy=local-fs.target
  9. Reboot the nodes again
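
Consolidated into one place, the per-node part of that teardown roughly looks like the script below. This is just a sketch of the steps above; the image path /home/csse/glusterfs.img and the loop0.service name are specific to our setup, and I'm assuming the unit file lives in /etc/systemd/system/.

#!/bin/bash
# rough consolidation of the teardown steps above -- run on each node
# (steps 1-2, gk-deploy --abort and deleting the namespace, happen once from the admin box)

# step 3: wipe the LVM/gluster state from the block device and remove gluster's files
sudo wipefs -a -f /dev/loop0
sudo vgremove -ff $(sudo pvs | grep -o "vg_\w*")
sudo pvremove -ff /dev/loop0
sudo rm -rf /var/lib/glusterd /var/lib/heketi /var/lib/misc/glusterfsd /etc/glusterfs /var/log/glusterfs

# step 5: stop loading the device-mapper modules at boot
sudo rm -f /etc/modules-load.d/glusterfs.conf

# steps 6-7: detach the loop device and delete the backing image
sudo losetup -d /dev/loop0
sudo rm -f /home/csse/glusterfs.img

# step 8: drop the unit that recreated the loop device at boot
sudo systemctl disable loop0.service
sudo rm -f /etc/systemd/system/loop0.service
sudo systemctl daemon-reload

# steps 4 and 9 (reboots) happen around these commands as described above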

My main questions are as follows:

In the meantime I'll keep trying to debug this, any help is much appreciated and I will do my best to quickly reply and provide additional information as needed.

Collin-Moore commented 5 years ago

Update: each node now has a 40GB raw device named sdb. I modified my topology to use that device and reinstalled. Now all four nodes show the Failed to initialize IB Device message from above. On two of the nodes, running lsblk gives the following output:

lsblk: dm-0: failed to get device path
lsblk: dm-1: failed to get device path
lsblk: dm-0: failed to get device path
lsblk: dm-1: failed to get device path
lsblk: dm-2: failed to get device path
lsblk: dm-3: failed to get device path
lsblk: dm-4: failed to get device path
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
fd0      2:0    1    4K  0 disk
sda      8:0    0   60G  0 disk
└─sda1   8:1    0   60G  0 part /
sdb      8:16   0   40G  0 disk
sr0     11:0    1 1024M  0 rom

These two display the partitions after a reboot; I'm not sure why.

One node did not have any partitions in the lsblk output, and the other had normal lsblk output like below:

NAME                                                                              MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
fd0                                                                                 2:0    1    4K  0 disk
sda                                                                                 8:0    0   60G  0 disk
└─sda1                                                                              8:1    0   60G  0 part /
sdb                                                                                 8:16   0   40G  0 disk
├─vg_523582f49c2350faec18aec9c70dbd7c-tp_f2d04a049a5145724b7b94bac0efd6c5_tmeta   253:0    0   12M  0 lvm
│ └─vg_523582f49c2350faec18aec9c70dbd7c-tp_f2d04a049a5145724b7b94bac0efd6c5-tpool 253:2    0    2G  0 lvm
│   ├─vg_523582f49c2350faec18aec9c70dbd7c-tp_f2d04a049a5145724b7b94bac0efd6c5     253:3    0    2G  0 lvm
│   └─vg_523582f49c2350faec18aec9c70dbd7c-brick_a517663655fb980df6c8ae55f0215a7f  253:4    0    2G  0 lvm
└─vg_523582f49c2350faec18aec9c70dbd7c-tp_f2d04a049a5145724b7b94bac0efd6c5_tdata   253:1    0    2G  0 lvm
  └─vg_523582f49c2350faec18aec9c70dbd7c-tp_f2d04a049a5145724b7b94bac0efd6c5-tpool 253:2    0    2G  0 lvm
    ├─vg_523582f49c2350faec18aec9c70dbd7c-tp_f2d04a049a5145724b7b94bac0efd6c5     253:3    0    2G  0 lvm
    └─vg_523582f49c2350faec18aec9c70dbd7c-brick_a517663655fb980df6c8ae55f0215a7f  253:4    0    2G  0 lvm
sr0                                                                                11:0    1 1024M  0 rom
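
For what it's worth, this is roughly how I've been poking at the device-mapper state on the two nodes that print the "failed to get device path" errors (dmsetup and udevadm are part of the stock CentOS install):

# device-mapper targets the kernel currently knows about
sudo dmsetup ls
sudo dmsetup info -c

# the /dev nodes lsblk is complaining about
ls -l /dev/dm-* /dev/mapper/

# ask udev to (re)create any missing block device nodes
sudo udevadm trigger --subsystem-match=block
sudo udevadm settle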

In light of these updates any ideas what is going on?

nixpanic commented 5 years ago

It sounds like the /dev entries are not in sync with what is available on the hosts. Can you make sure that you are using the most recent container images and that the daemonset for the glusterfs-server pods has a HOST_DEV_DIR bind-mount?
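
Something along these lines should show whether the bind-mount is there (the daemonset and namespace names may differ in your deployment):

# check that the glusterfs daemonset exposes the host's /dev to the pods
kubectl -n <namespace> get ds glusterfs -o yaml | grep -E -B2 -A4 'host-dev|HOST_DEV_DIR'

# and from inside one of the pods, the host device nodes should be visible
kubectl -n <namespace> exec <glusterfs-pod> -- ls -l /mnt/host-dev | head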

Collin-Moore commented 5 years ago

@nixpanic I was able to get the nodes re-imaged last Friday and I just reinstalled the cluster a few minutes ago. I'm getting the same error on all the nodes, though only one of them does not show the partitions in its lsblk output. Below are the image versions of all the containers running in the cluster (taken from kubectl), as well as part of my glusterfs-daemonset.yaml. Note that I added the NoSchedule toleration so that we can run on the master nodes. Could this be causing problems?

Image versions

4 coredns/coredns:1.2.6
2 gcr.io/google_containers/cluster-proportional-autoscaler-amd64:1.3.0
4 gcr.io/google-containers/kube-apiserver:v1.12.3
4 gcr.io/google-containers/kube-controller-manager:v1.12.3
8 gcr.io/google-containers/kube-proxy:v1.12.3
2 gcr.io/google_containers/kubernetes-dashboard-amd64:v1.10.0
4 gcr.io/google-containers/kube-scheduler:v1.12.3
8 gluster/gluster-centos:latest
2 heketi/heketi:dev
4 nginx:1.13
2 quay.io/calico/kube-controllers:v3.1.3
8 quay.io/calico/node:v3.1.3

glusterfs-daemonset.yaml

spec:
  template:
    metadata:
      name: glusterfs
      labels:
        glusterfs: pod
        glusterfs-node: pod
    spec:
      tolerations:
        - key: "node-role.kubernetes.io/master"
          operator: "Exists"
          effect: "NoSchedule"
      nodeSelector:
        storagenode: glusterfs
      hostNetwork: true
      containers:
      - image: gluster/gluster-centos:latest
        imagePullPolicy: IfNotPresent
        name: glusterfs
        env:
        # alternative for /dev volumeMount to enable access to *all* devices
        - name: HOST_DEV_DIR
          value: "/mnt/host-dev"
        # set GLUSTER_BLOCKD_STATUS_PROBE_ENABLE to "1" so the
        # readiness/liveness probe validate gluster-blockd as well
        - name: GLUSTER_BLOCKD_STATUS_PROBE_ENABLE
          value: "1"
nixpanic commented 5 years ago

This all looks pretty good to me. The Failed to initialize IB Device error should not be an issue: rdma is likely not available/configured, and tcp should just work.
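
For reference, that warning comes from glusterd trying to create an rdma listener in addition to the tcp one. On a stock install the management volume definition in /etc/glusterfs/glusterd.vol asks for both transports, so the warning is purely cosmetic; a quick way to confirm from inside one of the glusterfs pods (assuming the stock image layout):

# show which transports glusterd was asked to listen on
grep transport-type /etc/glusterfs/glusterd.vol
# a typical default is:
#   option transport-type socket,rdma
# without rdma-capable hardware the rdma listener fails and glusterd
# carries on with the tcp (socket) transport only, as the log shows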

Heketi creates the LVM structures on the disks when these are added through the topology.json or with heketi-cli device add ... Devices need to be empty when they are added, otherwise heketi will not use them (overwriting existing data is not nice).

You probably should inspect which devices heketi has configured. If some devices are missing, you might be able to find out from the logs why adding them failed.
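
A minimal sketch of how to do that from the cluster, assuming the usual gk-deploy layout where heketi runs in its own pod (add --user admin --secret <key> to the heketi-cli calls if you enabled auth):

# list the nodes, devices and bricks heketi knows about
kubectl -n <namespace> exec -it <heketi-pod> -- heketi-cli topology info

# details for a single device (ids come from the topology output)
kubectl -n <namespace> exec -it <heketi-pod> -- heketi-cli device info <device-id>

# the heketi server log usually says why a "device add" was rejected
kubectl -n <namespace> logs <heketi-pod>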

Collin-Moore commented 5 years ago

@nixpanic thanks for explaining the significance of those messages; it sounds like I may have been worried about the wrong thing. I took a look using heketi-cli and got the device output from each node, which all seem fine, apart from the one that doesn't seem to have had any bricks created on it.

Id:5c93098c942855162753f34d8ba3afc9   Name:/dev/sdb            State:online    Size (GiB):39      Used (GiB):2       Free (GiB):37      Bricks:1
Id:80ff81efa6991f2e76020b6addddc8d7   Name:/dev/sdb            State:online    Size (GiB):39      Used (GiB):2       Free (GiB):37      Bricks:1
Id:4eac023880d4fca9aead6e2be451da57   Name:/dev/sdb            State:online    Size (GiB):39      Used (GiB):2       Free (GiB):37      Bricks:1
Id:44851475ce9fd52e14d65b7b2d556a69   Name:/dev/sdb            State:online    Size (GiB):39      Used (GiB):0       Free (GiB):39      Bricks:0

However, I checked the topology info and noticed that the volume has this field: Replica: 3. This would explain why heketi has not created any bricks on one of my nodes. Is this something to worry about? I would like to have my volumes available on all 4 nodes. Or does there need to be an odd number so there is a clear majority if a network partition occurs?

Cluster Id: fb5a90fd1633816fca4038088912801b

    File:  true
    Block: true

    Volumes:

        Name: heketidbstorage
        Size: 2
        Id: b67b85c3b57642a101d87683969070e0
        Cluster Id: fb5a90fd1633816fca4038088912801b
        Mount: 137.112.89.104:heketidbstorage
        Mount Options: backup-volfile-servers=137.112.89.103,137.112.89.106,137.112.89.105
        Durability Type: replicate
        Replica: 3
        Snapshot: Disabled

                Bricks:
                        Id: 16231ad118a28d5361d16ca9daf1a66c
                        Path: /var/lib/heketi/mounts/vg_80ff81efa6991f2e76020b6addddc8d7/brick_16231ad118a28d5361d16ca9daf1a66c/brick
                        Size (GiB): 2
                        Node: 96551d24e0140b240d0c2ce6160e6230
                        Device: 80ff81efa6991f2e76020b6addddc8d7

                        Id: 89e5dd96f6bc4deae1e2f273f8baa918
                        Path: /var/lib/heketi/mounts/vg_4eac023880d4fca9aead6e2be451da57/brick_89e5dd96f6bc4deae1e2f273f8baa918/brick
                        Size (GiB): 2
                        Node: 362b33b567dd07f14bf6e16bb88e694b
                        Device: 4eac023880d4fca9aead6e2be451da57

                        Id: fb21d50cd95fb83e3db2f5460ef3628e
                        Path: /var/lib/heketi/mounts/vg_5c93098c942855162753f34d8ba3afc9/brick_fb21d50cd95fb83e3db2f5460ef3628e/brick
                        Size (GiB): 2
                        Node: 96ebc21c66d75c29fcf2ae02a865ffec
                        Device: 5c93098c942855162753f34d8ba3afc9
nixpanic commented 5 years ago

The recommendation is to have "replica 3" for volumes. That means the data of each volume will be replicated on three nodes. Not all volumes need to be on the same nodes; the set of nodes used can differ per volume.

The advantage of having four nodes is that even when a single node is unavailable, everything will continue to work: it stays possible to create new replica-3 volumes, and of course when one node is offline, two others still have the data, so the contents of the volumes can still be used.
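
If you ever want to control this per volume, heketi lets you pick the replica count at creation time; a rough example, run wherever heketi-cli can reach your heketi service (for dynamically provisioned volumes the StorageClass parameter volumetype: "replicate:3" expresses the same thing):

# create a 10 GiB volume explicitly replicated across three nodes
heketi-cli volume create --size=10 --replica=3

# list volumes and check where the bricks landed
heketi-cli volume list
heketi-cli volume info <volume-id>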

HTH, Niels

Collin-Moore commented 5 years ago

@nixpanic Thanks for all the help. Everything appears to be working after the re-image. Heketi gave me a persistent volume for MongoDB just like it's supposed to.
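
For anyone who finds this later, this is roughly the sanity check I used, assuming a StorageClass pointed at the heketi REST endpoint; the StorageClass and claim names here are just from my setup:

# create a small test claim against the gluster StorageClass
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gluster-test-claim
spec:
  storageClassName: glusterfs-storage
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 1Gi
EOF

# the claim should go Bound once heketi provisions the volume
kubectl get pvc gluster-test-claim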