gluster / gluster-kubernetes

GlusterFS Native Storage Service for Kubernetes

Behaviour with a failed node #251

Open timbrd opened 7 years ago

timbrd commented 7 years ago

I have a 3-node OpenShift cluster and have installed gluster-kubernetes on it. It works flawlessly unless a node fails. I shut down one of the 3 machines, but OpenShift / Kubernetes does not notice that one of the gluster pods is no longer reachable. The existing volumes are still mounted in the pods and can be accessed without problems; however, when a new PVC is created, heketi tries to place a brick on the failed node. This obviously does not work, so heketi cancels the creation of the new volume.

Is this reasonable behaviour for heketi? IMO, since 2 of the 3 gluster pods are still active, it should still be able to create a new volume.

jarrpa commented 7 years ago

There is an ongoing discussion on the heketi project about how it should behave when one or more nodes are in a failure state. So yes, you've hit a hot-button issue. :) I'll let @raghavendra-talur or @MohamedAshiqrh comment on this one further.

MohamedAshiqrh commented 7 years ago

@timbrd Hi, for now every volume created through a PVC is a replica-3 volume. In kube 1.6 and origin 1.6 we have added the volume type option, with which you can specify a replica-2 volume, which only needs two nodes to be up and running. A sketch of such a StorageClass follows below.
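
Roughly, a StorageClass requesting replica-2 volumes would look something like the sketch below (assuming the volumetype parameter of the kubernetes.io/glusterfs provisioner; the resturl and secret values are placeholders for your own heketi endpoint and credentials):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: glusterfs-replica2
provisioner: kubernetes.io/glusterfs
parameters:
  resturl: "http://heketi.example.com:8080"   # placeholder: your heketi service URL
  restuser: "admin"                           # placeholder: heketi admin user
  secretName: "heketi-secret"                 # placeholder: secret holding the heketi admin key
  secretNamespace: "default"
  volumetype: "replicate:2"                   # two-way replica, needs only two healthy nodes

PVCs that reference this StorageClass should then provision volumes that survive with two gluster nodes available.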

XericZephyr commented 6 years ago

I have encountered the same issue.

Is there any workaround to temporarily remove the failed mount and start over?

For example, I would like to remove all of the LVM volumes on the device sdb shown in the lsblk output below and start over; see the cleanup sketch after the listing.

$ lsblk
NAME                                                                              MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sdb                                                                                 8:16   0   3.7T  0 disk 
├─vg_54fd064328efff4c9addedcc02ddad63-tp_5abf7b2064014b8dbadadc500b76fb88_tmeta   252:0    0    12M  0 lvm  
│ └─vg_54fd064328efff4c9addedcc02ddad63-tp_5abf7b2064014b8dbadadc500b76fb88-tpool 252:2    0     2G  0 lvm  
│   ├─vg_54fd064328efff4c9addedcc02ddad63-tp_5abf7b2064014b8dbadadc500b76fb88     252:3    0     2G  0 lvm  
│   └─vg_54fd064328efff4c9addedcc02ddad63-brick_5abf7b2064014b8dbadadc500b76fb88  252:4    0     2G  0 lvm  
└─vg_54fd064328efff4c9addedcc02ddad63-tp_5abf7b2064014b8dbadadc500b76fb88_tdata   252:1    0     2G  0 lvm  
  └─vg_54fd064328efff4c9addedcc02ddad63-tp_5abf7b2064014b8dbadadc500b76fb88-tpool 252:2    0     2G  0 lvm  
    ├─vg_54fd064328efff4c9addedcc02ddad63-tp_5abf7b2064014b8dbadadc500b76fb88     252:3    0     2G  0 lvm  
    └─vg_54fd064328efff4c9addedcc02ddad63-brick_5abf7b2064014b8dbadadc500b76fb88  252:4    0     2G  0 lvm
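
If the bricks on this device are disposable, a rough cleanup sketch would be the following (DESTRUCTIVE, and assuming the device has already been removed from heketi's topology; the VG name is taken from the lsblk output above):

$ lvremove -y vg_54fd064328efff4c9addedcc02ddad63   # removes every thin pool and brick LV in the heketi-created VG
$ vgremove vg_54fd064328efff4c9addedcc02ddad63      # remove the now-empty volume group
$ pvremove /dev/sdb                                 # drop the LVM physical volume label from the disk
$ wipefs -a /dev/sdb                                # clear leftover signatures so the device can be re-added to heketi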