openshift / origin

Conformance test suite for OpenShift
http://www.openshift.org
Apache License 2.0

Glusterfs with network encryption fails mounting when single node is down #12985

Closed · aliscott closed this issue 6 years ago

aliscott commented 7 years ago

Glusterfs volume fails to mount when client/server network encryption is enabled and a single gluster node is unavailable

Version
openshift v1.4.1
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0
Steps To Reproduce
  1. Create a persistent-volume-claim that uses glusterfs dynamic provisioning.
  2. Stop the volume, enable client.ssl and server.ssl, and restart the volume (see the sketch after this list).
  3. Mount the volume to a pod.
  4. Test reading and writing files to the volume. Everything works fine at this stage.
  5. Shut down a single glusterfs node.
  6. Test reading and writing files to the volume.
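
For context, steps 1–2 look roughly like the following. This is a minimal sketch, not the reporter's exact setup: the volume name `mypd` echoes the mount errors quoted below, but the storage class name and the gluster volume name are illustrative placeholders, and a glusterfs StorageClass is assumed to already exist.

```sh
# Rough sketch of steps 1-2; storage class and gluster volume names are
# illustrative placeholders.

# 1. Create a PVC backed by glusterfs dynamic provisioning.
oc create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mypd
  annotations:
    volume.beta.kubernetes.io/storage-class: glusterfs-storage
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF

# 2. On one of the gluster nodes: stop the provisioned volume, enable
#    client/server TLS, and start it again ("vol_xxxx" stands in for the
#    dynamically provisioned volume name).
gluster volume stop vol_xxxx
gluster volume set vol_xxxx client.ssl on
gluster volume set vol_xxxx server.ssl on
gluster volume start vol_xxxx
```
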
Expected Result

I should be able to mount the volume and it should still be readable and writeable, since 2/3 of the gluster nodes are still running.

Current Result

Neither reading nor writing to the volume works. When recreating the pod, the volume fails to mount:

8s  8s  1 {kubelet node-fbcb9413635c}   Warning  FailedMount Unable to mount volumes for pod "nginx_test1(867619c0-dcc3-11e6-9ab1-06ff73935133)": timeout expired waiting for volumes to attach/mount for pod "nginx"/"test1". list of unattached/unmounted volumes=[mypd]
8s  8s  1 {kubelet node-fbcb9413635c}   Warning  FailedSync Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "nginx"/"test1". list of unattached/unmounted volumes=[mypd]

The volume works again when I bring up the unavailable node or if I disable client and server SSL.

Additional Information

glusterd.log.txt

wenskys commented 7 years ago

I'm also having this issue.

aliscott commented 7 years ago

I'm also having this problem when only management encryption is enabled.

obnoxxx commented 7 years ago

looking...

@raghavendra-talur , @ramkrsna, @humblec, @MohamedAshiqrh

raghavendra-talur commented 7 years ago

@aliscott I am not sure if enabling only management encryption is supported. I refer to https://kshlm.in/post/network-encryption-in-glusterfs/ for Gluster SSL setup.
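
For reference, the management-encryption setup that post describes boils down to roughly the following (a minimal sketch; paths assume the default glusterd working directory and the standard `/etc/ssl/glusterfs.*` certificate locations):

```sh
# Minimal sketch of enabling management (glusterd) encryption, following the
# post linked above; run on every gluster server and every client node.

# TLS material is expected at:
#   /etc/ssl/glusterfs.key  - this node's private key
#   /etc/ssl/glusterfs.pem  - this node's certificate
#   /etc/ssl/glusterfs.ca   - concatenation of all trusted certificates

# The presence of this (empty) file tells glusterd and the clients to use
# TLS for management connections:
touch /var/lib/glusterd/secure-access

# Restart glusterd on the servers for the change to take effect.
systemctl restart glusterd
```
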

raghavendra-talur commented 7 years ago

@aliscott On the main issue:

  1. how did you generate certs?
  2. self signed or common CA
  3. Does this happen when any of the 3 nodes are down or is it only one special node?

obnoxxx commented 7 years ago

@aliscott any update on this?

(github should introduce a "needinfo" flag ...)

aliscott commented 7 years ago

Sorry, I missed this.

  1. how did you generate certs?

I followed the guide here: https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Administration_Guide/chap-Network_Encryption.html

  2. self signed or common CA

Self-signed

  3. Does this happen when any of the 3 nodes are down or is it only one special node?

Any of the nodes
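
For reference, the self-signed setup in that guide amounts to roughly the following on each gluster node (a minimal sketch; the CN, validity period, and the `node*.pem` file names are illustrative assumptions):

```sh
# Minimal sketch of the self-signed certificate setup from the guide linked
# above; run on each gluster node.

# Generate a private key for this node.
openssl genrsa -out /etc/ssl/glusterfs.key 2048

# Create a self-signed certificate for this node (CN is typically the
# hostname or address the other peers use to reach it).
openssl req -new -x509 -key /etc/ssl/glusterfs.key \
    -subj "/CN=$(hostname)" -days 365 -out /etc/ssl/glusterfs.pem

# Concatenate the glusterfs.pem files from all servers and clients into a
# shared CA bundle and distribute it to every node as /etc/ssl/glusterfs.ca.
cat node1.pem node2.pem node3.pem > /etc/ssl/glusterfs.ca
```
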

openshift-bot commented 6 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

alikhajeh1 commented 6 years ago

/remove-lifecycle stale

Any update on this?

JasonGiedymin commented 6 years ago

I'm going to link my update to an encryption-related ticket, only because there is so little documentation on in-transit and at-rest encryption: https://github.com/openshift/origin/issues/13013

openshift-bot commented 6 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 6 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot commented 6 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci-robot commented 6 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/origin/issues/12985#issuecomment-425861705):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting `/reopen`.
> Mark the issue as fresh by commenting `/remove-lifecycle rotten`.
> Exclude this issue from closing again by commenting `/lifecycle frozen`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.