jorgeprcn closed this issue 7 years ago.
Will be continuing the discussion from #5197 here.
I'm having this problem with Origin 3.6.0 and Fedora 26 Atomic Host.
@jorgefromhell @dustymabe Can either of you check whether there are any log files under `/var/lib/kubelet` that end with `-glusterfs.log`? ...or anywhere on the nodes with the failing pods, really? ;)
If the logs aren't there, they're probably in the journal for the node service on the host where the pod is attempting to run:

```
journalctl --no-pager -u origin-node
```
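If it helps, a one-liner to sweep a node for those per-pod GlusterFS log files might look like this (a sketch; `/var/lib/kubelet` is assumed to be the default kubelet root directory):

```shell
# Look for per-pod GlusterFS mount logs under the default kubelet root.
# stderr is silenced in case some pod directories are unreadable.
find /var/lib/kubelet -name '*-glusterfs.log' 2>/dev/null
```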
@jarrpa @sdodson nothing different from the logs I've posted before. When my deploy starts, I start a tail on the origin-node logs and the only thing I get is this message, repeatedly:
```
Aug 17 11:54:39 app-dc-prd-osnode02 journal: I0817 11:54:39.160860 6238 glusterfs.go:148] glusterfs: endpoints &Endpoints{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:glusterfs-dynamic-postgresql,GenerateName:,Namespace:prd-teste3,SelfLink:/api/v1/namespaces/prd-teste3/endpoints/glusterfs-dynamic-postgresql,UID:654e9cff-8352-11e7-b2df-00505680c54c,ResourceVersion:390646,Generation:0,CreationTimestamp:2017-08-17 10:45:56 -0300 -03,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{gluster.kubernetes.io/provisioned-for-pvc: postgresql,},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,},Subsets:[{[{10.20.0.50 <nil> nil} {10.20.0.51 <nil> nil} {10.20.0.52 <nil> nil}] [] [{ 1 TCP}]}],}
```
I'm only getting this error when using `containerized=true` in my inventory. As a workaround for now, I'm using the rpm version.
Thanks!
@jorgefromhell All right, I'm not too familiar with the GlusterFS provisioner in Kube. I'll need some time to pull in others or go diving myself. :) Note that currently I'm traveling out at a conference. Ping me in a week or so if I don't reply before then.
I have yet to read both threads in detail, but to summarize: the PVC is created successfully in all cases, however the mount is failing. The application pods are running on either Fedora 26 or CentOS 7.3 (with containerized installation), and that is where we see the mount failure. RHEL works perfectly; is RHEL also containerized? Also, has anyone had a chance to try an rpm-based CentOS or Fedora installation? If yes, did it work? For further isolation, can we try manually mounting the share on one of the problematic nodes (where the app pod is expected to run or tried to run) by running `mount -t glusterfs <IP>:/sharename /<somepath>`? Is this working? Here `<IP>` could be any of the IPs from the endpoint, i.e. either 10.20.0.50 or 10.20.0.51.
@humblec Currently the issue comes in when OpenShift is deployed "containerized", that is, the OpenShift components run in containers rather than from RPMs. Containerized mode can be used on RPM-based distros as well as Atomic. This and another thread have verified that non-containerized installation works just fine, so we're looking at scenarios where `containerized=true` is set.
We were also wondering if this may have something to do with: https://github.com/openshift/origin/issues/15950
I'm willing to bet that a manual mount of a GlusterFS volume on the host will work. @jorgefromhell, can you verify?
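For reference, such a manual mount check might look like this (a sketch; the IP comes from the endpoints in the log above, while the volume name and mount point are placeholders):

```shell
mkdir -p /mnt/glustertest
# Mount the volume directly with the FUSE client, bypassing Kubernetes.
mount -t glusterfs 10.20.0.50:/vol_example /mnt/glustertest
# Confirm reads and writes work, then clean up.
touch /mnt/glustertest/write-test && rm /mnt/glustertest/write-test
umount /mnt/glustertest
```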
@jarrpa , yes, mounting a GlusterFS volume on the host itself works just fine.
Thanks @jorgefromhell. @humblec, any insights on what the error in https://github.com/openshift/openshift-ansible/issues/5118#issuecomment-328538148 could mean?
@jorgefromhell This mostly means the bind mount into the container goes wrong when we are in `containerized=true` mode. To make sure that's what is happening: when the PVC is used and it tries to bring up the app pod, can you watch the host and see whether there is a proper glusterfs mount?
@jarrpa That's not an error, it's just an informational message. I believe something goes wrong in the bind mount due to mount propagation.
@jorgefromhell @dustymabe I have a potential workaround. After you dynamically provision a GlusterFS volume, can you try deleting the dynamically created Endpoints, then manually creating an Endpoints object with the IP addresses of your GlusterFS nodes, and see if the pods will come up?
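For illustration, a manually created Endpoints/Service pair might look roughly like this (a sketch built from the namespace, name, and IPs in the log message earlier in the thread; the port value is arbitrary, since Gluster ignores it, but Kubernetes requires one):

```yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: glusterfs-dynamic-postgresql
  namespace: prd-teste3
subsets:
  - addresses:
      - ip: 10.20.0.50
      - ip: 10.20.0.51
      - ip: 10.20.0.52
    ports:
      - port: 1
---
apiVersion: v1
kind: Service
metadata:
  name: glusterfs-dynamic-postgresql
  namespace: prd-teste3
spec:
  ports:
    - port: 1
```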
Hi @jarrpa, tried that too, same error.
@jorgefromhell Damn, back to the thinktank...
@jorgefromhell Can you try @humblec's suggestion? When a pod that's trying to use a GlusterFS volume is coming up, see if there is a GlusterFS volume being mounted on the host. Something like `watch -n1 "mount | grep gluster"` on the node running the pod should suffice.
@jarrpa Tried that too. I didn't use `watch` on my end, but I monitored it while that informational message was being displayed, and I didn't catch anything different happening. If you think there's something to it, I'll deploy another containerized cluster; currently I'm using the rpm version for our apps here.
@jorgefromhell Thanks.
@humblec Looks like it's not even mounting on the host. Any further ideas?
Same issue here, I run 6 CentOS Atomic host, 3 run GlusterFs ; I can provide logs and confs if needed.
Had the same issue on a newly installed 12-machine atomic cluster. The PVC was created, but the mount failed. I upgraded the atomic hosts:

```
$ atomic host status
State: idle
Deployments:
● rhel-atomic-host-ostree:rhel-atomic-host/7/x86_64/standard
        Version: 7.4.1 (2017-08-30 19:29:56)
         Commit: e83c16780259c5272684221e2a6007300d94bbfdc5432f9ab6025300f447145b

  rhel-atomic-host-ostree:rhel-atomic-host/7/x86_64/standard
        Version: 7.3.3 (2017-02-27 16:31:38)
         Commit: bfc591ba1a4395c6b8e54d34964b05df4a61e0d82d20cc1a2fd817855c7e2da5
```

So there was a problem in 7.3.x, but it seems to be fixed in 7.4.x.
Our 3.6 cluster also encountered this problem. It happens that the versions of the glusterfs client packages installed on the cluster nodes vary: some are 3.8.4, some are 3.10.5. Mounts only fail on the nodes with 3.10.5 installed.
@yaxinlx Can you provide more environment information? Were you also running containerized?
Yes, containerized cluster. Redhat 7.2 and Heketi 4.0. The version of glusterfs servers is 3.10.5.
@yaxinlx So the versions of glusterfs-fuse on the mounting nodes varied between 3.8.4 and 3.10.5, but all the glusterfs pods were 3.10.5? Can you verify that if you downgrade the client nodes running 3.10 to 3.8 that they start being able to mount?
> but all the glusterfs pods were 3.10.5

No, the glusterfs servers are standalone, not containerized. But their versions are 3.10.5 for sure.
My last comment meant that the openshift cluster is containerized (in fact, my impression about this may be wrong).
> Can you verify that if you downgrade the client nodes running 3.10 to 3.8 that they start being able to mount?

We verified it is true.
@jorgefromhell @fabiomartinelli @rabem00 Does anyone have the ability to test and see if a 3.8 version of the glusterfs FUSE client resolves the issue?
@dustymabe As well
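On an RPM-based (non-atomic) node, such a test might look like the following (a sketch; the exact 3.8 build depends on what your repos provide):

```
# Check which glusterfs client packages are currently installed.
rpm -qa 'glusterfs*'
# Downgrade the client-side packages; a specific version can be pinned,
# e.g. "yum downgrade glusterfs-fuse-3.8.4" (version shown is an example).
yum downgrade glusterfs glusterfs-fuse glusterfs-libs glusterfs-client-xlators
```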
@jarrpa I'm sorry, some info I provided above is not accurate. The openshift cluster is installed on centos 7.2. Now there is no problem to mount glusterfs volume in this cluster.
But volumes still fail to mount on another redhat 7.2 openshift cluster.
@yaxinlx A few questions: is the `fuse` kernel module loaded?

Good news again.
We just found that the glusterfs client version on the redhat cluster is a little different from the version on the centos cluster. We reinstalled the glusterfs client on the redhat cluster to make it consistent with the centos cluster, and then, bang!, volumes also mount in the redhat cluster. :)
The detailed rpm files we use now are:
glusterfs-3.8.4-18.4.el7.centos.x86_64.rpm
glusterfs-fuse-3.8.4-18.4.el7.centos.x86_64.rpm
glusterfs-client-xlators-3.8.4-18.4.el7.centos.x86_64.rpm
glusterfs-libs-3.8.4-18.4.el7.centos.x86_64.rpm
The version which does not work:
glusterfs-libs-3.8.4-1.el7.x86_64
glusterfs-3.8.4-1.el7.x86_64
glusterfs-fuse-3.8.4-1.el7.x86_64
glusterfs-client-xlators-3.8.4-1.el7.x86_64
btw, we run non-atomic OSes.
@yaxinlx Awesome! Hopefully this means we're getting closer to a solution. :) Can you report exactly which RPM versions of GlusterFS 3.10 were failing as well?
@yaxinlx Also, I have no experience setting up containerized OpenShift. Would you be willing to help me troubleshoot this further? I'd like to see if downgrading GlusterFS on @jorgefromhell's setup (CentOS 7.3.1611) resolves the issue. For that, I'd need either you to do it, or you to show me how to configure openshift-ansible to install containerized OpenShift. I've tried before, but just setting `containerized=true` leads to problems. :)
We can't find the exact version of the old 3.10.5 client now. We installed them with yum from the official repo for the centos cluster. Our redhat cluster is built in a closed environment.
I will get some scripts for installing containerized openshift from my colleagues and email them to you after a while.
@jarrpa My colleagues think the script is simple enough that it can be pasted here.
The `hosts` file:
```ini
[OSEv3:children]
masters
etcd
nodes

[OSEv3:vars]
ansible_ssh_user=inst
ansible_become=true
openshift_deployment_type=origin
deployment_type=origin
osn_storage_plugin_deps=[]
osm_image=10.162.148.165:5000/openshift/origin
osn_image=10.162.148.165:5000/openshift/node
osn_ovs_image=10.162.148.165:5000/openshift/openvswitch
openshift_image_tag=v3.6.0
openshift_version=v3.6.0
openshift_release=v3.6.0
#containerized=True
#is_containerized=True
oreg_url=10.162.148.165:5000/openshift/origin-${component}:${version}
cli_docker_additional_registries=10.162.148.165:5000
cli_docker_insecure_registries=10.162.148.165:5000
openshift_master_api_port=8443
openshift_master_console_port=8443
osm_cluster_network_cidr=172.26.0.0/16
openshift_portal_net=172.25.0.0/16
osm_host_subnet_length=8
enable_excluders=false
osm_etcd_image=registry.access.redhat.com/rhel7/etcd
#openshift_master_cluster_method=native
openshift_master_cluster_hostname=10.162.148.153
openshift_master_cluster_public_hostname=10.162.148.153
openshift_disable_check=memory_availability,disk_availability,docker_storage,docker_image_availability,package_version,package_availability,package_update

# host group for masters
[masters]
10.162.148.153 containerized=true openshift_hostname=10.162.148.153 openshift_ip=10.162.148.153

[etcd]
10.162.148.154 containerized=true openshift_hostname=10.162.148.154 openshift_ip=10.162.148.154

# host group for nodes, includes region info
[nodes]
10.162.148.159 openshift_node_labels="{'region': 'infra', 'zone': 'default'}" containerized=true openshift_ip=10.162.148.159 openshift_hostname=10.162.148.159
10.162.148.160 openshift_node_labels="{'region': 'primary', 'zone': 'default'}" containerized=true openshift_ip=10.162.148.160 openshift_hostname=10.162.148.160
```
```
ansible-playbook -i PATH/TO/hosts PATH/TO/openshift-ansible/playbooks/byo/config.yml
```
@yaxinlx Thanks!
@jarrpa - I was on vacation, so i didn't see your request you made 7 days ago. Did you find what the problem is for this issue? Do you need me to test something? Let me know.
@rabem00 No problem! Can you see if going @yaxinlx's route resolves the problem you were having? They downgraded to an earlier glusterfs-fuse package version on all OpenShift nodes, from CentOS specifically.
@jarrpa Just installed a new cluster using the following specs:
```
15m  15m  1    logging-kibana-1-deploy  Pod                    Warning  FailedMount    kubelet, openshift04.isl.belastingdienst.nl  MountVolume.SetUp failed for volume "kubernetes.io/secret/90ea35eb-ac1c-11e7-8506-005056b00580-deployer-token-zdsj3" (spec.Name: "deployer-token-zdsj3") pod "90ea35eb-ac1c-11e7-8506-005056b00580" (UID: "90ea35eb-ac1c-11e7-8506-005056b00580") with: secret "logging"/"deployer-token-zdsj3" not registered
2m   18m  182  logging-es-0             PersistentVolumeClaim  Normal   FailedBinding  persistentvolume-controller                  no persistent volumes available for this claim and no storage class is set
3m   53m  202  metrics-cassandra-1      PersistentVolumeClaim  Normal   FailedBinding  persistentvolume-controller                  no persistent volumes available for this claim and no storage class is set
```
```
$ oc get pvc
NAME                  STATUS    VOLUME    CAPACITY   ACCESSMODES   STORAGECLASS   AGE
metrics-cassandra-1   Pending                                                     54m
metrics-cassandra-2   Pending                                                     54m
metrics-cassandra-3   Pending                                                     54m

$ oc get pvc
NAME           STATUS    VOLUME    CAPACITY   ACCESSMODES   STORAGECLASS   AGE
logging-es-0   Pending                                                     55m
logging-es-1   Pending                                                     55m
logging-es-2   Pending                                                     54m

$ oc get pvc
NAME             STATUS   VOLUME            CAPACITY   ACCESSMODES   STORAGECLASS   AGE
registry-claim   Bound    registry-volume   500Gi      RWX                          1h
```
@jarrpa I created a new PVC from the OpenShift web console and that one gets bound as long as I use the storageclass I created. If I try to create one without a storageclass selected, then I see in the event log the message `no persistent volumes available for this claim and no storage class is set` (and the request stays pending). This is the same message I get with a fresh install (see previous post; metrics and logging fail with the same message).
I think this is strange, because I don't see anything in the documentation saying it is mandatory to set a storageclass and/or to make a default storageclass. Also, the registry has no storageclass set and that one mounts fine. The other services are complaining that there is no storageclass (but nothing in the metrics or logging documentation mentions setting a storageclass option in the inventory file).
I leave the cluster as is, so if you need me to try something (or have questions about the setup) let me know.
@rabem00 First, it sounds like the downgrade resolved the original Issue for you? Please confirm.
Second, if you are not using a StorageClass, you need to manually create a GlusterFS volume using `heketi-cli`, then also create a PersistentVolume in Kubernetes pointing to that GlusterFS volume. THEN you create a PVC with attributes matching the desired PV, and it should bind to that PV.
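As a sketch of that flow: first create the volume with something like `heketi-cli volume create --size=10` and note the volume name it reports, then define a PV pointing at it (the names below are hypothetical, and `glusterfs-cluster` must be an existing Endpoints object listing your Gluster node IPs):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: gluster-manual-pv         # hypothetical name
spec:
  capacity:
    storage: 10Gi                 # match the size given to heketi-cli
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  glusterfs:
    endpoints: glusterfs-cluster  # Endpoints object with the Gluster node IPs
    path: vol_abc123              # volume name reported by heketi-cli
```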
@rabem00 Also, fixed your formatting. :)
@jarrpa Let's forget the Storageclass issue for now, otherwise we are mixing problems (although they could be related, I don't know).
I had to look up what you meant with "downgrade", because I did not downgrade anything. I'm not in the situation @yaxinlx is in: he's using non-atomic hosts, where they can change rpm versions, etc. We are using atomic hosts and a containerized installation, and a couple of weeks ago we ran into the same problem as the initial issue from @jorgefromhell.
What I did was a clean install with the latest atomic version, openshift-ansible-3.6.173.0.45-1, and the openshift-origin v3.6.0 setting in the inventory file. The gluster containers are using glusterfs version 3.10.2. After the clean install the registry mounts perfectly, but metrics and logging do not. They fail with `no persistent volumes available for this claim and no storage class is set`. This is strange, because they should also work out of the box.
I will try the steps @jorgefromhell mentioned under "Steps To Reproduce". Will let you know the results.
@rabem00 Read what I said: If you are not using a StorageClass you need to manually provision your GlusterFS volumes BEFORE creating the PVCs. Further, using GlusterFS for logging and metrics is currently not supported.
At this point, I will ask you to open a new Issue. You are obviously not hitting the problem encountered in this Issue. If you can manage to replicate it and find a workaround, please let us know.
@jarrpa As I said, we were hitting the same issue a couple of weeks ago (that's why I was in this thread). The message was the same as @jorgefromhell's:

```
glusterfs: could not open log file for pod
```

But after the fresh install, this message can't be recreated with deployments. Will create a new issue for the storage class problem. Thx
@rabem00 Yeah, I'm not saying you weren't hitting this issue, you certainly had valid reason to chime in here. :) Looking forward to your new issue!
@jorgefromhell If you're still around, let us know if you want to continue experimenting with this or not. I'll close this issue after a week of inactivity.
OK, so I can confirm it seems like an issue with the version of gluster on the system. I got this working fine with Fedora 27 Atomic Host, where Fedora 26 Atomic Host was having the

```
glusterfs: mount failed: exit status 1
the following error information was pulled from the glusterfs log to help diagnose this issue: glusterfs: could not open log file for pod
```

issues.
Not working f26 versions:
glusterfs-3.10.6-3.fc26.x86_64
glusterfs-client-xlators-3.10.6-3.fc26.x86_64
glusterfs-fuse-3.10.6-3.fc26.x86_64
glusterfs-libs-3.10.6-3.fc26.x86_64
Working f27 versions:
glusterfs-3.12.1-2.fc27.x86_64
glusterfs-client-xlators-3.12.1-2.fc27.x86_64
glusterfs-fuse-3.12.1-2.fc27.x86_64
glusterfs-libs-3.12.1-2.fc27.x86_64
Hmm... same thing on CentOS but it was on 3.8 versions... ugh. All right, well, this thread has gotten long enough and the OP has gone missing. :) If you want to pursue this further please open a new Issue. This should probably be or end up being a GlusterFS issue or a gluster-containers issue.
Description
I've been using openshift-ansible and have 1.5.0 and 1.5.1 clusters fully working with glusterfs/heketi dynamic provisioning, but after updating to v3.6.0 (tried with a new installation as well), glusterfs volumes cannot be mounted during deployment. The only thing that changed was the openshift cluster itself (v3.6.0); the gluster/heketi cluster is the same one we used before with the 1.5.0 and 1.5.1 clusters.
Version
Steps To Reproduce
Expected Results
Pod fully functional with a gluster volume mapped to it.
Observed Results
Pods get stuck trying to mount the gluster volume. I haven't been able to pinpoint the exact error message; even after changing the log level in origin-node, origin-master-api, and origin-master-controller to 6, all I get are these messages:
At the node, I've been getting these messages (a lot of them, up until the pod fails):
Additional Information
For the same deploymentconfig, if I remove the volume, it works as expected.
For all hosts I'm using CentOS Linux release 7.3.1611 (Core)