openshift / openshift-ansible

Install and config an OpenShift 3.x cluster
https://try.openshift.com
Apache License 2.0

Glusterfs installation breaks the playbook #8476

Closed megastallman closed 6 years ago

megastallman commented 6 years ago

Description

I'm using the containerized openshift-ansible installer. It fails during the containerized GlusterFS and Heketi installation.

Version

The container versions are:

docker.io/openshift/origin-ansible:v3.9.19
and
docker.io/openshift/origin-ansible:v3.9

Steps To Reproduce

docker run -t -u `id -u` -v $PWD/$KEYFILE:/opt/app-root/src/.ssh/id_rsa:Z -v $PWD/$INVFILE:/tmp/inventory:Z -e INVENTORY_FILE=/tmp/inventory  -e PLAYBOOK_FILE=playbooks/openshift-glusterfs/config.yml -e OPTS="-v" docker.io/openshift/origin-ansible:v3.9.19

Rerunning the whole playbooks/deploy_cluster.yml playbook gives the same result.

Expected Results

I expect the playbooks to run to completion, which is what happens without GlusterFS. I had already been using this OpenShift Origin cluster before trying GlusterFS. I'm experimenting with OpenShift installations and uninstallations (with playbooks/adhoc/uninstall.yml).

Observed Results

TASK [openshift_storage_glusterfs : Wait for deploy-heketi pod] ********************************
Tuesday 22 May 2018  14:00:54 +0000 (0:00:00.247)       0:01:23.363 *********** 
FAILED - RETRYING: Wait for deploy-heketi pod (30 retries left).
FAILED - RETRYING: Wait for deploy-heketi pod (29 retries left).

Then I checked the logs of the failing deployer pod:

# oc --config /etc/origin/master/admin.kubeconfig -n glusterfs logs deploy-heketi-storage-1-deploy
/usr/bin/openshift-deploy: error while loading shared libraries: libpthread.so.0: cannot open shared object file: Permission denied

Additional Information

cat /etc/redhat-release

Red Hat Enterprise Linux Server release 7.5 (Maipo)

My inventory snippet:

[masters]
k8-101.ololo.com
k8-102.ololo.com
k8-103.ololo.com

[nodes]
k8-101.ololo.com openshift_node_labels="{'region': 'infra', 'node-role.kubernetes.io/compute': 'true'}" openshift_schedulable=True
k8-102.ololo.com openshift_node_labels="{'region': 'infra', 'node-role.kubernetes.io/compute': 'true'}" openshift_schedulable=True
k8-103.ololo.com openshift_node_labels="{'region': 'infra', 'node-role.kubernetes.io/compute': 'true'}" openshift_schedulable=True
k8-104.ololo.com openshift_schedulable=True

[etcd]
k8-101.ololo.com
k8-102.ololo.com
k8-103.ololo.com

[lb]
k8-lb-101.ololo.com

[glusterfs]
k8-101.ololo.com glusterfs_devices='[ "/dev/sda4" ]'
k8-102.ololo.com glusterfs_devices='[ "/dev/sda4" ]'
k8-103.ololo.com glusterfs_devices='[ "/dev/sda4" ]'
k8-104.ololo.com glusterfs_devices='[ "/dev/sda4" ]'

# Create an OSEv3 group that contains the masters and nodes groups
[OSEv3:children]
masters
nodes
etcd
lb
glusterfs

...

michaelgugino commented 6 years ago

/assign @jarrpa

jarrpa commented 6 years ago

Sorry for the delay, this fell off my radar. I have never seen this particular error before. Is this still a reproducible issue? If so, are you able to deploy any other hosted components like the registry, logging, or metrics, or any other app in general?

megastallman commented 6 years ago

Yes, all other apps get deployed. We are using this cluster in production now. So only the glusterfs playbooks break everything.

jarrpa commented 6 years ago

...this is so weird. If it's consistently the same error, I'm not sure where to begin debugging this. The thing that's failing is the deployer pod (not deploy-heketi-X but deploy-heketi-X-deploy), which should be no different for heketi than anything else.

If you can, please reproduce the problem and report the following:

  1. Are the GlusterFS pods running and ready?
  2. What is the output of oc describe <failing_pod>?
  3. Can you find anything in the system logs of the host that the deploy pod was scheduled on? My apologies but I don't know the exact services you'd want to look for. I'd just start with journalctl -xe.
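
The three checks above could be gathered roughly as follows. This is only a sketch: the function names are hypothetical, the kubeconfig path and the glusterfs namespace are taken from the original report, and the deployer pod name is assumed to follow the deploy-heketi-X-deploy pattern described above.

```shell
# Hypothetical helper: build the deployer pod name from a deployment config
# name and a deployment number, e.g. deploy-heketi-storage and 1.
deployer_pod_name() {
    printf '%s-%s-deploy\n' "$1" "$2"
}

# Gather the three requested items for one failing pod.
# $1 = pod name, $2 = namespace (defaults to the one used in the report).
collect_diagnostics() {
    pod="$1"
    ns="${2:-glusterfs}"
    # KUBECONFIG_PATH is an illustrative variable, not an oc setting.
    cfg="${KUBECONFIG_PATH:-/etc/origin/master/admin.kubeconfig}"

    oc --config "$cfg" -n "$ns" get pods -o wide       # 1. are the GlusterFS pods running and ready?
    oc --config "$cfg" -n "$ns" describe pod "$pod"    # 2. events and status of the failing pod
    journalctl -xe --no-pager                          # 3. system logs of the local host only
}
```

For example, `collect_diagnostics "$(deployer_pod_name deploy-heketi-storage 1)"` would target the pod from the original report. The journalctl call only sees the machine it runs on, so it would need to be run on the node the deployer pod was scheduled to.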

megastallman commented 6 years ago

Thanks to RHEL support I've resolved this issue:

  1. Set SELinux to Enforcing and Targeted.
  2. Run "restorecon -R -v /".
  3. Reboot your nodes.
  4. Add openshift_storage_glusterfs_heketi_admin_key="ThisIsAWorkaround" and openshift_storage_glusterfs_heketi_user_key="ForAWrongAnsibleVersion" to your inventory.

This black magic actually lets you install GlusterFS.
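
For anyone repeating steps 1 to 3 on several nodes, a minimal sketch might look like the following. The function names are hypothetical, and it assumes the SELinux mode and policy are set through the standard RHEL file /etc/selinux/config, which the report above does not state explicitly. Nothing runs until the functions are called as root.

```shell
# Step 1 check: return 0 only if the given SELinux config file requests
# enforcing mode with the targeted policy.
# $1 = config file path (assumed default: the standard RHEL location).
selinux_config_ok() {
    cfg="${1:-/etc/selinux/config}"
    grep -Eq '^SELINUX=enforcing[[:space:]]*$' "$cfg" &&
        grep -Eq '^SELINUXTYPE=targeted[[:space:]]*$' "$cfg"
}

# Steps 2 and 3: restore file contexts across the whole filesystem, then
# reboot. The reboot is skipped if restorecon exits with an error.
relabel_and_reboot() {
    restorecon -R -v / && reboot
}
```

On each node this could be used as `selinux_config_ok && relabel_and_reboot`, after editing the config file by hand if the check fails. Step 4 remains a manual inventory change.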