openshift / openshift-ansible

Install and config an OpenShift 3.x cluster
https://try.openshift.com
Apache License 2.0

cluster install fails on GlusterFs stage (3.7) #7777

Closed · ahmadou closed this issue 6 years ago

ahmadou commented 6 years ago

Description

On a new install of a multi-master setup with GlusterFS storage, the install fails at the "Wait for heketi Pod" task.

Sometimes it gets stuck in the image pull phase, and sometimes the heketi pod is stuck in a crash loop.
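A minimal sketch of commands that can be used to inspect the stuck pod while the playbook waits (assuming the glusterfs namespace configured in the inventory below; the heketi pod name is a placeholder to be taken from the oc get output):

  # Check whether the heketi pod is Pending, in ImagePullBackOff or in CrashLoopBackOff
  oc get pods -n glusterfs -o wide

  # Events usually reveal image pull or scheduling problems
  oc describe pod <heketi-pod> -n glusterfs

  # Container logs for the crash-loop case
  oc logs <heketi-pod> -n glusterfs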

Version


  ansible 2.5.0
    config file = /home/ansibleuser/openshift-ansible/ansible.cfg
    configured module search path = [u'/home/ansibleuser/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
    ansible python module location = /usr/lib/python2.7/site-packages/ansible
    executable location = /usr/bin/ansible
    python version = 2.7.5 (default, Aug 4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]


Steps To Reproduce
  1. Launch the 3.7 playbook: ansible-playbook -i ./hosts/cluster-installation playbooks/byo/openshift-glusterfs/config.yml
Expected Results

Cluster up and running with GlusterFS configured.

Observed Results


Here are the logs of the heketi-storage container:

  Setting up heketi database
  No database file found
  Database volume found: 10.39.57.31:heketidbstorage on /var/lib/heketi type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
  Database file is expected, waiting...
  Database file did not appear, exiting.
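The log above shows the heketidbstorage volume mounting but the database file never appearing. A quick manual check of whether a node can actually reach the Gluster volume looks roughly like this (a diagnostic sketch only, not part of the playbook; the mount point is arbitrary and glusterfs-fuse must be installed on the node):

  # On the node running the heketi pod, try a manual FUSE mount of the heketi DB volume
  mkdir -p /mnt/heketidb-test
  mount -t glusterfs 10.39.57.31:heketidbstorage /mnt/heketidb-test
  ls -l /mnt/heketidb-test
  umount /mnt/heketidb-test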
Additional Information

CentOS Linux release 7.4.1708

My config file

  # Global cluster configuration
  [OSEv3:children]
  masters
  etcd
  nodes
  glusterfs
  glusterfs_registry

  # GLOBAL CLUSTER VARIABLES
  [OSEv3:vars]

  # etcd
  openshift_use_etcd_system_container=True

  # ansible
  ansible_ssh_user=ansibleuser
  ansible_become=true
  ansible_service_broker_image_prefix=openshift/
  ansible_service_broker_registry_url="registry.access.redhat.com"

  # disk checks
  openshift_check_min_host_disk_gb=13

  # firewall
  os_firewall_use_firewalld=True

  # deployment configuration
  openshift_deployment_type=origin
  openshift_version=3.9.0
  openshift_pkg_version=3.7.1
  containerized=true

  # glusterfs configuration
  openshift_storage_glusterfs_namespace=glusterfs
  openshift_storage_glusterfs_name=storage

  # internal registry configuration
  openshift_hosted_registry_storage_kind=glusterfs
  openshift_registry_selector="region=infranodes"
  openshift_hosted_registry_replicas=3
  openshift_hosted_registry_storage_volume_size=190Gi

  # router configuration
  openshift_router_selector="region=routingnodes"

  # standard node configuration
  osm_default_node_selector="region=standardnodes"

  # master and API endpoint configuration
  openshift_master_cluster_hostname=master-lb.mycompany.internal
  openshift_master_cluster_public_hostname=console.mycompany.com
  openshift_master_default_subdomain=mycompany.com
  openshift_master_api_port=8443
  openshift_master_console_port=8443
  openshift_master_session_name=ssn
  openshift_public_ip="xx.xx.xx.xx"

  # router certificate configuration
  openshift_hosted_router_certificate={"certfile": "/home/ansibleuser/openshift-ansible/customCertificates/STAR_mycompany.crt", "keyfile": "/home/ansibleuser/openshift-ansible/customCertificates/mycompany.key", "cafile": "/home/ansibleuser/openshift-ansible/customCertificates/COMODORSADomainValidationSecureServerCA.crt"}

  # LDAP configuration
  openshift_master_identity_providers=[{'name': 'picv4_ldap', 'challenge': 'true', 'login': 'true', 'kind': 'LDAPPasswordIdentityProvider', 'attributes': {'id': ['dn'], 'email': ['mail'], 'name': ['cn'], 'preferredUsername': ['uid']}, 'bindDN': 'uid=ldapbind,cn=users,cn=accounts,dc=ggd,dc=mycompany', 'bindPassword': 'tetetetetetge', 'ca': '', 'insecure': 'true', 'url': 'ldap://ldap.picv4.mycompany:389/cn=users,cn=accounts,dc=picv4,dc=mycompany?uid'}]

  # audit policy configuration
  openshift_master_audit_config={"enabled": true, "auditFilePath": "/var/log/openpaas-oscp-audit/openpaas-oscp-audit.log", "maximumFileRetentionDays": 14, "maximumFileSizeMegabytes": 500, "maximumRetainedFiles": 5}

  # cluster logging configuration
  openshift_logging_install_logging="true"
  openshift_logging_es_pvc_dynamic="true"
  openshift_logging_es_pvc_size="100G"
  openshift_logging_curator_default_days="2"
  openshift_logging_curator_run_hour="24"
  openshift_master_logging_public_url="https://logs.mycompany.com"
  openshift_logging_es_nodeselector="region=infranodes"
  openshift_logging_kibana_ops_nodeselector="region=infranodes"
  openshift_logging_curator_ops_nodeselector="region=infranodes"

  # metrics configuration
  openshift_metrics_install_metrics="true"
  openshift_metrics_cassandra_storage_type="dynamic"
  openshift_metrics_duration=7
  openshift_metrics_cassandra_pvc_size="20G"
  openshift_metrics_cassandra_replicas=1
  openshift_metrics_cassandra_limits_memory="2Gi"
  openshift_metrics_cassandra_limits_cpu="2000m"
  openshift_metrics_cassandra_nodeselector="region=infranodes"
  openshift_master_metrics_public_url="https://metrics.mycompany.com"

  # GLUSTERFS NODES
  [glusterfs]
  storage01.mycompany.internal glusterfs_devices='[ "/dev/sdc"]' glusterfs_ip=10.39.57.31
  storage02.mycompany.internal glusterfs_devices='[ "/dev/sdc"]' glusterfs_ip=10.39.57.32
  storage03.mycompany.internal glusterfs_devices='[ "/dev/sdc"]' glusterfs_ip=10.39.57.33
  storage04.mycompany.internal glusterfs_devices='[ "/dev/sdc"]' glusterfs_ip=10.39.57.34

  # glusterfs config
  [glusterfs:vars]
  openshift_storage_glusterfs_nodeselector="glusterfs=standardstorage"
  openshift_storage_glusterfs_wipe="true"

  # GLUSTERFS NODES DEDICATED TO THE INTERNAL REGISTRY
  [glusterfs_registry]
  storage-registry01.mycompany.internal glusterfs_devices='[ "/dev/sdc"]' glusterfs_ip=10.39.57.41
  storage-registry02.mycompany.internal glusterfs_devices='[ "/dev/sdc"]' glusterfs_ip=10.39.57.42
  storage-registry03.mycompany.internal glusterfs_devices='[ "/dev/sdc"]' glusterfs_ip=10.39.57.43

  # CLUSTER NODES

  # Master VM group
  [masters]
  master0[1:2].mycompany.internal

  # etcd nodes
  [etcd]
  etcd01.mycompany.internal
  etcd02.mycompany.internal
  etcd03.mycompany.internal

  # OpenShift nodes
  [nodes]

  # Infra nodes
  infranode0[1:2].mycompany.internal openshift_node_labels="{'region' : 'infranodes'}" openshift_schedulable=true

  # Pic nodes
  picnode0[1:2].mycompany.internal openshift_node_labels="{'region' : 'picnodes'}" openshift_schedulable=true

  # Compilation nodes
  compilnode0[1:2].mycompany.internal openshift_node_labels="{'region' : 'compilnodes'}" openshift_schedulable=true

  # Routing nodes
  routeur0[1:2].mycompany.internal openshift_node_labels="{'region' : 'routingnodes'}"

  # Standard nodes
  node0[1:2].mycompany.internal openshift_node_labels="{'region' : 'standardnodes'}" openshift_schedulable=true

  # Masters
  master0[1:2].mycompany.internal openshift_node_labels="{'region' : 'masters'}" openshift_schedulable=true

  # glusterfs nodes
  storage0[1:4].mycompany.internal openshift_node_labels="{'region' : 'standardstorage'}"

  # glusterfs registry nodes
  storage-registry0[1:3].mycompany.internal openshift_node_labels="{'region' : 'registrystorage'}"

  # OpenShift node-specific variables
  [nodes:vars]
  openshift_docker_options=--log-driver json-file --log-opt max-size=1M --log-opt max-file=3 --selinux-enabled
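As a sanity check after editing an inventory this size, something like the following confirms that Ansible parses the groups and per-host variables as intended (a sketch; the inventory path matches the ansible-playbook command above):

  # Dump the parsed inventory and ping the gluster hosts
  ansible-inventory -i ./hosts/cluster-installation --list
  ansible -i ./hosts/cluster-installation glusterfs -m ping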

sdodson commented 6 years ago

/assign mjudeikis

mjudeikis commented 6 years ago

Checking. If you still have the env running, it would be good to see "oc describe" of the heketi pod.

ahmadou commented 6 years ago

Here goes:

oc describe pod heketi-storage

  Name:           heketi-storage-2-deploy
  Namespace:      glusterfs
  Node:           picnode02.mycompany.internal/10.39.57.103
  Start Time:     Wed, 04 Apr 2018 17:35:47 +0200
  Labels:         openshift.io/deployer-pod-for.name=heketi-storage-2
  Annotations:    openshift.io/deployment.name=heketi-storage-2
                  openshift.io/scc=restricted
  Status:         Failed
  IP:             10.130.0.2
  Containers:
    deployment:
      Container ID:   docker://1915d25e1cbbca0e63034a55c1f9100fb1d16ed527bb563e41294a17619aa77d
      Image:          openshift/origin-deployer:v3.7.1
      Image ID:       docker-pullable://docker.io/openshift/origin-deployer@sha256:2e39b45e1a68fd25647f0fd64b19d81b9dee04ee84ec49fefc2a28580dc9ab61
      Port:
      State:          Terminated
        Reason:       Error
        Exit Code:    1
        Started:      Wed, 04 Apr 2018 17:36:17 +0200
        Finished:     Wed, 04 Apr 2018 17:46:17 +0200
      Ready:          False
      Restart Count:  0
      Environment:
        KUBERNETES_MASTER:   https://master02.mycompany.internal:8443
        OPENSHIFT_MASTER:    https://master02.mycompany.internal:8443
        BEARER_TOKEN_FILE:   /var/run/secrets/kubernetes.io/serviceaccount/token
        OPENSHIFT_CA_DATA:   -----BEGIN CERTIFICATE-----
                             MIIC6jCCAdKgAwIBAgIBATANBgkqhkiG9w0BAQsFADAmMSQwIgYDVQQDDBtvcGVu
                             c2hpZnQtc2lnbmVyQDE1MjI4NDYyNDMwHhcNMTgwNDA0MTI1MDQyWhcNMjMwNDAz
                             MTI1MDQzWjAmMSQwIgYDVQQDDBtvcGVuc2hpZnQtc2lnbmVyQDE1MjI4NDYyNDMw
                             ggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQCsq/4LOJ2Vk+zdO8G/rOwq
                             glCgrhPFjklFiZk6QP2c5ZRpMr2lDRXFd/TTV5Umg5LE96F2EzsREhVj33uwijrj
                             5KnAvoWuCy0fN3s78lnyHkJHMuitMDMBKB8nR96wQNRNSHrBqyl/Aa/VApRIT0yF
                             AYK2hQ5FDJh/OKh0H6BOJg6muEeFC7zdfaDBIzaf9WyNJRjlslYYsR8W/qvlYH9t
                             6lyPpFf66uah5AHhSEqXkXHFdXVz60vgnFTYkPRaY8OmlNMtbL0OVkJ9YWEEIXZq
                             yeFL1lZBa19Bns/PDP/r1UtSP2MeMaNbRpu/dulDOomTF8vJlgMiw4cDAWzasIrT
                             AgMBAAGjIzAhMA4GA1UdDwEB/wQEAwICpDAPBgNVHRMBAf8EBTADAQH/MA0GCSqG
                             SIb3DQEBCwUAA4IBAQCF/Isqbnn4k1yjOysy7IaBcygEhF9RZMrC0mDMinmNgxv4
                             i1Vfo23lWEAG0+9rU5lhvmmt6Zj9w9hFKRi4VhTFjYvah+t4jAVWm2WJqhduFwGW
                             ojzQN2MmFjYvbXp5CTYzteZXn8rh9XUWn+YJpyydr0oAlW+TXgbZFQSHXEKHRVOT
                             MaJySnkA5NCnnwccxXZANOhCfK1fzZA8ddlIOxEao4dPbtq9bUyweIZ6cdLvDD8b
                             nXVWRbyChqqspOTiI/on1VX+fJ/zPvnGJaH4VXDGabkaKBGwnWt5R3ckc3KCYsB6
                             dVTvWQ7KupUNTzdfTM1w0hGQDT2CgXQP3YQG136a
                             -----END CERTIFICATE-----
        OPENSHIFT_DEPLOYMENT_NAME:      heketi-storage-2
        OPENSHIFT_DEPLOYMENT_NAMESPACE: glusterfs
      Mounts:
        /var/run/secrets/kubernetes.io/serviceaccount from deployer-token-l94cr (ro)
  Conditions:
    Type          Status
    Initialized   True
    Ready         False
    PodScheduled  True
  Volumes:
    deployer-token-l94cr:
      Type:       Secret (a volume populated by a Secret)
      SecretName: deployer-token-l94cr
      Optional:   false
  QoS Class:      BestEffort
  Node-Selectors:
  Tolerations:
  Events:
    FirstSeen  LastSeen  Count  From                                   SubObjectPath                Type    Reason                 Message
    50m        50m       1      default-scheduler                                                   Normal  Scheduled              Successfully assigned heketi-storage-2-deploy to picnode02.mycompany.internal
    50m        50m       1      kubelet, picnode02.mycompany.internal                               Normal  SuccessfulMountVolume  MountVolume.SetUp succeeded for volume "deployer-token-l94cr"
    50m        50m       1      kubelet, picnode02.mycompany.internal  spec.containers{deployment}  Normal  Pulling                pulling image "openshift/origin-deployer:v3.7.1"
    50m        50m       1      kubelet, picnode02.mycompany.internal  spec.containers{deployment}  Normal  Pulled                 Successfully pulled image "openshift/origin-deployer:v3.7.1"
    50m        50m       1      kubelet, picnode02.mycompany.internal  spec.containers{deployment}  Normal  Created                Created container
    50m        50m       1      kubelet, picnode02.mycompany.internal  spec.containers{deployment}  Normal  Started                Started container

mjudeikis commented 6 years ago

Can you please go to one of the running gluster pods and check: gluster volume list?

Is it the same install where you faced the first issue with the oc binary, or is it a fresh environment? It looks like the heketi database is not created, and it might be due to the last failure.

If the volume is not there, just delete the glusterfs project and rerun?
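For reference, one way to run that check without opening a terminal in the pod (a sketch; the pod name is a placeholder to be taken from the oc get output):

  # Find a running glusterfs pod, then run the gluster CLI inside it
  oc get pods -n glusterfs
  oc exec -n glusterfs <glusterfs-storage-pod> -- gluster volume list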

mjudeikis commented 6 years ago

I'm spinning up my env now to try to replicate this. But if you are in a position to join a screen-sharing session, it's here: https://bluejeans.com/2794238616/

ahmadou commented 6 years ago

It is the same install, but I've uninstalled it and reinstalled so many times I've lost count. Each time I got the same issue. I can start from scratch again if you want.

I also noticed I have no router pods, and I don't know if it's due to the install failing at that stage or if I should have had them already spinning.

I don't know how to execute the command you gave me. I can't access the pod's terminal (it gives me a warning about privileges even though I have the cluster admin role), and I don't seem to be able to docker exec -it bash into the pod container.

Edit:

I've run gluster volume list in the container and got:

heketidbstorage

Do you want me to redo a clean install?

mjudeikis commented 6 years ago

Router and other pods will come later; they are not needed in the initial install.

Did you do the uninstall using the gluster uninstall playbook?

And it should let you into the pods if you are cluster-admin. You might only be an admin of the project?
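A quick way to see which account oc is acting as and whether it actually holds cluster-admin (a sketch, assuming the client is logged in to the cluster):

  oc whoami
  # Look for your user or group next to the cluster-admin role
  oc get clusterrolebindings | grep -i cluster-admin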

ahmadou commented 6 years ago

I've not run the gluster uninstall, but I did delete the project and manually clean the LVM volume. I also use this custom playbo---

I'll try to run the uninstall playbook and restart, but can you give me the correct playbook to use?

mjudeikis commented 6 years ago

Try this one and rerun: https://github.com/openshift/openshift-ansible/blob/release-3.7/playbooks/openshift-glusterfs/uninstall.yml

I'm checking the playbooks as we chat.
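The invocation would look roughly like this, reusing the same inventory as the install (a sketch; whether the playbook lives under playbooks/ or playbooks/byo/ depends on the checked-out branch):

  ansible-playbook -i ./hosts/cluster-installation playbooks/openshift-glusterfs/uninstall.yml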

ahmadou commented 6 years ago

It fails because playbooks/init/main.yml cannot be found.

mjudeikis commented 6 years ago

Looks like the issue is with firewalls. I did a test in my lab and everything works. This should not block the release.

Working with @ahmadou offline in real time.

DanyC97 commented 6 years ago

@ahmadou for the future, please try to format your output (config, errors, etc.), thanks.

Also, I saw you are using ansible 2.5.0. While this is a different question not related to this issue, I'm curious to know from @sdodson whether we have already moved to this version or not.

ahmadou commented 6 years ago

@DanyC97 OK, will do. I upgraded the ansible version because it didn't want to start on 2.4, if I remember correctly.

mjudeikis commented 6 years ago

@ahmadou for iptables, try this on one of the nodes where you were not able to do a manual mount of the glusterfs volume:

iptables -I INPUT -m state --state NEW -m tcp -p tcp --dport 24007:24011 -j ACCEPT
iptables -I INPUT -m state --state NEW -m tcp -p tcp --dport 111 -j ACCEPT
iptables -I INPUT -m state --state NEW -m udp -p udp --dport 111 -j ACCEPT
iptables -I INPUT -m state --state NEW -m tcp -p tcp --dport 38465:38467 -j ACCEPT
iptables -I INPUT -m state --state NEW -p tcp -m multiport --dports 49152:49664 -j ACCEPT
iptables -I INPUT -m state --state NEW -m tcp -p tcp --dport 2222 -j ACCEPT
service iptables save
#iptables: Saving firewall rules to /etc/sysconfig/iptables:[  OK  ]
systemctl restart iptables.service
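Since the inventory above sets os_firewall_use_firewalld=True, the firewalld equivalent of those rules would be roughly the following (a sketch that mirrors the ports in the iptables commands above):

  firewall-cmd --permanent --add-port=24007-24011/tcp
  firewall-cmd --permanent --add-port=111/tcp --add-port=111/udp
  firewall-cmd --permanent --add-port=38465-38467/tcp
  firewall-cmd --permanent --add-port=49152-49664/tcp
  firewall-cmd --permanent --add-port=2222/tcp
  firewall-cmd --reload
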
ahmadou commented 6 years ago

@mjudeikis

Well, I didn't apply your suggestion, because the firewall rules in my setup are managed outside of the machines themselves.

I've allowed all traffic between nodes and it worked!! So the heketidbstorage issue was a firewall issue.

In summary: since you set up a lot of firewall rules, is it best not to set up too many rules of our own when installing the cluster, or does the documentation need an update?

I'm still getting an error. Now the heketi pod starts, but the install fails, because of a syntax issue I think:

  TASK [openshift_storage_glusterfs : Delete pre-existing glusterblock provisioner resources] ************************
  Thursday 05 April 2018 10:48:53 +0200 (0:00:00.761) 0:08:44.135 ********
  fatal: [master01.mycompany]: FAILED! => {"msg": "The conditional check 'not openshift_is_atomic | bool' failed. The error was: error while evaluating conditional (not openshift_is_atomic | bool): 'openshift_is_atomic' is undefined\n\nThe error appears to have been in '/home/ansibleuser/openshift-ansible/roles/openshift_storage_glusterfs/tasks/glusterblock_deploy.yml': line 2, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n---\n- name: Delete pre-existing glusterblock provisioner resources\n ^ here\n"}

mjudeikis commented 6 years ago

Just do git pull. This was fixed by https://github.com/openshift/openshift-ansible/pull/7772

The idea is that openshift-ansible will configure all the firewall rules for you. If you have an external firewall (in your case, big brother V), you need to replicate those rules there if you are under strict firewall management. If not, keep them loose and iptables on the boxes will do the job.
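Concretely, picking up the fix and rerunning would look something like this (a sketch, reusing the clone path and inventory that appear earlier in the thread):

  cd /home/ansibleuser/openshift-ansible
  git pull
  ansible-playbook -i ./hosts/cluster-installation playbooks/byo/openshift-glusterfs/config.yml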

ahmadou commented 6 years ago

OK, I got it.

The installation process completed the GlusterFS phase but now crashes in the metrics portion:

  TASK [openshift_metrics : generate hawkular-cassandra replication controllers] *************************************
  Thursday 05 April 2018 13:17:24 +0200 (0:00:00.388) 0:38:28.969 ********
  failed: [master01.mycompany.com] (item=1) => {"changed": false, "item": "1", "msg": "AnsibleUndefinedVariable: 'unicode object' has no attribute 'items'"}

Do you want me to open a new ticket?

mjudeikis commented 6 years ago

This is different stuff. I would suspect you are missing some variable in your inventory. This is outside this ticket, so we can close this one :)
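For what it's worth, one common trigger for that "'unicode object' has no attribute 'items'" message in the metrics role is a nodeselector variable supplied as a flat string where the role expects a dict. Whether that is the cause here is only an assumption, but the difference in the inventory would look like this:

  # String form (as in the inventory above) - templates that call .items() on it will fail
  openshift_metrics_cassandra_nodeselector="region=infranodes"

  # Dict form that matches what the roles expect (an assumption based on the error, not confirmed in this thread)
  openshift_metrics_cassandra_nodeselector={"region": "infranodes"}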

ahmadou commented 6 years ago

OK, thank you all for your assistance.

It was a pleasure. I will open a new ticket concerning that issue if I don't manage to find an explanation by myself.