openshift / openshift-ansible

Install and config an OpenShift 3.x cluster
https://try.openshift.com
Apache License 2.0

Control plane pods didn't come up #10029

Closed RobMokkink closed 5 years ago

RobMokkink commented 5 years ago

Description

Task openshift_control_plane : Wait for control plane pods to appear fails; I see the following error:

failed: [osemaster.lab.local] (item=etcd) => {"attempts": 60, "changed": false, "item": "etcd", "msg": {"cmd": "/bin/oc get pod master-etcd-osemaster.lab.local -o json -n kube-system", "results": [{}], "returncode": 1, "stderr": "The connection to the server osemaster.lab.local:8443 was refused - did you specify the right host or port?\n", "stdout": ""}}
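A quick way to confirm that nothing is answering on the master API port (a suggested check, not part of the original report; substitute your own master hostname):

# Probe the master API health endpoint directly
curl -k https://osemaster.lab.local:8443/healthz

# Check whether any process is listening on 8443 on the master
ss -tlnp | grep 8443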

Version
Ansible version:
ansible 2.6.3
  config file = /home/devops/ansible.cfg
  configured module search path = [u'/home/devops/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Jul 13 2018, 13:06:57) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

openshift-ansible-3.10.43-1.git.0.4794155.el7.noarch
Steps To Reproduce
  1. Install the environment as specified in the inventory file (see the command sketch after these steps)
  2. Observe the issue
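For reference, a 3.10 install with this inventory would typically be driven roughly as follows (a sketch under the assumption that the RPM-installed playbooks are used; the inventory path is a placeholder):

# Run the prerequisites play, then the full deploy, against the inventory shown below
ansible-playbook -i /path/to/inventory /usr/share/ansible/openshift-ansible/playbooks/prerequisites.yml
ansible-playbook -i /path/to/inventory /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml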
Expected Results


Installation of OpenShift 3.10 to succeed.
Observed Results


failed: [osemaster.lab.local] (item=etcd) => {"attempts": 60, "changed": false, "item": "etcd", "msg": {"cmd": "/bin/oc get pod master-etcd-osemaster.lab.local -o json -n kube-system", "results": [{}], "returncode": 1, "stderr": "The connection to the server osemaster.lab.local:8443 was refused - did you specify the right host or port?\n", "stdout": ""}}
Additional Information


OS: CentOS Linux release 7.5.1804 (Core)

Inventory:
[masters]
osemaster.lab.local

[nodes]
# A hosted registry, by default, will only be deployed on nodes labeled
# "region=infra".
osemaster.lab.local openshift_node_group_name="node-config-master" openshift_schedulable=True
osenode1.lab.local openshift_node_group_name="node-config-infra" openshift_schedulable=True
osenode2.lab.local openshift_node_group_name="node-config-infra" openshift_schedulable=True
osenode3.lab.local openshift_node_group_name="node-config-infra" openshift_schedulable=True

[etcd]
osemaster.lab.local

[glusterfs]
osenode1.lab.local glusterfs_ip=10.0.0.21 glusterfs_devices='[ "/dev/vdc" ]'
osenode2.lab.local glusterfs_ip=10.0.0.22 glusterfs_devices='[ "/dev/vdc" ]'
osenode3.lab.local glusterfs_ip=10.0.0.23 glusterfs_devices='[ "/dev/vdc" ]'

[glusterfs_registry]
osenode1.lab.local glusterfs_ip=10.0.0.21 glusterfs_devices='[ "/dev/vdc" ]'
osenode2.lab.local glusterfs_ip=10.0.0.22 glusterfs_devices='[ "/dev/vdc" ]'
osenode3.lab.local glusterfs_ip=10.0.0.23 glusterfs_devices='[ "/dev/vdc" ]'

[OSEv3:children]
masters
nodes
etcd
glusterfs
glusterfs_registry

[OSEv3:vars]
ansible_ssh_user=devops
openshift_deployment_type=origin

# Release to install
openshift_release="3.10"

# Pkg version
openshift_pkg_version=-3.10.0

# Image tag
openshift_image_tag=v3.10.0

# Authentication
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}]

# Use firewalld
os_firewall_use_firewalld=True

# Router node on the master
openshift_router_selector='node-role.kubernetes.io/master=true'

# Project pods on infra nodes
osm_default_node_selector='node-role.kubernetes.io/infra=true'

# Registry on infra nodes 
openshift_registry_selector='node-role.kubernetes.io/infra=true'

# Glusterfs
openshift_storage_glusterfs_wipe=false

# Registry
openshift_hosted_registry_storage_kind=glusterfs
openshift_hosted_registry_storage_volume_size=10Gi

# Default subdomain
openshift_master_default_subdomain=cloudapps.lab.local

# Hosts only need 4 GB of RAM, as this is a lab environment
openshift_check_min_host_memory_gb=4

# Skip certain checks
#openshift_disable_check=memory_availability,disk_availability,docker_image_availability

# Install examples
openshift_install_examples=true

# Configure the multi-tenant SDN plugin (default is 'redhat/openshift-ovs-subnet')
os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant'

michaelgugino commented 5 years ago

You'll need to investigate your network settings and/or why the API pod didn't come up.
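A minimal way to start that investigation on the master (suggested commands, not from the original comment; substitute the actual container ID):

# List the control-plane containers, including ones that already exited
docker ps -a | grep -E 'master-api|master-controllers|etcd'

# Read the logs of the failed API container to see why it died
docker logs <container-id>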

RobMokkink commented 5 years ago

Network is ok.

These are the pulled images:

docker.io/openshift/origin-node            v3.10.0   b2e0cbf1c449   6 days ago     1.31 GB
docker.io/openshift/origin-control-plane   v3.10.0   57543ab622d1   6 days ago     820 MB
docker.io/openshift/origin-pod             v3.10.0   1611d858742d   6 days ago     223 MB
quay.io/coreos/etcd                        v3.2.22   f5dd2137a4f    3 months ago   37.3 MB

There is one failed container (docker ps -a):

CONTAINER ID   IMAGE          COMMAND                  CREATED         STATUS                            PORTS   NAMES
3982704b5d52   57543ab622d1   "/bin/bash -c '#!/..."   2 minutes ago   Exited (255) About a minute ago           k8s_api_master-api-osemaster.lab.local_kube-system_9ca23c5815da8ed1d3dca61d87e1f6ab_7

The logs of this container give me the following output:

I0912 18:04:08.309823 1 plugins.go:84] Registered admission plugin "NamespaceLifecycle"
I0912 18:04:08.309921 1 plugins.go:84] Registered admission plugin "Initializers"
I0912 18:04:08.309930 1 plugins.go:84] Registered admission plugin "ValidatingAdmissionWebhook"
I0912 18:04:08.309938 1 plugins.go:84] Registered admission plugin "MutatingAdmissionWebhook"
I0912 18:04:08.309949 1 plugins.go:84] Registered admission plugin "AlwaysAdmit"
I0912 18:04:08.309960 1 plugins.go:84] Registered admission plugin "AlwaysPullImages"
I0912 18:04:08.309969 1 plugins.go:84] Registered admission plugin "LimitPodHardAntiAffinityTopology"
I0912 18:04:08.309979 1 plugins.go:84] Registered admission plugin "DefaultTolerationSeconds"
I0912 18:04:08.309988 1 plugins.go:84] Registered admission plugin "AlwaysDeny"
I0912 18:04:08.310001 1 plugins.go:84] Registered admission plugin "EventRateLimit"
I0912 18:04:08.310009 1 plugins.go:84] Registered admission plugin "DenyEscalatingExec"
I0912 18:04:08.310019 1 plugins.go:84] Registered admission plugin "DenyExecOnPrivileged"
I0912 18:04:08.310027 1 plugins.go:84] Registered admission plugin "ExtendedResourceToleration"
I0912 18:04:08.310034 1 plugins.go:84] Registered admission plugin "OwnerReferencesPermissionEnforcement"
I0912 18:04:08.310044 1 plugins.go:84] Registered admission plugin "ImagePolicyWebhook"
I0912 18:04:08.310053 1 plugins.go:84] Registered admission plugin "InitialResources"
I0912 18:04:08.310060 1 plugins.go:84] Registered admission plugin "LimitRanger"
I0912 18:04:08.310068 1 plugins.go:84] Registered admission plugin "NamespaceAutoProvision"
I0912 18:04:08.310075 1 plugins.go:84] Registered admission plugin "NamespaceExists"
I0912 18:04:08.310083 1 plugins.go:84] Registered admission plugin "NodeRestriction"
I0912 18:04:08.310091 1 plugins.go:84] Registered admission plugin "PersistentVolumeLabel"
I0912 18:04:08.310098 1 plugins.go:84] Registered admission plugin "PodNodeSelector"
I0912 18:04:08.310106 1 plugins.go:84] Registered admission plugin "PodPreset"
I0912 18:04:08.310114 1 plugins.go:84] Registered admission plugin "PodTolerationRestriction"
I0912 18:04:08.310125 1 plugins.go:84] Registered admission plugin "ResourceQuota"
I0912 18:04:08.310133 1 plugins.go:84] Registered admission plugin "PodSecurityPolicy"
I0912 18:04:08.310140 1 plugins.go:84] Registered admission plugin "Priority"
I0912 18:04:08.310151 1 plugins.go:84] Registered admission plugin "SecurityContextDeny"
I0912 18:04:08.310157 1 plugins.go:84] Registered admission plugin "ServiceAccount"
I0912 18:04:08.310166 1 plugins.go:84] Registered admission plugin "DefaultStorageClass"
I0912 18:04:08.310173 1 plugins.go:84] Registered admission plugin "PersistentVolumeClaimResize"
I0912 18:04:08.310181 1 plugins.go:84] Registered admission plugin "StorageObjectInUseProtection"
F0912 18:04:38.314011 1 start_api.go:68] dial tcp [::1]:2379: getsockopt: connection refused

I see etcd is running and listening on the IPv4 address:

tcp   0   0   10.0.0.20:2379   0.0.0.0:*   LISTEN   0   160611   30956/etcd

So it looks like the API container wants to connect to the IPv6 address instead of the IPv4 address. Also, I find it strange that the CoreOS etcd container is used?
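One way to check that theory is to look at how localhost and the master hostname resolve on the master itself (suggested checks, not part of the original comment):

# Does localhost resolve to ::1 ahead of 127.0.0.1?
getent ahosts localhost

# How does the master hostname resolve, and what does /etc/hosts contain?
getent ahosts osemaster.lab.local
cat /etc/hosts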

RobMokkink commented 5 years ago

openshift-ansible.log

michaelgugino commented 5 years ago

@RobMokkink CoreOS is the upstream for etcd; we're just using that image for origin/OKD installs for that reason.

Most likely DNS in your environment is causing IPv6 AAAA records to be returned first; this is probably not what you want.
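A quick way to see which records the resolver hands back (assumed commands, not part of the original reply):

# Compare the A and AAAA answers for the master hostname
dig +short A osemaster.lab.local
dig +short AAAA osemaster.lab.local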

RobMokkink commented 5 years ago

@michaelgugino, found the issue: in the VM template there was a cloud-init problem, manage_etc_hosts was turned on. So my bad, sorry. This issue can be closed.
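For anyone hitting the same thing: the culprit is the cloud-init manage_etc_hosts setting rewriting /etc/hosts in the template. A hedged way to check for it and turn it off (file names are illustrative):

# See whether cloud-init is configured to manage /etc/hosts
grep -R "manage_etc_hosts" /etc/cloud/cloud.cfg /etc/cloud/cloud.cfg.d/

# Override it with a drop-in so /etc/hosts is left alone on subsequent boots
echo "manage_etc_hosts: false" | sudo tee /etc/cloud/cloud.cfg.d/99-disable-manage-etc-hosts.cfg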

michaelgugino commented 5 years ago

@RobMokkink glad you have it all sorted!