openshift / openshift-ansible

Install and config an OpenShift 3.x cluster
https://try.openshift.com
Apache License 2.0

openshift_control_plane : pods failed to appear #11375

Closed RashaHaj closed 5 years ago

RashaHaj commented 5 years ago

Description

Hi, I'm trying to install an OpenShift v3.11 cluster on OpenStack using openshift-ansible. However, the playbook deploy_cluster.yml encounters the error below:

TASK [openshift_control_plane : Wait for control plane pods to appear] ******************************************************************
FAILED - RETRYING: Wait for control plane pods to appear (60 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (59 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (58 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (57 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (56 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (55 retries left).
Version
Steps To Reproduce
  1. run playbooks/prerequisites.yml
  2. run playbooks/deploy_cluster.yml
Expected Results

The cluster to be deployed

Example command and output or error messages

#tailf /var/log/messages

master origin-node: E0320 03:40:04.028981   38205 reflector.go:136] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to list *v1.Node: Get https://master.lab.example.com:8443/api/v1/nodes?fieldSelector=metadata.name%3Dmaster.lab.example.com&limit=500&resourceVersion=0: dial tcp 192.168.1.5:8443: connect: connection refused
#docker logs --tail -10 b9b0cfc5f98a

E0320 14:27:08.073902       1 leaderelection.go:234] error retrieving resource lock kube-system/openshift-master-controllers: Get https://master.lab.example.com:8443/api/v1/namespaces/kube-system/configmaps/openshift-master-controllers: dial tcp 192.168.1.5:8443: connect: connection refused
Additional Information
[root@master ~]# telnet 192.168.1.5 8443
Trying 192.168.1.5...
telnet: connect to address 192.168.1.5: Connection refused
[root@master ~]# netstat -ntlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:10250           0.0.0.0:*               LISTEN      79549/hyperkube
tcp        0      0 192.168.1.5:2379        0.0.0.0:*               LISTEN      83046/etcd
tcp        0      0 192.168.1.5:2380        0.0.0.0:*               LISTEN      83046/etcd
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      1/systemd
tcp        0      0 0.0.0.0:20048           0.0.0.0:*               LISTEN      3894/rpc.mountd
tcp        0      0 0.0.0.0:53682           0.0.0.0:*               LISTEN      3886/rpc.statd
tcp        0      0 172.17.0.1:53           0.0.0.0:*               LISTEN      3803/dnsmasq
tcp        0      0 192.168.1.5:53          0.0.0.0:*               LISTEN      3803/dnsmasq
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      3887/sshd
tcp        0      0 127.0.0.1:56921         0.0.0.0:*               LISTEN      79549/hyperkube
tcp        0      0 0.0.0.0:2049            0.0.0.0:*               LISTEN      -

Any idea how to fix this, please? Thanks!

nagonzalez commented 5 years ago

First thing that sticks out is that you're on Ansible 2.7.x, which is unsupported.

I'd downgrade to 2.6.5+ and then give it another shot
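
A minimal sketch of what that could look like, assuming Ansible came from RPMs on RHEL/CentOS 7 and an older build is still available in your repos (use pip instead if that's how it was installed):

$ ansible --version
# RPM-based install:
$ yum downgrade ansible-2.6.5
# pip-based install:
$ pip install 'ansible==2.6.5'
$ ansible --version   # confirm a 2.6.x release is now active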

RashaHaj commented 5 years ago

I downgraded Ansible to 2.6.5, but the problem doesn't seem to be resolved :(

FAILED - RETRYING: Wait for control plane pods to appear (2 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (1 retries left).
failed: [master.lab.example.com] (item=etcd) => {"attempts": 60, "changed": false, "item": "etcd", "msg": {"cmd": "/usr/bin/oc get pod master-etcd-master.lab.example.com -o json -n kube-system", "results": [{}], "returncode": 1, "stderr": "The connection to the server master.lab.example.com:8443 was refused - did you specify the right host or port?\n", "stdout": ""}}
FAILED - RETRYING: Wait for control plane pods to appear (60 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (59 retries left).
nagonzalez commented 5 years ago

Per your netstat output, it doesn't look like you're running a master container on 8443.

If you run docker ps -a on the master, you should see the IDs of the failed containers. What's the log output of the failed containers?

RashaHaj commented 5 years ago
[root@master centos]# docker ps -a
CONTAINER ID        IMAGE               COMMAND                  CREATED              STATUS                        PORTS               NAMES
b697460b27a7        ff5dd2137a4f        "/bin/sh -c '#!/bi..."   23 seconds ago       Exited (1) 21 seconds ago                         k8s_etcd_master-etcd-master.lab.example.com_kube-system_e5014392f56ecd6362ebb2005a64946c_1157
6be33fb802a0        01b05abc0861        "/bin/bash -c '#!/..."   About a minute ago   Exited (255) 42 seconds ago                       k8s_api_master-api-master.lab.example.com_kube-system_b24b15710309f0062b93e07af49cb464_1056
8f2a6a8863a7        01b05abc0861        "/bin/bash -c '#!/..."   22 hours ago         Up 22 hours                                       k8s_controllers_master-controllers-master.lab.example.com_kube-system_3210f8756194a0d2e374db4a71b81896_5

Nothing seems incorrect with the exited containers. It's the last Up container (k8s_controllers_master-controllers-master.lab.example.com_kube-system_3210f8756194a0d2e374db4a71b81896_5) that always shows the errors below:

[root@master centos]# docker logs --tail -5 8f2a6a8863a7

E0322 21:05:23.098604       1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Node: Get https://master.lab.example.com:8443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 192.168.1.5:8443: connect: connection refused
E0322 21:05:23.099550       1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.StorageClass: Get https://master.lab.example.com:8443/apis/storage.k8s.io/v1/storageclasses?limit=500&resourceVersion=0: dial tcp 192.168.1.5:8443: connect: connection refused
E0322 21:05:23.100544       1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Service: Get https://master.lab.example.com:8443/api/v1/services?limit=500&resourceVersion=0: dial tcp 192.168.1.5:8443: connect: connection refused
E0322 21:05:23.101604       1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.PersistentVolume: Get https://master.lab.example.com:8443/api/v1/persistentvolumes?limit=500&resourceVersion=0: dial tcp 192.168.1.5:8443: connect: connection refused
E0322 21:05:23.102789       1 reflector.go:136] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:176: Failed to list *v1.Pod: Get https://master.lab.example.com:8443/api/v1/pods?fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&limit=500&resourceVersion=0: dial tcp 192.168.1.5:8443: connect: connection refused
E0322 21:05:23.632365       1 leaderelection.go:234] error retrieving resource lock kube-system/kube-controller-manager: Get https://master.lab.example.com:8443/api/v1/namespaces/kube-system/configmaps/kube-controller-manager: dial tcp 192.168.1.5:8443: connect: connection refused
E0322 21:05:24.095060       1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta1.ReplicaSet: Get https://master.lab.example.com:8443/apis/extensions/v1beta1/replicasets?limit=500&resourceVersion=0: dial tcp 192.168.1.5:8443: connect: connection refused
E0322 21:05:24.095865       1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta1.PodDisruptionBudget: Get https://master.lab.example.com:8443/apis/policy/v1beta1/poddisruptionbudgets?limit=500&resourceVersion=0: dial tcp 192.168.1.5:8443: connect: connection refused
E0322 21:05:24.096860       1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta1.StatefulSet: Get https://master.lab.example.com:8443/apis/apps/v1beta1/statefulsets?limit=500&resourceVersion=0: dial tcp 192.168.1.5:8443: connect: connection refused
E0322 21:05:24.097924       1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.PersistentVolumeClaim: Get https://master.lab.example.com:8443/api/v1/persistentvolumeclaims?limit=500&resourceVersion=0: dial tcp 192.168.1.5:844

I thought to check etcd, and it doesn't appear to be working properly:


[root@master centos]# systemctl status etcd
● etcd.service - Etcd Server
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2019-03-22 17:27:38 CET; 19h ago
 Main PID: 34853 (etcd)
   Memory: 12.7M
   CGroup: /system.slice/etcd.service
           └─34853 /usr/bin/etcd --name=master.lab.example.com --data-dir=/var/lib/etcd/default.etcd --listen-client-urls=https://192....

Mar 23 13:02:29 master.lab.example.com etcd[34853]: rejected connection from "192.168.1.5:64902" (error "remote error: tls: bad ....com")
Mar 23 13:02:59 master.lab.example.com etcd[34853]: rejected connection from "192.168.1.5:49348" (error "EOF", ServerName "")
Mar 23 13:08:03 master.lab.example.com etcd[34853]: rejected connection from "192.168.1.5:57716" (error "remote error: tls: bad ....com")
Mar 23 13:09:53 master.lab.example.com etcd[34853]: rejected connection from "192.168.1.5:60830" (error "remote error: tls: bad ....com")
[root@master centos]# journalctl -u etcd --since=today

Mar 23 13:10:39 master.lab.example.com etcd[34853]: rejected connection from "192.168.1.5:62190" (error "remote error: tls: bad certificate", ServerName "master.lab.example.com")
Mar 23 13:11:44 master.lab.example.com etcd[34853]: rejected connection from "192.168.1.5:64102" (error "remote error: tls: bad certificate", ServerName "master.lab.example.com")
Mar 23 13:12:58 master.lab.example.com etcd[34853]: rejected connection from "192.168.1.5:49896" (error "remote error: tls: bad certificate", ServerName "master.lab.example.com")
Mar 23 13:14:55 master.lab.example.com etcd[34853]: rejected connection from "192.168.1.5:53326" (error "remote error: tls: bad certificate", ServerName "master.lab.example.com")
Mar 23 13:18:06 master.lab.example.com etcd[34853]: rejected connection from "192.168.1.5:58666" (error "remote error: tls: bad certificate", ServerName "master.lab.example.com")

A certificate problem? How do I fix it?
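
One way to reproduce the handshake outside of OpenShift, as a sketch (assuming the RPM-installed etcd 3.2 with the v2 etcdctl flags, and the cert paths from /etc/etcd/etcd.conf):

$ etcdctl --ca-file /etc/etcd/ca.crt \
          --cert-file /etc/etcd/peer.crt \
          --key-file /etc/etcd/peer.key \
          --endpoints https://192.168.1.5:2379 cluster-health
# a bare TLS probe with the same material:
$ openssl s_client -connect 192.168.1.5:2379 \
    -CAfile /etc/etcd/ca.crt -cert /etc/etcd/peer.crt -key /etc/etcd/peer.key </dev/null

If etcdctl also reports a bad certificate, the problem is in the certificates themselves rather than in how the master pods use them.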

RashaHaj commented 5 years ago

Here is the content of /etc/etcd/etcd.conf:

ETCD_NAME=master.lab.example.com
ETCD_LISTEN_PEER_URLS=https://192.168.1.5:2380
ETCD_DATA_DIR=/var/lib/etcd/default.etcd
#ETCD_WAL_DIR=
#ETCD_SNAPSHOT_COUNT=10000
ETCD_HEARTBEAT_INTERVAL=500
ETCD_ELECTION_TIMEOUT=2500
ETCD_LISTEN_CLIENT_URLS=https://192.168.1.5:2379
#ETCD_MAX_SNAPSHOTS=5
#ETCD_MAX_WALS=5
#ETCD_CORS=

#[cluster]
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://192.168.1.5:2380
ETCD_INITIAL_CLUSTER=master.lab.example.com=https://192.168.1.5:2380
ETCD_INITIAL_CLUSTER_STATE=new
ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-1
#ETCD_DISCOVERY=
#ETCD_DISCOVERY_SRV=
#ETCD_DISCOVERY_FALLBACK=proxy
#ETCD_DISCOVERY_PROXY=
ETCD_ADVERTISE_CLIENT_URLS=https://192.168.1.5:2379
#ETCD_STRICT_RECONFIG_CHECK=false
#ETCD_AUTO_COMPACTION_RETENTION=0
#ETCD_ENABLE_V2=true
ETCD_QUOTA_BACKEND_BYTES=4294967296

#[proxy]
#ETCD_PROXY=off
#ETCD_PROXY_FAILURE_WAIT=5000
#ETCD_PROXY_REFRESH_INTERVAL=30000
#ETCD_PROXY_DIAL_TIMEOUT=1000
#ETCD_PROXY_WRITE_TIMEOUT=5000
#ETCD_PROXY_READ_TIMEOUT=0
#[security]
ETCD_TRUSTED_CA_FILE=/etc/etcd/ca.crt
ETCD_CLIENT_CERT_AUTH=true
ETCD_CERT_FILE=/etc/etcd/server.crt
ETCD_KEY_FILE=/etc/etcd/server.key
#ETCD_AUTO_TLS=false
ETCD_PEER_TRUSTED_CA_FILE=/etc/etcd/ca.crt
ETCD_PEER_CLIENT_CERT_AUTH=true
ETCD_PEER_CERT_FILE=/etc/etcd/peer.crt
ETCD_PEER_KEY_FILE=/etc/etcd/peer.key
#ETCD_PEER_AUTO_TLS=false

#[logging]
ETCD_DEBUG=False

#[profiling]
#ETCD_ENABLE_PPROF=false
#ETCD_METRICS=basic
#
#[auth]
#ETCD_AUTH_TOKEN=simple
RashaHaj commented 5 years ago

Does anyone know what the problem with the TLS certificate is, or can you give me a line of research, please? :(

nagonzalez commented 5 years ago

Your etcd config file looks exactly like my working one, except for this line:

ETCD_DATA_DIR=/var/lib/etcd/

but that probably doesn't have anything to do with it.

Last thing I'd suggest is to verify the Subject Name of your etcd certs:

openssl x509 -in /etc/etcd/peer.crt -text -noout
openssl x509 -in /etc/etcd/server.crt -text -noout

They should match your hosts' FQDNs.
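
If you only want the fields that matter, something like this (a sketch) pulls out the subject and the SANs; both should carry the FQDN used in your inventory:

$ openssl x509 -in /etc/etcd/server.crt -noout -subject
$ openssl x509 -in /etc/etcd/server.crt -noout -text | grep -A1 'Subject Alternative Name'
$ openssl x509 -in /etc/etcd/peer.crt -noout -subject
$ openssl x509 -in /etc/etcd/peer.crt -noout -text | grep -A1 'Subject Alternative Name'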

RashaHaj commented 5 years ago

When running openssl x509 -in /etc/etcd/peer.crt -text -noout and openssl x509 -in /etc/etcd/server.crt -text -noout, the certificates appear to be incorrect:

  Subject: CN=192.168.1.5
 IP Address:192.168.1.5, DNS:192.168.1.5

I replaced it with this:

  Subject: CN=master.lab.example.com
 IP Address:192.168.1.5, DNS:master.lab.example.com

I replaced it everywhere the inconsistency appears (server.crt, peer.crt, and under generated_certs... in origin/master/master.etcd-client.crt) and restarted etcd, but I still get the same message. I even relaunched the playbook prerequisites.yml and then deploy_cluster.yml; however, the output of

openssl x509 -in /etc/etcd/peer.crt -text -noout and openssl x509 -in /etc/etcd/server.crt -text -noout

is always the same.

How do I update the certificates, knowing that I already tried update-ca-trust enable and update-ca-trust extract?
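
For reference, openshift-ansible ships redeploy playbooks for exactly this, which is usually safer than editing certificate files by hand; a sketch, assuming the 3.11 playbook layout (check the exact path in your checkout):

$ cd /path/to/openshift-ansible
$ ansible-playbook -i /etc/ansible/hosts playbooks/openshift-etcd/redeploy-certificates.yml
# then restart etcd and the master API/controllers so they pick up the new certs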

nagonzalez commented 5 years ago

That would explain the error.

What does your inventory file look like?

RashaHaj commented 5 years ago

cat /etc/ansible/hosts

[OSEv3:children]
masters
nodes
etcd
nfs

[OSEv3:vars]
ansible_ssh_user=root
openshift_deployment_type=origin
containerized=true
os_firewall_use_firewalld=true
openshift_clock_enabled=true
openshift_release=v3.11.0
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}]
openshift_master_htpasswd_users={'admin': '$apr1$RbOvaj8r$LEqJqG6V/O/i7Pfyyyyyy.', 'user': '$apr1$MfsFK97I$enQjqHCh2LL8w4EBwNrrrr'}
openshift_public_hostname=master.lab.example.com
openshift_master_default_subdomain=cloudapps.lab.example.com
openshift_disable_check=memory_availability,disk_availability,docker_storage
openshift_docker_insecure_registries=172.30.0.0/16

[masters]
master.lab.example.com containerized=false

[nodes]
master.lab.example.com openshift_schedulable=true openshift_node_group_name='node-config-master-infra'
node1.lab.example.com openshift_node_group_name='node-config-compute'
node2.lab.example.com openshift_node_group_name='node-config-compute'

[etcd]
master.lab.example.com

[nfs]
master.lab.example.com

RashaHaj commented 5 years ago

I started the installation from scratch, and this time CN and DNS are correctly set. OpenShift is now listening on port 8443. But the playbook fails at this task:

TASK [Approve node certificates when bootstrapping] *************************************************************************************
FAILED - RETRYING: Approve node certificates when bootstrapping (30 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (29 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (28 retries left).
Failure summary:

  1. Hosts:    master.lab.example.com
     Play:     Approve any pending CSR requests from inventory nodes
     Task:     Approve node certificates when bootstrapping
     Message:  Could not find csr for nodes: node2.lab.example.com, node1.lab.example.com

The hostname is correct on the master and the two nodes, and the SSH connection works just fine between the master and the nodes.
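
For reference, pending CSRs can also be inspected and approved manually from the master; a sketch, assuming oc works with the admin kubeconfig:

$ oc get csr
$ oc adm certificate approve <csr-name>
# or approve everything still pending:
$ oc get csr -o name | xargs oc adm certificate approve

If oc get csr shows nothing at all for the nodes, the node service on node1/node2 probably never reached the master; journalctl -u origin-node on the nodes is the next place to look.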

nagonzalez commented 5 years ago

awesome. we're getting closer :)

log into your master and run oc get nodes. What do you see?

this is what I get on mine:

oc get nodes
NAME        STATUS    ROLES           AGE       VERSION
ocmaster1   Ready     master          5d        v1.11.0+d4cacc0
ocmaster2   Ready     master          5d        v1.11.0+d4cacc0
ocmaster3   Ready     master          5d        v1.11.0+d4cacc0
ocnode1     Ready     compute,infra   5d        v1.11.0+d4cacc0
ocnode2     Ready     compute,infra   5d        v1.11.0+d4cacc0
ocnode3     Ready     compute,infra   5d        v1.11.0+d4cacc0
RashaHaj commented 5 years ago
[root@master centos]# oc get nodes
Unable to connect to the server: Forbidden

Pushing the investigation further, I notice that something strange is happening with etcd:

[root@master system]# systemctl status etcd
Unit etcd.service could not be found.

I can't find the unit file under /usr/lib/systemd/system nor under /etc/systemd/system/multi-user.target.wants/. The package has disappeared. With the first installation, I had:

[root@master ~]# rpm -qa | grep etcd
etcd-3.2.22-1.el7.x86_64

However, when running netstat, I see an etcd process listening on 2379 and 2380. I don't know how this could happen :(

[root@master centos]# netstat -ntlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:10250           0.0.0.0:*               LISTEN      27542/hyperkube
tcp        0      0 192.168.1.7:2379        0.0.0.0:*               LISTEN      102269/etcd
tcp        0      0 192.168.1.7:2380        0.0.0.0:*               LISTEN      102269/etcd
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      1/systemd
tcp        0      0 0.0.0.0:20048           0.0.0.0:*               LISTEN      1598/rpc.mountd
tcp        0      0 0.0.0.0:8053            0.0.0.0:*               LISTEN      11048/openshift
tcp        0      0 192.168.1.7:53          0.0.0.0:*               LISTEN      19624/dnsmasq
tcp        0      0 172.17.0.1:53           0.0.0.0:*               LISTEN      19624/dnsmasq
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      3646/sshd
tcp        0      0 0.0.0.0:8443            0.0.0.0:*               LISTEN      11048/openshift
tcp        0      0 0.0.0.0:8444            0.0.0.0:*               LISTEN      11071/openshift
tcp        0      0 0.0.0.0:63422           0.0.0.0:*               LISTEN      1568/rpc.statd
tcp        0      0 0.0.0.0:2049            0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.1:57029         0.0.0.0:*               LISTEN      27542/hyperkube
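
A possible explanation (an assumption worth checking): in a 3.10/3.11 install, etcd normally runs as a static pod on the master rather than from the etcd RPM's systemd unit, so etcd.service being absent while an etcd process still listens on 2379/2380 would be expected. A quick check, as a sketch:

$ ls /etc/origin/node/pods/
$ docker ps --filter name=etcd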
RashaHaj commented 5 years ago

After several attempts, the playbook now completes. But I'm still not able to see the nodes:

TASK [Set Master install 'Complete'] ****************************************************************************************************
ok: [master.lab.example.com]

PLAY RECAP ******************************************************************************************************************************
localhost                  : ok=12   changed=0    unreachable=0    failed=0
master.lab.example.com     : ok=300  changed=63   unreachable=0    failed=0
node1.lab.example.com      : ok=16   changed=0    unreachable=0    failed=0
node2.lab.example.com      : ok=16   changed=0    unreachable=0    failed=0

INSTALLER STATUS ************************************************************************************************************************
Initialization  : Complete (0:00:34)
Master Install  : Complete (0:03:35)

[root@master playbooks]# oc get nodes
Unable to connect to the server: Forbidden

Still, oc appears to be installed!

[root@master openshift-master]# oc version
oc v3.11.0+62803d0-1
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO
Unable to connect to the server: Forbidden
[root@master openshift-master]# openshift version
openshift v3.11.0+62803d0-1
[root@master openshift-master]# kubectl version
Client Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.0+d4cacc0", GitCommit:"d4cacc0", GitTreeState:"clean", BuildDate:"2018-10-15T09:45:30Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
Unable to connect to the server: Forbidden
nagonzalez commented 5 years ago

wow, that's really weird given you're running the master on port 8443

take a look at your ~/.kube/config file

What's the cluster value for the system:admin user?
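
If the config in your home directory looks stale, it may also be worth trying the kubeconfig the installer writes for the master directly; a sketch, assuming the default 3.11 paths:

$ oc --config=/etc/origin/master/admin.kubeconfig get nodes
# if that works, copy it over the user config:
$ cp /etc/origin/master/admin.kubeconfig ~/.kube/config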

RashaHaj commented 5 years ago
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUM2akNDQWRLZ0F3SUJBZ0lCQVRBTkJna3Foa2lHOXcwQkFRc0ZBREFtTVNRd0lnWURWUVFEREJ0dmNHVnUKYzJocFpuUXRjMmxuYm1WeVFERTFOVFF3TXpreE56a3dIaGNOTVRrd016TXhNVE16TWpVNVdoY05NalF3TXpJNQpNVE16TXpBd1dqQW1NU1F3SWdZRFZRUUREQnR2Y0dWdWMyaHBablF0YzJsbmJtVnlRREUxTlRRd016a3hOemt3CmdnRWlNQTBHQ1NxR1NJYjNEUUVCQVFVQUE0SUJEd0F3Z2dFS0FvSUJBUURPYi90WGtzaFRpQ3VWd3NNOXNnSGoKK1I0MkZHNlJ2SVEwM2RQLzExdWNoTEVVRktpeWxmUnRZb3pGU2lOREhDM2VMeGRTa2hZNVZhZ29qMW5iTDRPeAozbjNXb3kvK2IrdEtwb2JqeHFTUHJoU0RVNDI4ZWZrZG1pR0ZRdnNqWUd3c2d6WGRodGdEcTlKVk1SMVRBU1FNCkdlOFJUVC8weUY1d3pSc01ybEtKMi94VnZHWDZiTjI2cDRMRnY0Zm5TelgwUGEvVXllbjFqUUk3UlRHemZyazMKTXloTGIrM3NKbVAzTVVKRyt3TzR0a043dUhhcjhya09NWmFYVEs0MDI5MXV5aXNMYWIyamxjRHdPSFpEcXArYwpzSG1wR2dxVEtMSGMyWFc5UndBUnZrbG03Q0c1SDJBTXZTaHU4UGNENU50bi9CTDEyRmhWTHI5ZTBJcTFxQXJGCkFnTUJBQUdqSXpBaE1BNEdBMVVkRHdFQi93UUVBd0lDcERBUEJnTlZIUk1CQWY4RUJUQURBUUgvTUEwR0NTcUcKU0liM0RRRUJDd1VBQTRJQkFRQVZGT2FhenJwQTc4RjBDTm1yZHNQNUZyWmZKSEJKZ09kMFlHYTBmOW5qeDB4ZgpFb1ZSbW10ZytPOUdPUGVZTmFjdnpJY1pmakl2ZXZib3BHM0NaVGVVVTlVZFkrWXR3anpjUStSNzRKVmhkUDBECkFrUFpYUFJ6U21kcGM2T1lDWXhRY2NIRjdKL1NFdFQyRUJNdGJQK1puRmxjWmRRL3FRdVlqQ3VwSDlNL3FKTFEKQ0xtM3F0QkNxbnE4aDdVMWtpQkNWdG44Qmw4eUg3OUpJUHhwT3FBZERwd3NPVUVLVWJjejhwNXY5bEdVYU9pNQo5NXdPNytFMDd1WW5kM1pZNzFDMlVDRjBYU0sxTFk5SFA3OERHMm1YRTNmTkFnVDNqc1M5YmNPUEZpSXVmTDVlCkZUTi9lNW1nT1Rzd0NjSjJtb3BObXJ6bEZmWFNuUGVLRnpYcEZwSXcKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
    server: https://master.lab.example.com:8443
  name: master-lab-example-com:8443
contexts:
- context:
    cluster: master-lab-example-com:8443
    namespace: default
    user: system:admin/master-lab-example-com:8443
  name: default/master-lab-example-com:8443/system:admin
current-context: default/master-lab-example-com:8443/system:admin
kind: Config
preferences: {}
users:
- name: system:admin/master-lab-example-com:8443
  user:
RashaHaj commented 5 years ago

And in /var/log/messages I can constantly see this error:

Apr  1 18:19:24 master origin-node: W0401 18:19:24.820688    8398 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d

Indeed, /etc/cni/net.d is empty. Could this be the root cause? If so, any idea how to fix this issue?

RashaHaj commented 5 years ago

Would it matter if my networking is dynamic? I'm not able to configure it statically without losing the connection. Here's the content of /etc/sysconfig/network-scripts/ifcfg-eth0:

BOOTPROTO=dhcp
DEVICE=eth0
HWADDR=fa:16:3e:e8:e7:ad
ONBOOT=yes
TYPE=Ethernet
USERCTL=no
NM_CONTROLLED=yes
slaterx commented 5 years ago

Hi All,

I'm posting this solution here hoping that it will help future me (and others) while troubleshooting this. I've been struggling with this issue for almost a week on both OCP and OKD. I'm also posting it here because this ticket has, so far, the most complete list of troubleshooting steps for this issue.

The issue described here sometimes manifests as a TLS error (https://github.com/openshift/openshift-ansible/issues/11375#issuecomment-475866704), as Unable to connect to the server: Forbidden (https://github.com/openshift/openshift-ansible/issues/11444, https://github.com/openshift/openshift-ansible/issues/10606), and as an empty /etc/cni/net.d (https://github.com/openshift/openshift-ansible/issues/7967#issue-314580503, https://bugzilla.redhat.com/show_bug.cgi?id=1592010 and https://bugzilla.redhat.com/show_bug.cgi?id=1635257).

In my investigation I used CentOS 7.6 and RHEL 7.6 with OKD 3.11 and OCP 3.11.82 under Vagrant. To me, the fact that I was using a virtual machine with more than one working NIC has something to do with this issue. I am not sure how the whole orchestration works or what triggers what, but following this and this, this is the procedure I followed to overcome the issue:

If, after doing all that, you see that /etc/cni/net.d/80-openshift-network.conf is not created and you therefore hit any of the three issues above, create the file with the content below while the installer is waiting for control plane pods to appear, and restart the node service:

$ cat /etc/cni/net.d/80-openshift-network.conf
{
  "cniVersion": "0.2.0",
  "name": "openshift-sdn",
  "type": "openshift-sdn"
}

$ systemctl restart origin-node.service
# If OCP, then: systemctl restart atomic-openshift-node.service

Again, I don't understand why, but creating the file before the installer waits for control plane pods to appear has no effect. Also, I could see that sometimes the file vanishes after you restart the node service. If you recreate it and restart the service, the SDN reappears.

I could reproduce this fix on both OCP and OKD. In one of my attempts I bumped into this issue here, but then I restarted the server and ran deploy_cluster.yml again and it succeeded (like it did here).

RashaHaj commented 5 years ago

I confirm what you mentioned above. In fact, my problem was related to my NICs being configured dynamically (DHCP). When I changed them to a static configuration, the task executed correctly, but the playbook again ends with an error at this task:

TASK [openshift_service_catalog : Verify that the Catalog API Server is running] ********************************************************
Wednesday 10 April 2019  16:40:21 +0200 (0:00:01.220)       0:10:06.812 *******
FAILED - RETRYING: Verify that the Catalog API Server is running (60 retries left).
FAILED - RETRYING: Verify that the Catalog API Server is running (59 retries left).
FAILED - RETRYING: Verify that the Catalog API Server is running (58 retries left).

*
*
*
*

TASK [openshift_service_catalog : Report errors] ****************************************************************************************
Wednesday 10 April 2019  17:19:43 +0200 (0:00:00.158)       0:49:28.590 *******
fatal: [master.lab.example.com]: FAILED! => {"changed": false, "msg": "Catalog install failed."}

PLAY RECAP ******************************************************************************************************************************
localhost                  : ok=12   changed=0    unreachable=0    failed=0
master.lab.example.com     : ok=631  changed=144  unreachable=0    failed=1
node1.lab.example.com      : ok=107  changed=18   unreachable=0    failed=0
node2.lab.example.com      : ok=107  changed=18   unreachable=0    failed=0

INSTALLER STATUS ************************************************************************************************************************
Initialization               : Complete (0:00:24)
Health Check                 : Complete (0:00:05)
Node Bootstrap Preparation   : Complete (0:01:32)
etcd Install                 : Complete (0:00:26)
NFS Install                  : Complete (0:00:05)
Master Install               : Complete (0:03:34)
Master Additional Install    : Complete (0:00:28)
Node Join                    : Complete (0:00:21)
Hosted Install               : Complete (0:00:33)
Cluster Monitoring Operator  : Complete (0:00:09)
Web Console Install          : Complete (0:00:39)
Console Install              : Complete (0:00:16)
metrics-server Install       : Complete (0:00:01)
Service Catalog Install      : In Progress (0:40:31)
        This phase can be restarted by running: playbooks/openshift-service-catalog/config.yml

And here is the ifcfg-eth0 content, for the record (for others who may run into the same issue):

DEVICE="eth0"
BOOTPROTO="static"
ONBOOT="yes"
USERCTL="no"
TYPE="Ethernet"
DEFROUTE=yes
IPADDR=192.168.1.8
NETMASK=255.255.255.0
GATEWAY=192.168.1.254
DNS1=192.168.1.3
DNS2=10.171.30.4
DNS3=10.171.34.57
DOMAIN=lab.example.com
NM_CONTROLLED=yes

Going back to check /etc/cni/net.d/80-openshift-network.conf: it's present only on the master and still absent on the two nodes, and I'm not able to create it and keep it intact after restarting origin-node.service (knowing that I do this just after the task "waiting for control plane pods to appear"). Once origin-node is active, it always vanishes, no matter how many times I create it!

RashaHaj commented 5 years ago

If it helps, I'm posting the result of curl -vk https://apiserver.kube-service-catalog.svc/healthz:

* About to connect() to proxy devwatt-proxy.si.fr.intraorange port 8080 (#0)
*   Trying 10.107.39.50...
* Connected to devwatt-proxy.si.fr.intraorange (10.107.39.50) port 8080 (#0)
* Establish HTTP proxy tunnel to apiserver.kube-service-catalog.svc:443
> CONNECT apiserver.kube-service-catalog.svc:443 HTTP/1.1
> Host: apiserver.kube-service-catalog.svc:443
> User-Agent: curl/7.29.0
> Proxy-Connection: Keep-Alive
>
< HTTP/1.1 503 Service Unavailable
< Server: squid
< Mime-Version: 1.0
< Date: Wed, 10 Apr 2019 16:23:57 GMT
< Content-Type: text/html;charset=utf-8
< Content-Length: 3682
< X-Squid-Error: ERR_DNS_FAIL 0

I'm running on RHEL 7 platforms on an OpenStack cloud.
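
That trace shows curl going through the corporate proxy (squid), which cannot resolve the cluster-internal name apiserver.kube-service-catalog.svc, so the 503 comes from the proxy rather than from the service. A more meaningful test bypasses the proxy, e.g. (a sketch):

$ curl --noproxy '*' -vk https://apiserver.kube-service-catalog.svc/healthz
# or strip the proxy variables just for this command:
$ env -u http_proxy -u https_proxy curl -vk https://apiserver.kube-service-catalog.svc/healthz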

RashaHaj commented 5 years ago

Pushing troubleshooting further, I ran:

[root@node1 net.d]# docker ps -a | grep sdn
76bf15fc636d        09596cdd2baf                             "/bin/bash -c '#!/..."   12 minutes ago      Exited (255) 12 minutes ago                       k8s_sdn_sdn-xttbl_openshift-sdn_7968d1db-58a3-11e9-856a-fa163e8310de_1312
5b89b8f36765        09596cdd2baf                             "/bin/bash -c '#!/..."   4 days ago          Up 4 days                                         k8s_openvswitch_ovs-vt4j5_openshift-sdn_796467ad-58a3-11e9-856a-fa163e8310de_0
2200682497b9        docker.io/openshift/origin-pod:v3.11.0   "/usr/bin/pod"           4 days ago          Up 4 days                                         k8s_POD_ovs-vt4j5_openshift-sdn_796467ad-58a3-11e9-856a-fa163e8310de_0
a7b69e8f8453        docker.io/openshift/origin-pod:v3.11.0   "/usr/bin/pod"           4 days ago          Up 4 days                                         k8s_POD_sdn-xttbl_openshift-sdn_7968d1db-58a3-11e9-856a-fa163e8310de_0
[root@node1 net.d]# docker logs --tail 10 76bf15fc636d
which: no openshift-sdn in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)
I0411 09:19:06.636300   17541 start_network.go:200] Reading node configuration from /etc/origin/node/node-config.yaml
I0411 09:19:06.639654   17541 start_network.go:207] Starting node networking node1.lab.example.com (v3.11.0+9b1e777-164)
W0411 09:19:06.639917   17541 server.go:195] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP.
I0411 09:19:06.640003   17541 feature_gate.go:230] feature gates: &{map[]}
I0411 09:19:06.642242   17541 transport.go:160] Refreshing client certificate from store
I0411 09:19:06.642279   17541 certificate_store.go:131] Loading cert/key pair from "/etc/origin/node/certificates/kubelet-client-current.pem".
I0411 09:19:06.666074   17541 node.go:147] Initializing SDN node of type "redhat/openshift-ovs-subnet" with configured hostname "node1.lab.example.com" (IP ""), iptables sync period "30s"
I0411 09:19:06.668221   17541 node.go:289] Starting openshift-sdn network plugin
F0411 09:19:06.706565   17541 network.go:46] **SDN node startup failed: node SDN setup failed: net/ipv4/ip_forward=0, it must be set to 1**

It appeared that the SDN was hitting an error related to the ip_forward value, so, somewhat desperately, I grepped under /etc for that pattern to see which values were set, and turned them all to 1.

[root@node1 net.d]# grep -r ip_forward /etc
/etc/rc.d/init.d/network:    sysctl -w net.ipv4.ip_forward=0 > /dev/null 2>&1
/etc/sysctl.conf:net.ipv4.ip_forward = 0
/etc/sysctl.d/99-openshift.conf:net.ipv4.ip_forward=1
/etc/sysctl.d/99-g03r03c00.conf:net.ipv4.ip_forward = 0

And it works! Now I've got the file on all the nodes. But I'm still blocked on Verify that the Catalog API Server is running :'(
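
For the record, a sketch of making that change stick (note that with sysctl --system, /etc/sysctl.conf is applied last, so a 0 there overrides the OpenShift drop-in under /etc/sysctl.d):

$ sysctl -w net.ipv4.ip_forward=1        # apply immediately
$ sed -i 's|^net.ipv4.ip_forward *= *0|net.ipv4.ip_forward = 1|' /etc/sysctl.conf
# repeat for any other file under /etc/sysctl.d that still sets it to 0, then:
$ sysctl --system
$ sysctl net.ipv4.ip_forward             # should now report 1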

RashaHaj commented 5 years ago

Well, the Catalog API Server task is still failing, but I'm finally able to see my nodes :+1:

[root@master centos]# oc get nodes
NAME                     STATUS    ROLES          AGE       VERSION
master.lab.example.com   Ready     infra,master   8d        v1.11.0+d4cacc0
node1.lab.example.com    Ready     compute        8d        v1.11.0+d4cacc0
node2.lab.example.com    Ready     compute        8d        v1.11.0+d4cacc0
[root@master centos]# oc version
oc v3.11.0+62803d0-1
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://master.lab.example.com:8443
openshift v3.11.0+9b1e777-164
kubernetes v1.11.0+d4cacc0
[root@master centos]# oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-1-v72nt    1/1       Running   1          8d
registry-console-1-b2hc6   1/1       Running   1          8d
router-6-4cxrz             1/1       Running   0          3d

What I changed to fix that was removing the environment variables http_proxy and https_proxy (deleting them from /etc/profile and running unset http_proxy).
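
Roughly, that cleanup can be sketched as below; if a proxy is genuinely required, the openshift-ansible inventory variables for proxies (with a no_proxy list covering the cluster hostnames and service/cluster networks) are the cleaner route, check the docs for your release:

$ unset http_proxy https_proxy
$ sed -i '/http_proxy/Id;/https_proxy/Id' /etc/profile    # remove the exports
# log out/in (or re-source /etc/profile) and re-run the failed phase:
$ ansible-playbook -i /etc/ansible/hosts playbooks/openshift-service-catalog/config.yml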

nagonzalez commented 5 years ago

nicely done!

RashaHaj commented 5 years ago

thanks for your help @nagonzalez and @slaterx :)