RashaHaj closed this issue 5 years ago
First thing that sticks out is that you're on Ansible 2.7.x, which is unsupported.
I'd downgrade to 2.6.5+ and then give it another shot.
I downgraded Ansible to 2.6.5, but the problem doesn't seem to be resolved :(
FAILED - RETRYING: Wait for control plane pods to appear (2 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (1 retries left).
failed: [master.lab.example.com] (item=etcd) => {"attempts": 60, "changed": false, "item": "etcd", "msg": {"cmd": "/usr/bin/oc get pod master-etcd-master.lab.example.com -o json -n kube-system", "results": [{}], "returncode": 1, "stderr": "The connection to the server master.lab.example.com:8443 was refused - did you specify the right host or port?\n", "stdout": ""}}
FAILED - RETRYING: Wait for control plane pods to appear (60 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (59 retries left).
Per your netstat output, it doesn't appear that you're running a master container on 8443.
If you run docker ps -a on the master, you should see the IDs of the failed containers. What's the log output of the failed containers?
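That inspection can be scripted; here's a small sketch (my own helper, not from the thread) that filters docker ps -a output for exited containers so their logs can be pulled one by one:

```shell
# Hypothetical helper: print the IDs of exited containers from
# `docker ps -a` output (column 1 is the container ID; exited containers
# carry a status like "Exited (1) 21 seconds ago").
exited_ids() {
  awk '/Exited \(/ {print $1}'
}

# On the master (requires docker), one could then run:
# docker ps -a | exited_ids | xargs -r -n1 docker logs --tail 20
```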
[root@master centos]# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b697460b27a7 ff5dd2137a4f "/bin/sh -c '#!/bi..." 23 seconds ago Exited (1) 21 seconds ago k8s_etcd_master-etcd-master.lab.example.com_kube-system_e5014392f56ecd6362ebb2005a64946c_1157
6be33fb802a0 01b05abc0861 "/bin/bash -c '#!/..." About a minute ago Exited (255) 42 seconds ago k8s_api_master-api-master.lab.example.com_kube-system_b24b15710309f0062b93e07af49cb464_1056
8f2a6a8863a7 01b05abc0861 "/bin/bash -c '#!/..." 22 hours ago Up 22 hours k8s_controllers_master-controllers-master.lab.example.com_kube-system_3210f8756194a0d2e374db4a71b81896_5
Nothing seems incorrect with the Exited containers. It's the last Up container (k8s_controllers_master-controllers-master.lab.example.com_kube-system_3210f8756194a0d2e374db4a71b81896_5) that always shows the error below:
[root@master centos]# docker logs --tail 5 8f2a6a8863a7
E0322 21:05:23.098604 1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Node: Get https://master.lab.example.com:8443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 192.168.1.5:8443: connect: connection refused
E0322 21:05:23.099550 1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.StorageClass: Get https://master.lab.example.com:8443/apis/storage.k8s.io/v1/storageclasses?limit=500&resourceVersion=0: dial tcp 192.168.1.5:8443: connect: connection refused
E0322 21:05:23.100544 1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Service: Get https://master.lab.example.com:8443/api/v1/services?limit=500&resourceVersion=0: dial tcp 192.168.1.5:8443: connect: connection refused
E0322 21:05:23.101604 1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.PersistentVolume: Get https://master.lab.example.com:8443/api/v1/persistentvolumes?limit=500&resourceVersion=0: dial tcp 192.168.1.5:8443: connect: connection refused
E0322 21:05:23.102789 1 reflector.go:136] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:176: Failed to list *v1.Pod: Get https://master.lab.example.com:8443/api/v1/pods?fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&limit=500&resourceVersion=0: dial tcp 192.168.1.5:8443: connect: connection refused
E0322 21:05:23.632365 1 leaderelection.go:234] error retrieving resource lock kube-system/kube-controller-manager: Get https://master.lab.example.com:8443/api/v1/namespaces/kube-system/configmaps/kube-controller-manager: dial tcp 192.168.1.5:8443: connect: connection refused
E0322 21:05:24.095060 1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta1.ReplicaSet: Get https://master.lab.example.com:8443/apis/extensions/v1beta1/replicasets?limit=500&resourceVersion=0: dial tcp 192.168.1.5:8443: connect: connection refused
E0322 21:05:24.095865 1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta1.PodDisruptionBudget: Get https://master.lab.example.com:8443/apis/policy/v1beta1/poddisruptionbudgets?limit=500&resourceVersion=0: dial tcp 192.168.1.5:8443: connect: connection refused
E0322 21:05:24.096860 1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta1.StatefulSet: Get https://master.lab.example.com:8443/apis/apps/v1beta1/statefulsets?limit=500&resourceVersion=0: dial tcp 192.168.1.5:8443: connect: connection refused
E0322 21:05:24.097924 1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.PersistentVolumeClaim: Get https://master.lab.example.com:8443/api/v1/persistentvolumeclaims?limit=500&resourceVersion=0: dial tcp 192.168.1.5:844
I thought to check etcd, and it appears it isn't running properly:
[root@master centos]# systemctl status etcd
● etcd.service - Etcd Server
Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2019-03-22 17:27:38 CET; 19h ago
Main PID: 34853 (etcd)
Memory: 12.7M
CGroup: /system.slice/etcd.service
└─34853 /usr/bin/etcd --name=master.lab.example.com --data-dir=/var/lib/etcd/default.etcd --listen-client-urls=https://192....
Mar 23 13:02:29 master.lab.example.com etcd[34853]: rejected connection from "192.168.1.5:64902" (error "remote error: tls: bad ....com")
Mar 23 13:02:59 master.lab.example.com etcd[34853]: rejected connection from "192.168.1.5:49348" (error "EOF", ServerName "")
Mar 23 13:08:03 master.lab.example.com etcd[34853]: rejected connection from "192.168.1.5:57716" (error "remote error: tls: bad ....com")
Mar 23 13:09:53 master.lab.example.com etcd[34853]: rejected connection from "192.168.1.5:60830" (error "remote error: tls: bad ....com")
[root@master centos]# journalctl -u etcd --since=today
Mar 23 13:10:39 master.lab.example.com etcd[34853]: rejected connection from "192.168.1.5:62190" (error "remote error: tls: bad certificate", ServerName "master.lab.example.com")
Mar 23 13:11:44 master.lab.example.com etcd[34853]: rejected connection from "192.168.1.5:64102" (error "remote error: tls: bad certificate", ServerName "master.lab.example.com")
Mar 23 13:12:58 master.lab.example.com etcd[34853]: rejected connection from "192.168.1.5:49896" (error "remote error: tls: bad certificate", ServerName "master.lab.example.com")
Mar 23 13:14:55 master.lab.example.com etcd[34853]: rejected connection from "192.168.1.5:53326" (error "remote error: tls: bad certificate", ServerName "master.lab.example.com")
Mar 23 13:18:06 master.lab.example.com etcd[34853]: rejected connection from "192.168.1.5:58666" (error "remote error: tls: bad certificate", ServerName "master.lab.example.com")
A certificate problem? How do I fix it?
Here is the /etc/etcd/etcd.conf file content :
ETCD_NAME=master.lab.example.com
ETCD_LISTEN_PEER_URLS=https://192.168.1.5:2380
ETCD_DATA_DIR=/var/lib/etcd/default.etcd
#ETCD_WAL_DIR=
#ETCD_SNAPSHOT_COUNT=10000
ETCD_HEARTBEAT_INTERVAL=500
ETCD_ELECTION_TIMEOUT=2500
ETCD_LISTEN_CLIENT_URLS=https://192.168.1.5:2379
#ETCD_MAX_SNAPSHOTS=5
#ETCD_MAX_WALS=5
#ETCD_CORS=
#[cluster]
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://192.168.1.5:2380
ETCD_INITIAL_CLUSTER=master.lab.example.com=https://192.168.1.5:2380
ETCD_INITIAL_CLUSTER_STATE=new
ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-1
#ETCD_DISCOVERY=
#ETCD_DISCOVERY_SRV=
#ETCD_DISCOVERY_FALLBACK=proxy
#ETCD_DISCOVERY_PROXY=
ETCD_ADVERTISE_CLIENT_URLS=https://192.168.1.5:2379
#ETCD_STRICT_RECONFIG_CHECK=false
#ETCD_AUTO_COMPACTION_RETENTION=0
#ETCD_ENABLE_V2=true
ETCD_QUOTA_BACKEND_BYTES=4294967296
#[proxy]
#ETCD_PROXY=off
#ETCD_PROXY_FAILURE_WAIT=5000
#ETCD_PROXY_REFRESH_INTERVAL=30000
#ETCD_PROXY_DIAL_TIMEOUT=1000
#ETCD_PROXY_WRITE_TIMEOUT=5000
#ETCD_PROXY_READ_TIMEOUT=0
#[security]
ETCD_TRUSTED_CA_FILE=/etc/etcd/ca.crt
ETCD_CLIENT_CERT_AUTH=true
ETCD_CERT_FILE=/etc/etcd/server.crt
ETCD_KEY_FILE=/etc/etcd/server.key
#ETCD_AUTO_TLS=false
ETCD_PEER_TRUSTED_CA_FILE=/etc/etcd/ca.crt
ETCD_PEER_CLIENT_CERT_AUTH=true
ETCD_PEER_CERT_FILE=/etc/etcd/peer.crt
ETCD_PEER_KEY_FILE=/etc/etcd/peer.key
#ETCD_PEER_AUTO_TLS=false
#[logging]
ETCD_DEBUG=False
#[profiling]
#ETCD_ENABLE_PPROF=false
#ETCD_METRICS=basic
#
#[auth]
#ETCD_AUTH_TOKEN=simple
Does anyone know what the problem with the TLS certificate is, please? Or can you give me a line of research? :(
Your etcd config file looks exactly like my working one, except for this line:
ETCD_DATA_DIR=/var/lib/etcd/
but that probably doesn't have anything to do with it.
Last thing I'd suggest is to verify the Subject Name of your etcd certs:
openssl x509 -in /etc/etcd/peer.crt -text -noout
openssl x509 -in /etc/etcd/server.crt -text -noout
They should match your host's FQDN.
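That check can be scripted; a sketch (my helper, assuming the cert paths quoted in this thread) that looks for the expected name in both the Subject CN and the SAN entries:

```shell
# Hypothetical helper: does the cert's Subject CN or SAN contain the name?
check_cert_name() {
  crt=$1; name=$2
  openssl x509 -in "$crt" -noout -text 2>/dev/null \
    | grep -Eq "CN ?= ?$name|DNS:$name" \
    && echo "OK: $crt matches $name" \
    || echo "MISMATCH: $crt does not contain $name"
}

# On the master, one would run:
# check_cert_name /etc/etcd/server.crt "$(hostname -f)"
# check_cert_name /etc/etcd/peer.crt   "$(hostname -f)"
```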
When running openssl x509 -in /etc/etcd/peer.crt -text -noout and openssl x509 -in /etc/etcd/server.crt -text -noout, the certificates appear to be incorrect:
Subject: CN=192.168.1.5
IP Address:192.168.1.5, DNS:192.168.1.5
I replaced it with this:
Subject: CN=master.lab.example.com
IP Address:192.168.1.5, DNS:master.lab.example.com
I replaced it everywhere the mismatch appeared (server.crt, peer.crt, and under generated_certs... in origin/master/master.etcd-client.crt) and restarted etcd, but I still get the same message.
I even relaunched the playbook prerequisites.yml and then deploy_cluster.yml; however, the output of
openssl x509 -in /etc/etcd/peer.crt -text -noout and openssl x509 -in /etc/etcd/server.crt -text -noout
is always the same.
How do I update the certificates, knowing that I already tried update-ca-trust enable and update-ca-trust extract?
That would explain the error.
What does your inventory file look like?
cat /etc/ansible/hosts
[OSEv3:children]
masters
nodes
etcd
nfs

[OSEv3:vars]
ansible_ssh_user=root
openshift_deployment_type=origin
openshift_clock_enabled=true
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}]
openshift_master_htpasswd_users={'admin': '$apr1$RbOvaj8r$LEqJqG6V/O/i7Pfyyyyyy.', 'user': '$apr1$MfsFK97I$enQjqHCh2LL8w4EBwNrrrr'}
openshift_public_hostname=master.lab.example.com
openshift_master_default_subdomain=cloudapps.lab.example.com
openshift_disable_check=memory_availability,disk_availability,docker_storage

[masters]
master.lab.example.com containerized=false

[nodes]
master.lab.example.com openshift_schedulable=true openshift_node_group_name='node-config-master-infra'
node1.lab.example.com openshift_node_group_name='node-config-compute'
node2.lab.example.com openshift_node_group_name='node-config-compute'

[etcd]
master.lab.example.com

[nfs]
master.lab.example.com
I started the installation from scratch, and this time the CN and DNS are set correctly. OpenShift is now listening on port 8443, but the playbook fails at the task:
TASK [Approve node certificates when bootstrapping] *************************************************************************************
FAILED - RETRYING: Approve node certificates when bootstrapping (30 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (29 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (28 retries left).
Failure summary:
1. Hosts: master.lab.example.com
Play: Approve any pending CSR requests from inventory nodes
Task: Approve node certificates when bootstrapping
Message: Could not find csr for nodes: node2.lab.example.com, node1.lab.example.com
The hostname is correct on the master and the two nodes, and the SSH connection between master and nodes works just fine.
awesome. we're getting closer :)
Log into your master and run oc get nodes. What do you see?
this is what I get on mine:
oc get nodes
NAME STATUS ROLES AGE VERSION
ocmaster1 Ready master 5d v1.11.0+d4cacc0
ocmaster2 Ready master 5d v1.11.0+d4cacc0
ocmaster3 Ready master 5d v1.11.0+d4cacc0
ocnode1 Ready compute,infra 5d v1.11.0+d4cacc0
ocnode2 Ready compute,infra 5d v1.11.0+d4cacc0
ocnode3 Ready compute,infra 5d v1.11.0+d4cacc0
[root@master centos]# oc get nodes
Unable to connect to the server: Forbidden
Pushing the investigation further, I notice something strange happening with etcd:
[root@master system]# systemctl status etcd
Unit etcd.service could not be found.
I can't find the unit file under /usr/lib/systemd/system or /etc/systemd/system/multi-user.target.wants/; the package seems to have disappeared. With the first installation, I had:
[root@master ~]# rpm -qa | grep etcd
etcd-3.2.22-1.el7.x86_64
However, when executing netstat, I still see an etcd process listening on 2379 and 2380. I don't know how this could happen :(
[root@master centos]# netstat -ntlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:10250 0.0.0.0:* LISTEN 27542/hyperkube
tcp 0 0 192.168.1.7:2379 0.0.0.0:* LISTEN 102269/etcd
tcp 0 0 192.168.1.7:2380 0.0.0.0:* LISTEN 102269/etcd
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 1/systemd
tcp 0 0 0.0.0.0:20048 0.0.0.0:* LISTEN 1598/rpc.mountd
tcp 0 0 0.0.0.0:8053 0.0.0.0:* LISTEN 11048/openshift
tcp 0 0 192.168.1.7:53 0.0.0.0:* LISTEN 19624/dnsmasq
tcp 0 0 172.17.0.1:53 0.0.0.0:* LISTEN 19624/dnsmasq
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 3646/sshd
tcp 0 0 0.0.0.0:8443 0.0.0.0:* LISTEN 11048/openshift
tcp 0 0 0.0.0.0:8444 0.0.0.0:* LISTEN 11071/openshift
tcp 0 0 0.0.0.0:63422 0.0.0.0:* LISTEN 1568/rpc.statd
tcp 0 0 0.0.0.0:2049 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:57029 0.0.0.0:* LISTEN 27542/hyperkube
After several attempts, the playbook now completes. But I'm still not able to see the nodes:
TASK [Set Master install 'Complete'] ****************************************************************************************************
ok: [master.lab.example.com]
PLAY RECAP ******************************************************************************************************************************
localhost : ok=12 changed=0 unreachable=0 failed=0
master.lab.example.com : ok=300 changed=63 unreachable=0 failed=0
node1.lab.example.com : ok=16 changed=0 unreachable=0 failed=0
node2.lab.example.com : ok=16 changed=0 unreachable=0 failed=0
INSTALLER STATUS ************************************************************************************************************************
Initialization : Complete (0:00:34)
Master Install : Complete (0:03:35)
[root@master playbooks]# oc get nodes
Unable to connect to the server: Forbidden
Still, oc appears to be installed!
[root@master openshift-master]# oc version
oc v3.11.0+62803d0-1
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO
Unable to connect to the server: Forbidden
[root@master openshift-master]# openshift version
openshift v3.11.0+62803d0-1
[root@master openshift-master]# kubectl version
Client Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.0+d4cacc0", GitCommit:"d4cacc0", GitTreeState:"clean", BuildDate:"2018-10-15T09:45:30Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
Unable to connect to the server: Forbidden
wow, that's really weird given you're running the master on port 8443
Take a look at your ~/.kube/config file. What's the cluster value for the system:admin user?
apiVersion: v1
clusters:
- cluster:
certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUM2akNDQWRLZ0F3SUJBZ0lCQVRBTkJna3Foa2lHOXcwQkFRc0ZBREFtTVNRd0lnWURWUVFEREJ0dmNHVnUKYzJocFpuUXRjMmxuYm1WeVFERTFOVFF3TXpreE56a3dIaGNOTVRrd016TXhNVE16TWpVNVdoY05NalF3TXpJNQpNVE16TXpBd1dqQW1NU1F3SWdZRFZRUUREQnR2Y0dWdWMyaHBablF0YzJsbmJtVnlRREUxTlRRd016a3hOemt3CmdnRWlNQTBHQ1NxR1NJYjNEUUVCQVFVQUE0SUJEd0F3Z2dFS0FvSUJBUURPYi90WGtzaFRpQ3VWd3NNOXNnSGoKK1I0MkZHNlJ2SVEwM2RQLzExdWNoTEVVRktpeWxmUnRZb3pGU2lOREhDM2VMeGRTa2hZNVZhZ29qMW5iTDRPeAozbjNXb3kvK2IrdEtwb2JqeHFTUHJoU0RVNDI4ZWZrZG1pR0ZRdnNqWUd3c2d6WGRodGdEcTlKVk1SMVRBU1FNCkdlOFJUVC8weUY1d3pSc01ybEtKMi94VnZHWDZiTjI2cDRMRnY0Zm5TelgwUGEvVXllbjFqUUk3UlRHemZyazMKTXloTGIrM3NKbVAzTVVKRyt3TzR0a043dUhhcjhya09NWmFYVEs0MDI5MXV5aXNMYWIyamxjRHdPSFpEcXArYwpzSG1wR2dxVEtMSGMyWFc5UndBUnZrbG03Q0c1SDJBTXZTaHU4UGNENU50bi9CTDEyRmhWTHI5ZTBJcTFxQXJGCkFnTUJBQUdqSXpBaE1BNEdBMVVkRHdFQi93UUVBd0lDcERBUEJnTlZIUk1CQWY4RUJUQURBUUgvTUEwR0NTcUcKU0liM0RRRUJDd1VBQTRJQkFRQVZGT2FhenJwQTc4RjBDTm1yZHNQNUZyWmZKSEJKZ09kMFlHYTBmOW5qeDB4ZgpFb1ZSbW10ZytPOUdPUGVZTmFjdnpJY1pmakl2ZXZib3BHM0NaVGVVVTlVZFkrWXR3anpjUStSNzRKVmhkUDBECkFrUFpYUFJ6U21kcGM2T1lDWXhRY2NIRjdKL1NFdFQyRUJNdGJQK1puRmxjWmRRL3FRdVlqQ3VwSDlNL3FKTFEKQ0xtM3F0QkNxbnE4aDdVMWtpQkNWdG44Qmw4eUg3OUpJUHhwT3FBZERwd3NPVUVLVWJjejhwNXY5bEdVYU9pNQo5NXdPNytFMDd1WW5kM1pZNzFDMlVDRjBYU0sxTFk5SFA3OERHMm1YRTNmTkFnVDNqc1M5YmNPUEZpSXVmTDVlCkZUTi9lNW1nT1Rzd0NjSjJtb3BObXJ6bEZmWFNuUGVLRnpYcEZwSXcKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
server: https://master.lab.example.com:8443
name: master-lab-example-com:8443
contexts:
- context:
cluster: master-lab-example-com:8443
namespace: default
user: system:admin/master-lab-example-com:8443
name: default/master-lab-example-com:8443/system:admin
current-context: default/master-lab-example-com:8443/system:admin
kind: Config
preferences: {}
users:
- name: system:admin/master-lab-example-com:8443
user:
And in /var/log/messages I constantly see this error:
Apr 1 18:19:24 master origin-node: W0401 18:19:24.820688 8398 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d
Indeed, /etc/cni/net.d is empty. Could this be the root cause? If so, any idea how to fix it?
Would it matter if my networking is dynamic? I'm not able to configure it statically without losing the connection. Here's the content of /etc/sysconfig/network-scripts/ifcfg-eth0:
BOOTPROTO=dhcp
DEVICE=eth0
HWADDR=fa:16:3e:e8:e7:ad
ONBOOT=yes
TYPE=Ethernet
USERCTL=no
NM_CONTROLLED=yes
Hi All,
I'm posting this solution here hoping it will help future me (and others) while troubleshooting this. I've been struggling with this issue for almost a week on both OCP and OKD. I'm also posting it here because this ticket has, so far, the most complete list of troubleshooting steps for this problem.
The issue described here sometimes manifests as a TLS error
(https://github.com/openshift/openshift-ansible/issues/11375#issuecomment-475866704), as Unable to connect to the server: forbidden
(https://github.com/openshift/openshift-ansible/issues/11444, https://github.com/openshift/openshift-ansible/issues/10606), and as an empty /etc/cni/net.d
(https://github.com/openshift/openshift-ansible/issues/7967#issue-314580503, https://bugzilla.redhat.com/show_bug.cgi?id=1592010 and https://bugzilla.redhat.com/show_bug.cgi?id=1635257).
In my investigation I used CentOS 7.6 and RHEL 7.6, with OKD 3.11 and OCP 3.11.82, under vagrant. To me, the fact that I was using a virtual machine with more than one working NIC has something to do with this issue. I am not sure how the whole orchestration works or what triggers what, but following this and this, this is the procedure I followed to overcome the issue:
$ cat /etc/dnsmasq.d/foo.example.com.conf
address=/foo.example.com/192.168.1.30
$ ip route delete default
$ ip route add default via <correct-gateway-ip>
If, after doing all that, you see that /etc/cni/net.d/80-openshift-network.conf
is not created and you therefore hit any of the three issues above, create the file with the content below while you're waiting for the control plane pods to appear, then restart the node service:
$ cat /etc/cni/net.d/80-openshift-network.conf
{
"cniVersion": "0.2.0",
"name": "openshift-sdn",
"type": "openshift-sdn"
}
$ systemctl restart origin-node.service
# If OCP, then: systemctl restart atomic-openshift-node.service
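The workaround above can be wrapped in a small function (my sketch; the directory is parameterized only so it can be rehearsed outside /etc):

```shell
# Hypothetical wrapper around the workaround above: writes the SDN CNI
# config. On a real node the default path applies and root is required.
write_cni_conf() {
  dir=${1:-/etc/cni/net.d}
  mkdir -p "$dir"
  cat > "$dir/80-openshift-network.conf" <<'EOF'
{
  "cniVersion": "0.2.0",
  "name": "openshift-sdn",
  "type": "openshift-sdn"
}
EOF
}

# On an affected node (as root), one would run:
# write_cni_conf && systemctl restart origin-node.service
# (atomic-openshift-node.service on OCP)
```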
Again, I don't understand why, but creating the file before the "wait for control plane pods to appear" step has no effect. Also, I noticed that sometimes the file vanishes after you restart the node service; if you recreate it and restart the service again, the SDN reappears.
I could reproduce this fix on both OCP and OKD. In one of my attempts I bumped into this issue here, but then I restarted the server, ran deploy_cluster.yml again, and it succeeded (like it did here).
I can confirm what you mentioned above. In fact, my problem was related to my NICs being dynamically configured (DHCP). When I changed them to a static configuration, the task executed correctly, but the playbook again ends with an error at this task:
TASK [openshift_service_catalog : Verify that the Catalog API Server is running] ********************************************************
Wednesday 10 April 2019 16:40:21 +0200 (0:00:01.220) 0:10:06.812 *******
FAILED - RETRYING: Verify that the Catalog API Server is running (60 retries left).
FAILED - RETRYING: Verify that the Catalog API Server is running (59 retries left).
FAILED - RETRYING: Verify that the Catalog API Server is running (58 retries left).
*
*
*
*
TASK [openshift_service_catalog : Report errors] ****************************************************************************************
Wednesday 10 April 2019 17:19:43 +0200 (0:00:00.158) 0:49:28.590 *******
fatal: [master.lab.example.com]: FAILED! => {"changed": false, "msg": "Catalog install failed."}
PLAY RECAP ******************************************************************************************************************************
localhost : ok=12 changed=0 unreachable=0 failed=0
master.lab.example.com : ok=631 changed=144 unreachable=0 failed=1
node1.lab.example.com : ok=107 changed=18 unreachable=0 failed=0
node2.lab.example.com : ok=107 changed=18 unreachable=0 failed=0
INSTALLER STATUS ************************************************************************************************************************
Initialization : Complete (0:00:24)
Health Check : Complete (0:00:05)
Node Bootstrap Preparation : Complete (0:01:32)
etcd Install : Complete (0:00:26)
NFS Install : Complete (0:00:05)
Master Install : Complete (0:03:34)
Master Additional Install : Complete (0:00:28)
Node Join : Complete (0:00:21)
Hosted Install : Complete (0:00:33)
Cluster Monitoring Operator : Complete (0:00:09)
Web Console Install : Complete (0:00:39)
Console Install : Complete (0:00:16)
metrics-server Install : Complete (0:00:01)
Service Catalog Install : In Progress (0:40:31)
This phase can be restarted by running: playbooks/openshift-service-catalog/config.yml
And here is the ifcfg-eth0 content for the record (for others who may run into the same issue):
DEVICE="eth0"
BOOTPROTO="static"
ONBOOT="yes"
USERCTL="no"
TYPE="Ethernet"
DEFROUTE=yes
IPADDR=192.168.1.8
NETMASK=255.255.255.0
GATEWAY=192.168.1.254
DNS1=192.168.1.3
DNS2=10.171.30.4
DNS3=10.171.34.57
DOMAIN=lab.example.com
NM_CONTROLLED=yes
Going back to check /etc/cni/net.d/80-openshift-network.conf: it's present only on the master and still absent on the two nodes, and I'm not able to create it and have it survive a restart of origin-node.service (and I do this right after the "waiting for control plane pods to appear" task). Once origin-node is active, the file always vanishes, no matter how many times you create it!
If it helps, I'm posting result of
curl -vk https://apiserver.kube-service-catalog.svc/healthz
* About to connect() to proxy devwatt-proxy.si.fr.intraorange port 8080 (#0)
* Trying 10.107.39.50...
* Connected to devwatt-proxy.si.fr.intraorange (10.107.39.50) port 8080 (#0)
* Establish HTTP proxy tunnel to apiserver.kube-service-catalog.svc:443
> CONNECT apiserver.kube-service-catalog.svc:443 HTTP/1.1
> Host: apiserver.kube-service-catalog.svc:443
> User-Agent: curl/7.29.0
> Proxy-Connection: Keep-Alive
>
< HTTP/1.1 503 Service Unavailable
< Server: squid
< Mime-Version: 1.0
< Date: Wed, 10 Apr 2019 16:23:57 GMT
< Content-Type: text/html;charset=utf-8
< Content-Length: 3682
< X-Squid-Error: ERR_DNS_FAIL 0
I'm running on RHEL 7 platforms on an OpenStack cloud.
Pushing the troubleshooting further, I ran:
[root@node1 net.d]# docker ps -a | grep sdn
76bf15fc636d 09596cdd2baf "/bin/bash -c '#!/..." 12 minutes ago Exited (255) 12 minutes ago k8s_sdn_sdn-xttbl_openshift-sdn_7968d1db-58a3-11e9-856a-fa163e8310de_1312
5b89b8f36765 09596cdd2baf "/bin/bash -c '#!/..." 4 days ago Up 4 days k8s_openvswitch_ovs-vt4j5_openshift-sdn_796467ad-58a3-11e9-856a-fa163e8310de_0
2200682497b9 docker.io/openshift/origin-pod:v3.11.0 "/usr/bin/pod" 4 days ago Up 4 days k8s_POD_ovs-vt4j5_openshift-sdn_796467ad-58a3-11e9-856a-fa163e8310de_0
a7b69e8f8453 docker.io/openshift/origin-pod:v3.11.0 "/usr/bin/pod" 4 days ago Up 4 days k8s_POD_sdn-xttbl_openshift-sdn_7968d1db-58a3-11e9-856a-fa163e8310de_0
[root@node1 net.d]# docker logs --tail 10 76bf15fc636d
which: no openshift-sdn in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)
I0411 09:19:06.636300 17541 start_network.go:200] Reading node configuration from /etc/origin/node/node-config.yaml
I0411 09:19:06.639654 17541 start_network.go:207] Starting node networking node1.lab.example.com (v3.11.0+9b1e777-164)
W0411 09:19:06.639917 17541 server.go:195] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP.
I0411 09:19:06.640003 17541 feature_gate.go:230] feature gates: &{map[]}
I0411 09:19:06.642242 17541 transport.go:160] Refreshing client certificate from store
I0411 09:19:06.642279 17541 certificate_store.go:131] Loading cert/key pair from "/etc/origin/node/certificates/kubelet-client-current.pem".
I0411 09:19:06.666074 17541 node.go:147] Initializing SDN node of type "redhat/openshift-ovs-subnet" with configured hostname "node1.lab.example.com" (IP ""), iptables sync period "30s"
I0411 09:19:06.668221 17541 node.go:289] Starting openshift-sdn network plugin
F0411 09:19:06.706565 17541 network.go:46] **SDN node startup failed: node SDN setup failed: net/ipv4/ip_forward=0, it must be set to 1**
It appeared that the SDN was hitting an error related to the ip_forward value, so, desperately, I grepped under /etc for that pattern to see which values it was set to, and turned them all to 1.
[root@node1 net.d]# grep -r ip_forward /etc
/etc/rc.d/init.d/network: sysctl -w net.ipv4.ip_forward=0 > /dev/null 2>&1
/etc/sysctl.conf:net.ipv4.ip_forward = 0
/etc/sysctl.d/99-openshift.conf:net.ipv4.ip_forward=1
/etc/sysctl.d/99-g03r03c00.conf:net.ipv4.ip_forward = 0
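The grep above can be turned into a reusable check (my sketch) that lists only the files still pinning forwarding off, so you know exactly what to edit before re-enabling forwarding:

```shell
# Hypothetical helper: list files under the given paths that still set
# net.ipv4.ip_forward to 0 (matches both "=0" and "= 0" spellings).
find_forward_off() {
  grep -rlE 'net\.ipv4\.ip_forward *= *0' "$@" 2>/dev/null
}

# On a node, one would run (then fix the hits and re-enable forwarding):
# find_forward_off /etc/sysctl.conf /etc/sysctl.d
# sysctl -w net.ipv4.ip_forward=1
```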
And it works! Now I've got the file on all the nodes. But I'm still blocked on Verify that the Catalog API Server is running :'(
Well, the Catalog API Server is still failing, but I'm finally able to see my nodes :+1:
[root@master centos]# oc get nodes
NAME STATUS ROLES AGE VERSION
master.lab.example.com Ready infra,master 8d v1.11.0+d4cacc0
node1.lab.example.com Ready compute 8d v1.11.0+d4cacc0
node2.lab.example.com Ready compute 8d v1.11.0+d4cacc0
[root@master centos]# oc version
oc v3.11.0+62803d0-1
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://master.lab.example.com:8443
openshift v3.11.0+9b1e777-164
kubernetes v1.11.0+d4cacc0
[root@master centos]# oc get pods
NAME READY STATUS RESTARTS AGE
docker-registry-1-v72nt 1/1 Running 1 8d
registry-console-1-b2hc6 1/1 Running 1 8d
router-6-4cxrz 1/1 Running 0 3d
What I changed to fix that: I removed the http_proxy and https_proxy environment variables (deleted them from /etc/profile and ran unset http_proxy).
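An alternative to removing the proxy variables outright (my assumption, not what the thread actually did) is to exempt cluster-internal names via no_proxy. The suffix matching that curl and most tools apply looks roughly like this:

```shell
# Rough model of no_proxy matching (simplified; real clients differ in
# details): host $1 bypasses the proxy when it ends with any
# comma-separated suffix in list $2.
in_no_proxy() {
  host=$1
  for suf in $(printf '%s' "$2" | tr ',' ' '); do
    case $host in
      *"$suf") return 0 ;;
    esac
  done
  return 1
}

# So instead of unsetting http_proxy entirely, one could try:
# export no_proxy=".svc,.cluster.local,.lab.example.com"
# curl -vk https://apiserver.kube-service-catalog.svc/healthz
```

curl also supports a per-call bypass with curl --noproxy '*' ... for quick testing.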
nicely done!
thanks for your help @nagonzalez and @slaterx :)
Description
Hi, I'm trying to install an OpenShift v3.11 cluster on OpenStack using openshift-ansible. However, the playbook deploy_cluster.yml encounters the error below:
Version
docker version: Version 1.13.1, API version 1.26, package docker-1.13.1-91.git07f3374.el7.centos.x86_64
ansible --version: ansible 2.7.8
rpm -qa | grep openshift:
openshift-ansible-roles-3.11.37-1.git.0.3b8b341.el7.noarch
openshift-ansible-3.11.37-1.git.0.3b8b341.el7.noarch
centos-release-openshift-origin311-1-2.el7.centos.noarch
openshift-ansible-playbooks-3.11.37-1.git.0.3b8b341.el7.noarch
openshift-ansible-docs-3.11.37-1.git.0.3b8b341.el7.noarch
git describe: openshift-ansible-3.11.90-1-12-g1ea6332
Steps To Reproduce
Expected Results
The cluster to be deployed
Example command and output or error messages
Additional Information
Any idea how to fix this, please? Thanks!