openshift / installer

Install an OpenShift 4.x cluster
https://try.openshift.com
Apache License 2.0

OKD 4.5 bootstrap install - Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods: dial tcp [::1]:6443: connect: connection refused #3994

Closed RuiGarciaSantos closed 3 years ago

RuiGarciaSantos commented 4 years ago

Hi, I am trying to follow the OKD 4.5 home lab setup shared in "Building an OKD 4 Home Lab with special guest Craig Robinson" with some changes, and I am hitting a point where the install gets stuck.

1) I am using VMware Workstation 15, also with 2 network adapters: one bridged and the other on the created "OKD" LAN segment.
2) My WAN is 192.168.1.x/24 and my LAN is 10.10.10.x/24; I have made all changes accordingly.

Apart from these 2 changes, the guide has been followed exactly.

Version

```
[admin@OKD4-SERVICES ~]$ openshift-install version
openshift-install 4.5.0-0.okd-2020-07-29-070316
built from commit 699277bb61706731d687b9e40700ebf4630b0851
release image quay.io/openshift/okd@sha256:6565b6eb19a82f4c9230641286c27f003625b79984ed8e733b011c72790a5eb3
```

Platform:

VMware Workstation 15

What happened?

```
[admin@OKD4-SERVICES ~]$ openshift-install --dir=install_dir/ wait-for bootstrap-complete --log-level=info
INFO Waiting up to 20m0s for the Kubernetes API at https://api.lab.okd.local:6443...
INFO API v1.18.3 up
INFO Waiting up to 40m0s for bootstrapping to complete...
INFO Use the following commands to gather logs from the cluster
INFO openshift-install gather bootstrap --help
FATAL failed to wait for bootstrapping to complete: timed out waiting for the condition
[admin@OKD4-SERVICES ~]$ openshift-install --dir=install_dir/ wait-for bootstrap-complete --log-level=info
INFO Waiting up to 20m0s for the Kubernetes API at https://api.lab.okd.local:6443...
INFO API v1.18.3 up
INFO Waiting up to 40m0s for bootstrapping to complete...
W0731 09:41:39.136645 7656 reflector.go:326] k8s.io/client-go/tools/watch/informerwatcher.go:146: watch of *v1.ConfigMap ended with: very short watch: k8s.io/client-go/tools/watch/informerwatcher.go:146: Unexpected watch close - watch lasted less than a second and no items received
E0731 09:41:42.143697 7656 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get https://api.lab.okd.local:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: EOF
E0731 09:41:43.146363 7656 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get https://api.lab.okd.local:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: EOF
E0731 09:41:44.149718 7656 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get https://api.lab.okd.local:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: EOF
E0731 09:41:45.151322 7656 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get https://api.lab.okd.local:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: EOF
E0731 09:41:46.153482 7656 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get https://api.lab.okd.local:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: EOF
E0731 09:41:47.155610 7656 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get https://api.lab.okd.local:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: EOF
E0731 09:41:48.157890 7656 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get https://api.lab.okd.local:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: EOF
E0731 09:41:49.159739 7656 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get https://api.lab.okd.local:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: EOF
E0731 09:41:50.162047 7656 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get https://api.lab.okd.local:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: EOF
E0731 09:41:51.164177 7656 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get https://api.lab.okd.local:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: EOF
E0731 09:41:52.167620 7656 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get https://api.lab.okd.local:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: EOF
E0731 09:41:53.171616 7656 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get https://api.lab.okd.local:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: EOF
E0731 09:41:54.176786 7656 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get https://api.lab.okd.local:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: EOF
E0731 09:41:55.178870 7656 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get https://api.lab.okd.local:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: EOF
E0731 09:41:56.180965 7656 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get https://api.lab.okd.local:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: EOF
E0731 09:41:57.183155 7656 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get https://api.lab.okd.local:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: EOF
E0731 09:41:58.185372 7656 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get https://api.lab.okd.local:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: EOF
```
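For what it's worth, the gather step the installer mentions above would look roughly like this (a sketch; the bootstrap and master addresses below are placeholders on the 10.10.10.x lab network and must be adjusted):

```sh
# Collect bootstrap and control-plane logs into a local tarball
openshift-install gather bootstrap --dir=install_dir/ \
  --bootstrap 10.10.10.200 --master 10.10.10.201

# Or watch bootstrap progress directly on the bootstrap node
ssh core@10.10.10.200 'journalctl -b -f -u release-image.service -u bootkube.service'
```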

What you expected to happen?

The masters start successfully.

References

https://github.com/openshift/installer/issues/2988 https://github.com/openshift/installer/issues/2687

RuiGarciaSantos commented 4 years ago

log.txt

fenggolang commented 4 years ago

Hello, I also encountered a similar problem, have you solved it?

RuiGarciaSantos commented 4 years ago

Checking DNS, DHCP, HAProxy ... something "basic" is not OK.

oguzhalit commented 4 years ago

I also encountered a similar problem. How did you solve it?

RuiGarciaSantos commented 4 years ago

It is working now. DNS needs to be set up correctly: if the "dig" command is not showing the expected output, do not move forward. I will try to record a live video, but for now I am going on vacation.
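A minimal DNS sanity check, using the hostnames from this thread (substitute your own domain), could look like this:

```sh
# All of these should answer from the default resolver, with no @server override:
dig +short api.lab.okd.local
dig +short api-int.lab.okd.local
dig +short test.apps.lab.okd.local            # exercises the *.apps wildcard record
dig +short SRV _etcd-server-ssl._tcp.lab.okd.local

# If the answers only appear with "dig @127.0.0.1 ...", the host's
# /etc/resolv.conf is not pointing at your named server.
```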

RuiGarciaSantos commented 4 years ago

I was able to set up OKD 4.5 with:

dhcpd.conf:

```
################################################################################
# https://linuxconfig.org/what-is-dhcp-and-how-to-configure-dhcp-server-in-linux
# /usr/share/doc/dhcp-server/dhcpd.conf.example
################################################################################
#
# DHCP Server Configuration file.
#   see /usr/share/doc/dhcp-server/dhcpd.conf.example
#   see dhcpd.conf(5) man page
#
################################################################################

option domain-name "ibmpt.org";
option domain-name-servers 192.168.1.210;
default-lease-time 86400;     # 24h
max-lease-time 172800;        # 48h
authoritative;

log-facility local0;          # sudo grep dhcpd /var/log/messages

subnet 192.168.1.0 netmask 255.255.255.0 {
  range 192.168.1.200 192.168.1.250;
  option broadcast-address 192.168.1.255;
  option routers 192.168.1.1;
}

host rhos-bootstrap.ibmpt.org {hardware ethernet 00:50:56:2C:DD:7F; fixed-address 192.168.1.200;}
host rhos-master-01.ibmpt.org {hardware ethernet 00:50:56:3B:7D:7C; fixed-address 192.168.1.201;}
host rhos-worker-01.ibmpt.org {hardware ethernet 00:50:56:3D:A2:03; fixed-address 192.168.1.204;}
host rhos-services.ibmpt.org  {hardware ethernet 00:0C:29:AC:5C:C2; fixed-address 192.168.1.210;}
```
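A quick way to confirm the reservations above are actually being handed out (a sketch; assumes the ISC dhcp-server packaging used on CentOS/Fedora):

```sh
# Lease activity (log-facility local0 above also lands in /var/log/messages)
sudo journalctl -u dhcpd -f
sudo grep dhcpd /var/log/messages

# Leases currently known to dhcpd
sudo cat /var/lib/dhcpd/dhcpd.leases

# On a booted node, confirm it received its fixed address
ip -4 addr show
```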

/etc/named.conf:

```
//
// named.conf
//
// Provided by Red Hat bind package to configure the ISC BIND named(8) DNS
// server as a caching only nameserver (as a localhost DNS resolver only).
//
// See /usr/share/doc/bind*/sample/ for example named configuration files.
//

options {
    listen-on port 53 { 127.0.0.1; 192.168.1.210; };
    listen-on-v6 port 53 { ::1; };

    directory       "/var/named";
    dump-file       "/var/named/data/cache_dump.db";
    statistics-file "/var/named/data/named_stats.txt";
    memstatistics-file "/var/named/data/named_mem_stats.txt";
    secroots-file   "/var/named/data/named.secroots";
    recursing-file  "/var/named/data/named.recursing";
    allow-query     { localhost; 192.168.1.0/24; };

    /*
     - If you are building an AUTHORITATIVE DNS server, do NOT enable recursion.
     - If you are building a RECURSIVE (caching) DNS server, you need to enable
       recursion.
     - If your recursive DNS server has a public IP address, you MUST enable access
       control to limit queries to your legitimate users. Failing to do so will
       cause your server to become part of large scale DNS amplification
       attacks. Implementing BCP38 within your network would greatly
       reduce such attack surface
    */
    recursion yes;

    forwarders {
        8.8.8.8;
        8.8.4.4;
    };

    dnssec-enable yes;
    dnssec-validation yes;

    managed-keys-directory "/var/named/dynamic";

    pid-file "/run/named/named.pid";
    session-keyfile "/run/named/session.key";

    /* https://fedoraproject.org/wiki/Changes/CryptoPolicy */
    include "/etc/crypto-policies/back-ends/bind.config";
};

logging {
    channel default_debug {
        file "data/named.run";
        severity dynamic;
    };
};

zone "." IN {
    type hint;
    file "named.ca";
};

include "/etc/named.rfc1912.zones";
include "/etc/named.root.key";
include "/etc/named/named.conf.local";
```

/etc/named/named.conf.local:

```
// DNS Forwarding
zone "ibmpt.org" {
    type master;
    file "/etc/named/zones/db.FORWARD";
};

// DNS Reverse Name Resolution
zone "1.168.192.in-addr.arpa" {
    type master;
    file "/etc/named/zones/db.REVERSE";
};
```

/etc/named/zones/db.FORWARD:

```
; ------------------------------------------------------------------------------
; Forward DNS Records
; ------------------------------------------------------------------------------
$TTL 1W
@ IN SOA rhos-services.ibmpt.org. admin.ibmpt.org. (
        1       ; Serial
        1W      ; Refresh
        1D      ; Retry
        365D    ; Expire
        1W )    ; Negative Cache TTL
; ------------------------------------------------------------------------------
; Name Servers - NS records
; ------------------------------------------------------------------------------
        IN NS   rhos-services.ibmpt.org.
; ------------------------------------------------------------------------------
; Name Servers - A records
; ------------------------------------------------------------------------------
rhos-services.ibmpt.org.        IN A    192.168.1.210
; ------------------------------------------------------------------------------
; OpenShift Container Platform DNS - Cluster
; ------------------------------------------------------------------------------
rhos-bootstrap.lab.ibmpt.org.   IN A    192.168.1.200
rhos-master-01.lab.ibmpt.org.   IN A    192.168.1.201
rhos-master-02.lab.ibmpt.org.   IN A    192.168.1.202
rhos-master-03.lab.ibmpt.org.   IN A    192.168.1.203
rhos-worker-01.lab.ibmpt.org.   IN A    192.168.1.204
rhos-worker-02.lab.ibmpt.org.   IN A    192.168.1.205
; ------------------------------------------------------------------------------
; OpenShift Container Platform DNS - Kubernetes API
; ------------------------------------------------------------------------------
api.lab.ibmpt.org.              IN A    192.168.1.210
api-int.lab.ibmpt.org.          IN A    192.168.1.210
; ------------------------------------------------------------------------------
; OpenShift Container Platform DNS - Routes
; ------------------------------------------------------------------------------
*.apps.lab.ibmpt.org.           IN A    192.168.1.210
; ------------------------------------------------------------------------------
; OpenShift Container Platform DNS - etcd
; ------------------------------------------------------------------------------
etcd-0.lab.ibmpt.org.           IN A    192.168.1.201
etcd-1.lab.ibmpt.org.           IN A    192.168.1.202
etcd-2.lab.ibmpt.org.           IN A    192.168.1.203
; ------------------------------------------------------------------------------
;
; ------------------------------------------------------------------------------
console-openshift-console.apps.lab.ibmpt.org.   IN A    192.168.1.210
oauth-openshift.apps.lab.ibmpt.org.             IN A    192.168.1.210
; ------------------------------------------------------------------------------
; OpenShift Container Platform SRV DNS records
; ------------------------------------------------------------------------------
_etcd-server-ssl._tcp.lab.ibmpt.org.    86400 IN SRV 0 10 2380 etcd-0.lab
;_etcd-server-ssl._tcp.lab.ibmpt.org.   86400 IN SRV 0 10 2380 etcd-1.lab
;_etcd-server-ssl._tcp.lab.ibmpt.org.   86400 IN SRV 0 10 2380 etcd-2.lab
; ------------------------------------------------------------------------------
```

/etc/named/zones/db.REVERSE:

```
; ------------------------------------------------------------------------------
; Reverse DNS Records
; ------------------------------------------------------------------------------
$TTL 1W
@ IN SOA rhos-services.ibmpt.org. admin.ibmpt.org. (
        6       ; Serial
        1W      ; Refresh
        1D      ; Retry
        365D    ; Expire
        1W )    ; Negative Cache TTL
; ------------------------------------------------------------------------------
; Name Servers - NS records
; ------------------------------------------------------------------------------
        IN NS   rhos-services.ibmpt.org.
; ------------------------------------------------------------------------------
; Name Servers - PTR records
; ------------------------------------------------------------------------------
210     IN PTR  rhos-services.ibmpt.org.
; ------------------------------------------------------------------------------
; OpenShift Container Platform Cluster - PTR records
; ------------------------------------------------------------------------------
200     IN PTR  rhos-bootstrap.lab.ibmpt.org.
201     IN PTR  rhos-master-01.lab.ibmpt.org.
202     IN PTR  rhos-master-02.lab.ibmpt.org.
203     IN PTR  rhos-master-03.lab.ibmpt.org.
204     IN PTR  rhos-worker-01.lab.ibmpt.org.
205     IN PTR  rhos-worker-02.lab.ibmpt.org.
210     IN PTR  api.lab.ibmpt.org.
210     IN PTR  api-int.lab.ibmpt.org.
; ------------------------------------------------------------------------------
```
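Before pointing the cluster at this DNS server, a validation pass like the following can catch zone-file typos (a sketch; the zone names and file paths are the ones used above):

```sh
# Syntax-check the main config and each zone file
sudo named-checkconf /etc/named.conf
sudo named-checkzone ibmpt.org /etc/named/zones/db.FORWARD
sudo named-checkzone 1.168.192.in-addr.arpa /etc/named/zones/db.REVERSE

# Restart named and confirm it is listening on the services IP
sudo systemctl restart named
sudo ss -lntup | grep ':53'
```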

haproxy.cfg:

```
#---------------------------------------------------------------------
# Example configuration for a possible web application.  See the
# full configuration options online.
#
#   https://www.haproxy.org/download/1.8/doc/configuration.txt
#
#---------------------------------------------------------------------

#---------------------------------------------------------------------
# Global settings
#---------------------------------------------------------------------
global
    # to have these messages end up in /var/log/haproxy.log you will
    # need to:
    #
    # 1) configure syslog to accept network log events.  This is done
    #    by adding the '-r' option to the SYSLOGD_OPTIONS in
    #    /etc/sysconfig/syslog
    #
    # 2) configure local2 events to go to the /var/log/haproxy.log
    #   file. A line like the following can be added to
    #   /etc/sysconfig/syslog
    #
    #    local2.*                       /var/log/haproxy.log
    #
    #log        /dev/log local0 info

    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     20000
    user        haproxy
    group       haproxy
    daemon

    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats

    # utilize system-wide crypto-policies
    # ssl-default-bind-ciphers PROFILE=SYSTEM
    # ssl-default-server-ciphers PROFILE=SYSTEM

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          5m
    timeout server          5m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 20000

#---------------------------------------------------------------------
#
#---------------------------------------------------------------------
listen stats
    bind *:9000
    mode http
    stats enable
    stats uri /
    monitor-uri /healthz

#---------------------------------------------------------------------
#
#---------------------------------------------------------------------
frontend okd4_k8s_api_fe
    bind *:6443
    default_backend okd4_k8s_api_be
    mode tcp
    option tcplog

backend okd4_k8s_api_be
    balance source
    mode tcp
    server rhos-bootstrap 192.168.1.200:6443 check
    server rhos-master-01 192.168.1.201:6443 check
#   server rhos-master-xx 192.168.1.xxx:6443 check
#   server rhos-master-xx 192.168.1.xxx:6443 check
#
frontend okd4_machine_config_server_fe
    bind *:22623
    default_backend okd4_machine_config_server_be
    mode tcp
    option tcplog

backend okd4_machine_config_server_be
    balance source
    mode tcp
    server rhos-bootstrap 192.168.1.200:22623 check
    server rhos-master-01 192.168.1.201:22623 check
#   server rhos-master-xx 192.168.1.xxx:22623 check
#   server rhos-master-xx 192.168.1.xxx:22623 check
#
frontend okd4_http_ingress_traffic_fe
    bind *:80
    default_backend okd4_http_ingress_traffic_be
    mode tcp
    option tcplog

backend okd4_http_ingress_traffic_be
    balance source
    mode tcp
    server rhos-worker-01 192.168.1.204:80 check
#   server rhos-worker-xx 192.168.1.xxx:80 check
#
frontend okd4_https_ingress_traffic_fe
    bind *:443
    default_backend okd4_https_ingress_traffic_be
    mode tcp
    option tcplog

backend okd4_https_ingress_traffic_be
    balance source
    mode tcp
    server rhos-worker-01 192.168.1.204:443 check
#   server rhos-worker-xx 192.168.1.xxx:443 check
#---------------------------------------------------------------------
```
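And a similar sanity check for the load balancer (a sketch; the SELinux boolean is an extra step commonly used in these lab guides when HAProxy refuses to bind the non-standard ports):

```sh
# Validate the configuration before restarting
sudo haproxy -c -f /etc/haproxy/haproxy.cfg

# If SELinux blocks binding to 6443/22623, this is a commonly used fix
sudo setsebool -P haproxy_connect_any 1

# Restart and confirm all frontends are listening
sudo systemctl restart haproxy
sudo ss -lntp | grep -E ':(6443|22623|80|443|9000)'
```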

If, instead of "dig ibmpt.org", you need to use "dig @127.0.0.1 ibmpt.org" to get the right reply, you have a problem.

```sh
nmcli device show ens32
nmcli connection show
nmcli -f ip4 device show ens32
sudo nmcli connection modify VM ipv4.dns-search "ibmpt.org"
sudo nmcli con mod VM ipv4.dns "127.0.0.1 192.168.1.1"
nmcli con mod VM ipv4.ignore-auto-dns yes

sudo grep "search lan" /etc/resolv.conf
sudo sed -i 's/search lan/search ibmpt.org/' /etc/resolv.conf
sudo grep "search ibmpt.org" /etc/resolv.conf

sudo systemctl reload NetworkManager
sudo systemctl restart NetworkManager

nmcli con down VM
nmcli con up VM
nmcli device show ens32
```

LAST NOTE: I have tried to perform the same process and sequence with OpenShift (not OKD) and got some messages suggesting that my infrastructure is undersized.

Yes, I know about the HW requirements, but for practicing and exercising we should not need a full datacenter. In time (after vacation) I will try OpenShift with 2 Lenovo workstations.

The guide from Craig Robinson at https://itnext.io/okd-4-5-single-node-cluster-on-windows-10-using-hyper-v-3ffb7b369245 was a good start, but I have made some changes.

WillNilges commented 4 years ago

I'm getting the exact same output from `openshift-install --dir=install_dir/ wait-for bootstrap-complete --log-level=info`. The command then hung for another hour before hitting me with this:

ERROR Cluster operator authentication Degraded is True with ConfigObservation_Error::IngressStateEndpoints_MissingEndpoints::RouterCerts_NoRouterCertSecret: ConfigObservationDegraded: secret "v4-0-config-system-router-certs" not found
RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret "v4-0-config-system-router-certs" not found
IngressStateEndpointsDegraded: No endpoints found for oauth-server 
INFO Cluster operator authentication Progressing is Unknown with NoData:  
INFO Cluster operator authentication Available is Unknown with NoData:  
INFO Cluster operator csi-snapshot-controller Progressing is True with _AsExpected: Progressing: Waiting for Deployment to deploy csi-snapshot-controller pods 
INFO Cluster operator csi-snapshot-controller Available is False with _AsExpected: Available: Waiting for Deployment to deploy csi-snapshot-controller pods 
ERROR Cluster operator etcd Degraded is True with EnvVarController_Error::InstallerController_Error::RevisionController_ContentCreationError::ScriptController_Error::StaticPods_Error: ScriptControllerDegraded: "configmap/etcd-pod": missing env var values
EnvVarControllerDegraded: at least three nodes are required to have a valid configuration
RevisionControllerDegraded: configmaps "etcd-pod" not found
StaticPodsDegraded: pods "etcd-okd4-control-plane-1.lab.okd.local" not found
InstallerControllerDegraded: missing required resources: [configmaps: etcd-scripts,restore-etcd-pod, configmaps: config-1,etcd-metrics-proxy-client-ca-1,etcd-metrics-proxy-serving-ca-1,etcd-peer-client-ca-1,etcd-pod-1,etcd-serving-ca-1, secrets: etcd-all-peer-1,etcd-all-serving-1,etcd-all-serving-metrics-1] 
INFO Cluster operator etcd Progressing is True with NodeInstaller: NodeInstallerProgressing: 1 nodes are at revision 0; 0 nodes have achieved new revision 1 
INFO Cluster operator etcd Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 1 nodes are at revision 0; 0 nodes have achieved new revision 1 
ERROR Cluster operator kube-apiserver Degraded is True with StaticPods_Error: StaticPodsDegraded: pod/kube-apiserver-okd4-control-plane-1.lab.okd.local container "kube-apiserver" is not ready: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-okd4-control-plane-1.lab.okd.local_openshift-kube-apiserver(0355d3951816f11b71766e795c4736bb)
StaticPodsDegraded: pod/kube-apiserver-okd4-control-plane-1.lab.okd.local container "kube-apiserver" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-okd4-control-plane-1.lab.okd.local_openshift-kube-apiserver(0355d3951816f11b71766e795c4736bb) 
INFO Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 1 nodes are at revision 0; 0 nodes have achieved new revision 2 
INFO Cluster operator kube-apiserver Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 1 nodes are at revision 0; 0 nodes have achieved new revision 2 
INFO Cluster operator openshift-apiserver Available is False with APIServices_PreconditionNotReady: APIServicesAvailable: PreconditionNotReady 
INFO Cluster operator openshift-controller-manager Progressing is True with _DesiredStateNotYetAchieved: Progressing: daemonset/controller-manager: observed generation is 0, desired generation is 8.
Progressing: daemonset/controller-manager: number available is 0, desired number available > 1 
INFO Cluster operator openshift-controller-manager Available is False with _NoPodsAvailable: Available: no daemon pods available on any node. 
INFO Cluster operator operator-lifecycle-manager-packageserver Available is False with :  
INFO Cluster operator operator-lifecycle-manager-packageserver Progressing is True with : Working toward 0.15.1 
INFO Use the following commands to gather logs from the cluster 
INFO openshift-install gather bootstrap --help    
FATAL failed to wait for bootstrapping to complete: timed out waiting for the condition 

I'm pretty sure my DNS is correct. Looks like an SSL issue. Any suggestions? (Sorry if this is off topic :sweat_smile: )

RuiGarciaSantos commented 4 years ago

Is this an OKD or RHOCP install? I was never able to run RHOCP on a single node. Is this an OKD single-node install? HW sizing matters.

Regarding DNS, confirm the dig output without specifying "@127.0.0.1": if you need to use "dig @127.0.0.1 xxxdomainxxx" instead of "dig xxxdomainxxx" to get the right reply, it is not OK.

Also be sure that the machines obtain the right IPs through DHCP ...

From my experience, "90% of possible problem sources" come down to not meeting the requirements.
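A node-side sanity check covering both points (a sketch; run on the bootstrap or a master as the `core` user, with the hostnames used earlier in this thread):

```sh
# Did the node get its reserved DHCP address and the expected resolver?
ip -4 addr show
cat /etc/resolv.conf

# Do the cluster names resolve, and is the API reachable through the LB?
# curl prints "Could not resolve host" when DNS is broken; any HTTP status
# (even 403) proves the network path works, while "connection refused"
# points at haproxy or the API server itself.
curl -kI https://api.lab.okd.local:6443/version
curl -kI https://api-int.lab.okd.local:22623/healthz
```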

WillNilges commented 4 years ago

This is OKD 4.5. I had a single ctrl plane node with 4CPUs and 16GB of RAM on Proxmox. I doubled the core count and will try again tomorrow. I'll keep you posted, thanks for the advice!

RuiGarciaSantos commented 4 years ago

16 GB may be too little ... even for a single node.

RuiGarciaSantos commented 4 years ago

I don't know about Proxmox HW requirements.

WillNilges commented 4 years ago

Good morning, got the same errors :(

[root@api-int ~]# openshift-install --dir=install_dir/ wait-for bootstrap-complete --log-level=info
INFO Waiting up to 20m0s for the Kubernetes API at https://api.lab.okd.local:6443... 
INFO API v1.18.3 up                               
INFO Waiting up to 40m0s for bootstrapping to complete... 
W1013 09:12:32.357751    1260 reflector.go:326] k8s.io/client-go/tools/watch/informerwatcher.go:146: watch of *v1.ConfigMap ended with: very short watch: k8s.io/client-go/tools/watch/informerwatcher.go:146: Unexpected watch close - watch lasted less than a second and no items received
ERROR Cluster operator authentication Degraded is True with ConfigObservation_Error::IngressStateEndpoints_MissingEndpoints::RouterCerts_NoRouterCertSecret: RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret "v4-0-config-system-router-certs" not found
ConfigObservationDegraded: secret "v4-0-config-system-router-certs" not found
IngressStateEndpointsDegraded: No endpoints found for oauth-server 
INFO Cluster operator authentication Progressing is Unknown with NoData:  
INFO Cluster operator authentication Available is Unknown with NoData:  
INFO Cluster operator csi-snapshot-controller Progressing is True with _AsExpected: Progressing: Waiting for Deployment to deploy csi-snapshot-controller pods 
INFO Cluster operator csi-snapshot-controller Available is False with _AsExpected: Available: Waiting for Deployment to deploy csi-snapshot-controller pods 
ERROR Cluster operator etcd Degraded is True with EnvVarController_Error::InstallerController_Error::RevisionController_ContentCreationError::ScriptController_Error::StaticPods_Error: RevisionControllerDegraded: configmaps "etcd-pod" not found
EnvVarControllerDegraded: at least three nodes are required to have a valid configuration
StaticPodsDegraded: pods "etcd-okd4-control-plane-1.lab.okd.local" not found
InstallerControllerDegraded: missing required resources: [configmaps: etcd-scripts,restore-etcd-pod, configmaps: config-1,etcd-metrics-proxy-client-ca-1,etcd-metrics-proxy-serving-ca-1,etcd-peer-client-ca-1,etcd-pod-1,etcd-serving-ca-1, secrets: etcd-all-peer-1,etcd-all-serving-1,etcd-all-serving-metrics-1]
ScriptControllerDegraded: "configmap/etcd-pod": missing env var values 
INFO Cluster operator etcd Progressing is True with NodeInstaller: NodeInstallerProgressing: 1 nodes are at revision 0; 0 nodes have achieved new revision 1 
INFO Cluster operator etcd Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 1 nodes are at revision 0; 0 nodes have achieved new revision 1 
INFO Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 1 nodes are at revision 0; 0 nodes have achieved new revision 2 
INFO Cluster operator kube-apiserver Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 1 nodes are at revision 0; 0 nodes have achieved new revision 2 
INFO Cluster operator openshift-apiserver Available is False with APIServices_PreconditionNotReady: APIServicesAvailable: PreconditionNotReady 
INFO Cluster operator openshift-controller-manager Progressing is True with _DesiredStateNotYetAchieved: Progressing: daemonset/controller-manager: observed generation is 0, desired generation is 8.
Progressing: daemonset/controller-manager: number available is 0, desired number available > 1 
INFO Cluster operator openshift-controller-manager Available is False with _NoPodsAvailable: Available: no daemon pods available on any node. 
INFO Cluster operator operator-lifecycle-manager-packageserver Available is False with :  
INFO Cluster operator operator-lifecycle-manager-packageserver Progressing is True with : Working toward 0.15.1 
INFO Use the following commands to gather logs from the cluster 
INFO openshift-install gather bootstrap --help    
FATAL failed to wait for bootstrapping to complete: timed out waiting for the condition

The guide I'm following recommends 8 CPUs and 16 GB, which is what I'm at. I guess I could try chucking another 8 GB of RAM at it, but at this point I might also try using the actual recommended minimums.

gustavoseixas commented 3 years ago

Same error. My 'dig' result is:

image The 192.168.1.210 is the IP of the 'services machine'. Is this correct?

On the bootstrap machine and the services machine, quay.io is accessible: image

Some errors: image

k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.RoleBinding: Get https://localhost:6443/apis/rbac.authorization.k8s.io/v1/rolebindings?allowWatchBookmarks=true&resourceVersion=1904&timeout=7m24s&timeoutSeconds=444&watch=true: dial tcp [::1]:6443: connect: connection refused

WillNilges commented 3 years ago

> The 192.168.1.210 is the IP of the 'services machine'.

That depends. Is that the IP of your services machine? That last error you posted looks like it might be caused by a problem with DNS. I think I know what guide you're trying to follow. Can you show me your named configs?

gustavoseixas commented 3 years ago

Hi WillNilges! The guide is indeed this one: https://itnext.io/guide-installing-an-okd-4-5-cluster-508a2631cbee

This is my /etc/named.conf and /etc/named/named.conf.local:

image .. image

What was different, compared to Craig's tutorial, is the line below: SERVER was 127.0.0.1, not 192.168.1.210:

image

I don't know if that indicates a mistake.

In this attempt, it ran with this config: image

In this attempt, only the bootstrap and Control-Plane-1 were started. image The pairs of MAC address and IP are correctly configured and checked in pfSense DHCP. I used this tutorial: OKD 4.5 Single Node Cluster on Windows 10 using Hyper-V

..but with these DHCP configs: image

And.. image

Finally: image

But: image

WillNilges commented 3 years ago

Fascinating. So you can indeed bootstrap a single node cluster, but then you get errors after the fact? What are the hardware specs of your nodes? Also, are you asking about installing a multi-node cluster like Craig does, or a single node cluster?

> What was different, compared to Craig's tutorial, is the line below: SERVER was 127.0.0.1, not 192.168.1.210:

What command did you run to get that? Looks like a dig. dig on what? Also, I'm assuming that you mirrored all config files, which, if you followed Craig's instructions to a T, should be fine.

Also, if you're trying to run one node, he's got instructions for that but I recommend against it, since etcd is much happier with at least 3 nodes, cluster load will be more spread out, etc.

I've tried installing OKD on a wide range of hardware, and you do need some pretty beefy machines for it.

gustavoseixas commented 3 years ago

"Failed to access localhost:6443" errors occurred in the middle of the process and then stopped showing. I can't say whether they were resolved automatically in the meantime; I suspect not. I had tried installing the cluster with three masters and two workers, but the two workers did not come up, due to the same problem accessing localhost:6443. I suspect something related to a certificate.

My machine is a Dell XPS 8930 tower with 64 GB of RAM. The cluster was built on Hyper-V, using CentOS for the services machine and Fedora CoreOS for the nodes. When setting up the "single node cluster" attempt, I set 16 GB as the minimum memory for bootstrap, master, and services. I have given up, for now, on trying to configure anything bigger than that on my machine.

image

> What command did you run to get that? Looks like a dig. dig on what? Also, I'm assuming that you mirrored all config files, which, if you followed Craig's instructions to a T, should be fine.

The commands are: `dig okd.local` or `dig -x 192.168.1.210`

Yes, I mirrored all config files from the tutorial.

On the "single node cluster" I followed the guidance of the official FAQ, on OKD: image

[I've tried installing OKD on a wide range of hardware, and you do need some pretty beefy machines for it.]

Perhaps this is the problem.

chany93 commented 3 years ago

> Hi WillNilges! The guide is indeed this one: https://itnext.io/guide-installing-an-okd-4-5-cluster-508a2631cbee
>
> This is my /etc/named.conf and /etc/named/named.conf.local:
>
> image .. image
>
> What was different, compared to Craig's tutorial, is the line below: SERVER was 127.0.0.1, not 192.168.1.210:
>
> image
>
> I don't know if that indicates a mistake.
>
> In this attempt, it ran with this config: image
>
> In this attempt, only the bootstrap and Control-Plane-1 were started. image The pairs of MAC address and IP are correctly configured and checked in pfSense DHCP. I used this tutorial: OKD 4.5 Single Node Cluster on Windows 10 using Hyper-V
>
> ..but with these DHCP configs: image
>
> And.. image
>
> Finally: image
>
> But: image

Hi! I have the exact same error on the bootstrap; the "dig" commands work fine.

image

How did you solve it?

WillNilges commented 3 years ago

@chany93 @gustavoseixas This is generalizing, but when I have trouble with OKD, it's usually the DNS, and if it's not, then you don't have enough resources. I can only recommend following Craig's guide to a T, resources, networking and all, until you get it right at least once. Then you can experiment from there. To that end, I've been trying to learn Go and have been working on a (really hacky and awful) project to help with this. It's got code for generating haproxy and named files, as well as a couple of scripts to help with services-machine (I guess Red Hat calls it the backplane?) configuration.

More specifically, those errors smell like network or disk latency. To investigate further, I'd want to check out some node logs and see the overall status of pods. Once the cluster bootstraps, it's not totally done installing. That takes another 30-60 minutes. However, things will simply fall apart (quite violently, might I add) if your metal is too slow.
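Concretely, "check node logs and pod status" could look something like this (a sketch, assuming the kubeconfig generated in the install dir and SSH access to the nodes as `core`; the master IP is illustrative):

```sh
# From the services host, once the API answers
export KUBECONFIG=install_dir/auth/kubeconfig
oc get nodes
oc get clusteroperators
oc get pods --all-namespaces | grep -vE 'Running|Completed'

# On a node, inspect kubelet and the containers it is trying to run
ssh core@192.168.1.201 'journalctl -b -u kubelet --no-pager | tail -n 100'
ssh core@192.168.1.201 'sudo crictl ps -a'
```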

ghost commented 3 years ago

image I encountered the same problem when installing OKD 4.5. What could be the reason?

WillNilges commented 3 years ago

> image I encountered the same problem when installing OKD 4.5. What could be the reason?

Whoa, that's kinda wack. Did you try a clean install, or overwrite a previous attempt? Seems to me like perhaps your bootstrap didn't even ignite. Can you provide a little more info?

ghost commented 3 years ago

Thank you for your reply. I tried a fresh installation, but the same error message is still there. See the attached log bundle for the full logs. [Uploading log-bundle-20201221154859.tar.gz…]()

TheFrisianClause commented 3 years ago

Well, I found the solution to this: when you boot the Fedora CoreOS machines one by one, I would suggest changing SELinux to permissive mode, as Fedora CoreOS sets SELinux to 'enforcing' by default. So I would still create the machines one by one, but instead of deploying them right after each other I would suggest: deploy -> change the SELinux setting -> reboot -> deploy the next machine, for a smooth install.

This also helps with the later steps of the OKD cluster install, which Craig Robinson also covers.
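For reference, the SELinux change being suggested would look roughly like this on a node (a sketch; note that the next reply advises caution with this approach):

```sh
# Check the current mode (Fedora CoreOS defaults to enforcing)
getenforce

# Switch to permissive until the next reboot
sudo setenforce 0

# Persist the change across reboots
sudo sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config
```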

WillNilges commented 3 years ago

That was not something I had to do. I also do not know enough about SELinux in OKD to make a call on whether or not that's a good idea. Personally, I'd shy away from that solution.


openshift-bot commented 3 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 3 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot commented 3 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci[bot] commented 3 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/installer/issues/3994#issuecomment-855667980):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting `/reopen`. Mark the issue as fresh by commenting `/remove-lifecycle rotten`. Exclude this issue from closing again by commenting `/lifecycle frozen`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.