okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

OKD & Proxmox: 4.11.0 console and authentication not working after bootstrap #1439

Closed KvnOnWeb closed 1 year ago

KvnOnWeb commented 1 year ago

Describe the bug
Hi! I have a problem once the bootstrap completes: the console and authentication are not Ready. I get 503 errors: "APIServicesAvailable: "oauth.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request..."

I have six Proxmox servers, with one "okd-services" VM that hosts the DNS server, haproxy and the PXE server, plus 3 masters and 3 workers. Name resolution between the nodes and masters works (checked with ping). When I try to access the console, I reach the cluster but get the "application is not available" screen (the same one you get when a route does not exist in the cluster).

Note: I also have a single-server Proxmox home lab (with 6 VMs: 3 masters / 3 workers) and the same install works there.

Install configuration (install-config.yaml):

apiVersion: v1
baseDomain: {{ dns.domain }}
metadata:
  name: {{ dns.clusterid }}
compute:
- hyperthreading: Enabled
  name: worker
  replicas: 3
controlPlane:
  hyperthreading: Enabled
  name: master
  replicas: 3
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  none: {}
fips: false
pullSecret: '{{ pull_secret }}'
sshKey: "{{ ssh_pub_key }}"
[root@okd-services ~]# oc get nodes
NAME           STATUS   ROLES    AGE   VERSION
okd-master-1   Ready    master   52m   v1.24.6+5658434
okd-master-2   Ready    master   52m   v1.24.6+5658434
okd-master-3   Ready    master   52m   v1.24.6+5658434
okd-worker-1   Ready    worker   39m   v1.24.6+5658434
okd-worker-2   Ready    worker   39m   v1.24.6+5658434
okd-worker-3   Ready    worker   39m   v1.24.6+5658434

Version
4.11.0-0.okd-2022-12-02-145640, UPI, platform none

How reproducible
100%

Log bundle

[root@okd-services ~]# oc get clusteroperators
NAME                                       VERSION                          AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.11.0-0.okd-2022-12-02-145640   False       False         True       45m     APIServicesAvailable: "oauth.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request...
baremetal                                  4.11.0-0.okd-2022-12-02-145640   True        False         False      44m
cloud-controller-manager                   4.11.0-0.okd-2022-12-02-145640   True        False         False      46m
cloud-credential                           4.11.0-0.okd-2022-12-02-145640   True        False         False      46m
cluster-autoscaler                         4.11.0-0.okd-2022-12-02-145640   True        False         False      43m
config-operator                            4.11.0-0.okd-2022-12-02-145640   True        False         False      45m
console                                    4.11.0-0.okd-2022-12-02-145640   False       False         True       37m     RouteHealthAvailable: console route is not admitted
csi-snapshot-controller                    4.11.0-0.okd-2022-12-02-145640   True        False         False      44m
dns                                        4.11.0-0.okd-2022-12-02-145640   True        False         False      43m
etcd                                       4.11.0-0.okd-2022-12-02-145640   True        False         False      42m
image-registry                             4.11.0-0.okd-2022-12-02-145640   False       True          True       30m     NodeCADaemonAvailable: The daemon set node-ca has available replicas...
ingress                                    4.11.0-0.okd-2022-12-02-145640   True        False         False      4m11s
insights                                   4.11.0-0.okd-2022-12-02-145640   True        False         False      38m
kube-apiserver                             4.11.0-0.okd-2022-12-02-145640   True        True          False      40m     NodeInstallerProgressing: 2 nodes are at revision 8; 1 nodes are at revision 9
kube-controller-manager                    4.11.0-0.okd-2022-12-02-145640   True        False         True       42m     GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.42.204:9091: connect: connection refused
kube-scheduler                             4.11.0-0.okd-2022-12-02-145640   True        True          False      41m     NodeInstallerProgressing: 1 nodes are at revision 7; 2 nodes are at revision 8
kube-storage-version-migrator              4.11.0-0.okd-2022-12-02-145640   True        False         False      44m
machine-api                                4.11.0-0.okd-2022-12-02-145640   True        False         False      44m
machine-approver                           4.11.0-0.okd-2022-12-02-145640   True        False         False      44m
machine-config                             4.11.0-0.okd-2022-12-02-145640   True        False         False      43m
marketplace                                4.11.0-0.okd-2022-12-02-145640   True        False         False      44m
monitoring                                                                  False       True          True       31m     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
network                                    4.11.0-0.okd-2022-12-02-145640   True        False         False      44m
node-tuning                                4.11.0-0.okd-2022-12-02-145640   True        False         False      43m
openshift-apiserver                        4.11.0-0.okd-2022-12-02-145640   False       False         False      29m     APIServicesAvailable: "apps.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request...
openshift-controller-manager               4.11.0-0.okd-2022-12-02-145640   True        False         False      44m
openshift-samples                          4.11.0-0.okd-2022-12-02-145640   True        False         False      34m
operator-lifecycle-manager                 4.11.0-0.okd-2022-12-02-145640   True        False         False      44m
operator-lifecycle-manager-catalog         4.11.0-0.okd-2022-12-02-145640   True        False         False      43m
operator-lifecycle-manager-packageserver   4.11.0-0.okd-2022-12-02-145640   True        False         False      6m7s
service-ca                                 4.11.0-0.okd-2022-12-02-145640   True        False         False      44m
storage                                    4.11.0-0.okd-2022-12-02-145640   True        False         False      45m

must-gather-okd-20221212.txt

titou10titou10 commented 1 year ago

I run the latest version of OKD (4.11.0-0.okd-2022-12-02-145640) on Proxmox v7.3.3 with no problems.

My setup is almost the same as yours (3 masters + 3 workers + 1 bootstrap for the installation). The differences are that my DNS server runs on the Proxmox host itself (bind9), and I have a dedicated VM for load balancing in front of OKD: a minimal AlmaLinux 9.1 install with a simple nginx setup that handles LB/routing of the API and apps to OKD, plus the iPXE bits during installation.

It could be that the problem comes from your load balancer (haproxy) not routing the traffic to OKD correctly.
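For comparison, a UPI load balancer in front of OKD normally needs plain TCP passthrough for the API (6443), the machine config server (22623) and the ingress routers (80/443 towards the workers). A rough haproxy sketch with illustrative hostnames taken from your node names (not your actual config, adjust as needed):

# illustrative haproxy.cfg fragment for an OKD UPI cluster (TCP passthrough only)
frontend api
    bind *:6443
    mode tcp
    default_backend api
backend api
    mode tcp
    balance roundrobin
    server okd-master-1 okd-master-1:6443 check
    server okd-master-2 okd-master-2:6443 check
    server okd-master-3 okd-master-3:6443 check

frontend machine-config
    bind *:22623
    mode tcp
    default_backend machine-config
backend machine-config
    mode tcp
    server okd-master-1 okd-master-1:22623 check
    server okd-master-2 okd-master-2:22623 check
    server okd-master-3 okd-master-3:22623 check

frontend ingress-https
    bind *:443
    mode tcp
    default_backend ingress-https
backend ingress-https
    mode tcp
    server okd-worker-1 okd-worker-1:443 check
    server okd-worker-2 okd-worker-2:443 check
    server okd-worker-3 okd-worker-3:443 check
# plus an equivalent frontend/backend pair on port 80 for ingress-http

If the *.apps traffic on 443/80 never reaches the ingress routers on the workers, the console and oauth routes become unreachable, which would match the symptoms above.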


vrutkovs commented 1 year ago

We're gonna need an archive produced by the must-gather tool, not its console output.
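Something like this, run from a machine with cluster-admin access (paths are just examples):

# collect diagnostics into a local directory, then archive it for upload
oc adm must-gather --dest-dir=./must-gather
tar czf must-gather.tar.gz must-gather/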

KvnOnWeb commented 1 year ago

@titou10titou10 Thanks.

Yes, I checked many times. After each bootstrap completes, I comment out the bootstrap entry in haproxy and stop the bootstrap VM. My home lab works fine with a single Proxmox server.

I have 6 Proxmox servers connected by a private mesh VPN (WireGuard). Each server is on its own 192.168.1X.1/24 network:

Server 1: 192.168.11.1
Server 2: 192.168.12.1
Server X: 192.168.1X.1

Each server runs dnsmasq (it acts as the PXE server, assigns the VM IPs by MAC address and provides the DNS configuration). Example dnsmasq config:

dhcp-range=192.168.11.10,192.168.11.250,12h
dhcp-lease-max=25
dhcp-host=D2:8E:FF:B3:01:73,192.168.11.11,okd-master-1
dhcp-host=12:8E:DE:B3:01:60,192.168.11.100,okd-services
dhcp-option=option:dns-server,192.168.11.100,1.1.1.1
dhcp-boot=pxelinux.0,,192.168.11.100
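For completeness, the records OKD needs can be verified against this resolver, for example (sketch; placeholders for the templated cluster id and domain, 192.168.11.100 is okd-services):

# illustrative checks against the okd-services DNS
dig +short api.<clusterid>.<domain> @192.168.11.100
dig +short api-int.<clusterid>.<domain> @192.168.11.100
dig +short console-openshift-console.apps.<clusterid>.<domain> @192.168.11.100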

I have a problem with the authentication operator (503) and the console cannot contact oauth (timeout). Very similar to https://github.com/okd-project/okd/issues/430
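For reference, the oauth route can also be probed directly through the LB/ingress path, for example (sketch; the hostname is this cluster's apps domain):

# end-to-end check of the oauth route through the load balancer and ingress routers
curl -kv https://oauth-openshift.apps.cloud.soyouweb.fr/healthz
# and the route / pods behind it
oc get route oauth-openshift -n openshift-authentication
oc get pods -n openshift-authentication -o wide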

My bad. The archive : must-gather.local.1194632374733407674.tar.gz

KvnOnWeb commented 1 year ago

Tried with networkType: OVNKubernetes. Same result.
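For reference, the networking stanza with that change would look like this (sketch; assumes the rest of the install-config above is unchanged):

networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16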

must-gather archive : must-gather.local.7911081472470111102.tar.gz

vrutkovs commented 1 year ago

Get \"https://oauth-openshift.apps.cloud.soyouweb.fr/healthz\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\nOAuthServerServiceEndpointAccessibleControllerDegraded: Get \"https://172.30.138.239:443/healthz\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

and

"WellKnownReadyControllerProgressing: kube-apiserver oauth endpoint https://192.168.12.12:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)"

in the authentication operator, as kube-apiserver never rolled out:

"NodeInstallerProgressing: 1 nodes are at revision 0; 1 nodes are at revision 3; 1 nodes are at revision 8"

installer pods cannot be created:

"Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-8-okd-master-1_openshift-kube-apiserver_62b703a9-0b89-4801-92dc-a6af80a6b611_0(c09814543ff0b731abf40775238958a55b55b10d5c1059d5b47b070163b5d453): error adding pod openshift-kube-apiserver_installer-8-okd-master-1 to CNI network \"multus-cni-network\": plugin type=\"multus\" name=\"multus-cni-network\" failed (add): [openshift-kube-apiserver/installer-8-okd-master-1/62b703a9-0b89-4801-92dc-a6af80a6b611:openshift-sdn]: error adding container to network \"openshift-sdn\": CNI request failed with status 400: 'the server was unable to return a response in the time allotted, but may still be processing the request (get pods installer-8-okd-master-1)\n'"

That looks like a weird networking issue.
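Possible checks on the CNI side (sketch; openshift-sdn applies to the OpenShiftSDN attempt, openshift-ovn-kubernetes to the OVN one):

# are the CNI daemons healthy on every node?
oc get pods -n openshift-sdn -o wide
oc get pods -n openshift-multus -o wide
# recent logs from the SDN daemonset
oc logs -n openshift-sdn ds/sdn -c sdn --tail=100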

Also, the must-gather is incomplete; it seems to be using some user cert and didn't fetch a lot of data from the openshift-* namespaces. Could you check whether it works using the installer-generated kubeconfig? Also try export KUBECONFIG=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost.kubeconfig on the masters.
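i.e. roughly this (sketch; the install-dir path is illustrative and the oc binary is assumed to be available wherever it runs):

# option 1: from okd-services, with the installer-generated admin kubeconfig
export KUBECONFIG=<install-dir>/auth/kubeconfig
oc adm must-gather --dest-dir=/tmp/must-gather

# option 2: directly on a master, with the node-local kubeconfig
sudo -i
export KUBECONFIG=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost.kubeconfig
oc get clusteroperators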

KvnOnWeb commented 1 year ago

Thanks. I think so too, but I haven't found the problem.

This must-gather should be complete: https://drive.google.com/file/d/1LMTX6oV4Mz7n5h1rQ5RSIog8G4ccNP_Q/view?usp=share_link

melledouwsma commented 1 year ago

I have 6 Proxmox servers connected by a private mesh VPN (WireGuard). Each server is on its own 192.168.1X.1/24 network.

Are you sure the private mesh VPN between the hypervisors isn't interfering here? I'm not 100% sure, but the must-gather does indicate to me some kind of peculiar networking issue outside of the cluster.
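One thing that may be worth checking (a guess on my part): WireGuard lowers the usable MTU on the inter-hypervisor links, and the cluster overlay adds its own encapsulation on top, so large packets between VMs on different hypervisors could be silently dropped. A quick path-MTU test between two hypervisors, for example:

# ping across the WireGuard mesh with "don't fragment" set;
# lower -s until it succeeds to find the usable MTU
ping -M do -s 1400 -c 3 192.168.12.1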

KvnOnWeb commented 1 year ago

I installed a new cluster on a single Proxmox server and it works. I am abandoning the installation across multiple Proxmox servers with the private mesh network. Thanks for your help.