okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

OKD & Proxmox: 4.11.0 console and authentication not working after bootstrap #1439

Closed KvnOnWeb closed 1 year ago

KvnOnWeb commented 1 year ago

Describe the bug
Hi! I have a problem once the bootstrap completes: the console and authentication are not Ready. I get 503 errors: "APIServicesAvailable: "oauth.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request..."

I have six Proxmox servers, with one "okd-services" VM that hosts the DNS server, haproxy and the PXE server, plus 3 masters and 3 workers. Name resolution between the nodes and masters works (checked with ping). When I try to access the console, I reach the cluster but get the "application is not available" screen (the same one you get when a route does not exist in the cluster).

Note: I also have a single-server Proxmox home lab (with 6 VMs: 3 masters / 3 workers) and the same install works there.

Install configuration (install-config.yaml):

apiVersion: v1
baseDomain: {{ dns.domain }}
metadata:
  name: {{ dns.clusterid }}
compute:
- hyperthreading: Enabled
  name: worker
  replicas: 3
controlPlane:
  hyperthreading: Enabled
  name: master
  replicas: 3
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  none: {}
fips: false
pullSecret: '{{ pull_secret }}'
sshKey: "{{ ssh_pub_key }}"
[root@okd-services ~]# oc get nodes
NAME           STATUS   ROLES    AGE   VERSION
okd-master-1   Ready    master   52m   v1.24.6+5658434
okd-master-2   Ready    master   52m   v1.24.6+5658434
okd-master-3   Ready    master   52m   v1.24.6+5658434
okd-worker-1   Ready    worker   39m   v1.24.6+5658434
okd-worker-2   Ready    worker   39m   v1.24.6+5658434
okd-worker-3   Ready    worker   39m   v1.24.6+5658434

Version
4.11.0-0.okd-2022-12-02-145640, UPI, platform none

How reproducible
100%

Log bundle

[root@okd-services ~]# oc get clusteroperators
NAME                                       VERSION                          AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.11.0-0.okd-2022-12-02-145640   False       False         True       45m     APIServicesAvailable: "oauth.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request...
baremetal                                  4.11.0-0.okd-2022-12-02-145640   True        False         False      44m
cloud-controller-manager                   4.11.0-0.okd-2022-12-02-145640   True        False         False      46m
cloud-credential                           4.11.0-0.okd-2022-12-02-145640   True        False         False      46m
cluster-autoscaler                         4.11.0-0.okd-2022-12-02-145640   True        False         False      43m
config-operator                            4.11.0-0.okd-2022-12-02-145640   True        False         False      45m
console                                    4.11.0-0.okd-2022-12-02-145640   False       False         True       37m     RouteHealthAvailable: console route is not admitted
csi-snapshot-controller                    4.11.0-0.okd-2022-12-02-145640   True        False         False      44m
dns                                        4.11.0-0.okd-2022-12-02-145640   True        False         False      43m
etcd                                       4.11.0-0.okd-2022-12-02-145640   True        False         False      42m
image-registry                             4.11.0-0.okd-2022-12-02-145640   False       True          True       30m     NodeCADaemonAvailable: The daemon set node-ca has available replicas...
ingress                                    4.11.0-0.okd-2022-12-02-145640   True        False         False      4m11s
insights                                   4.11.0-0.okd-2022-12-02-145640   True        False         False      38m
kube-apiserver                             4.11.0-0.okd-2022-12-02-145640   True        True          False      40m     NodeInstallerProgressing: 2 nodes are at revision 8; 1 nodes are at revision 9
kube-controller-manager                    4.11.0-0.okd-2022-12-02-145640   True        False         True       42m     GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.42.204:9091: connect: connection refused
kube-scheduler                             4.11.0-0.okd-2022-12-02-145640   True        True          False      41m     NodeInstallerProgressing: 1 nodes are at revision 7; 2 nodes are at revision 8
kube-storage-version-migrator              4.11.0-0.okd-2022-12-02-145640   True        False         False      44m
machine-api                                4.11.0-0.okd-2022-12-02-145640   True        False         False      44m
machine-approver                           4.11.0-0.okd-2022-12-02-145640   True        False         False      44m
machine-config                             4.11.0-0.okd-2022-12-02-145640   True        False         False      43m
marketplace                                4.11.0-0.okd-2022-12-02-145640   True        False         False      44m
monitoring                                                                  False       True          True       31m     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
network                                    4.11.0-0.okd-2022-12-02-145640   True        False         False      44m
node-tuning                                4.11.0-0.okd-2022-12-02-145640   True        False         False      43m
openshift-apiserver                        4.11.0-0.okd-2022-12-02-145640   False       False         False      29m     APIServicesAvailable: "apps.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request...
openshift-controller-manager               4.11.0-0.okd-2022-12-02-145640   True        False         False      44m
openshift-samples                          4.11.0-0.okd-2022-12-02-145640   True        False         False      34m
operator-lifecycle-manager                 4.11.0-0.okd-2022-12-02-145640   True        False         False      44m
operator-lifecycle-manager-catalog         4.11.0-0.okd-2022-12-02-145640   True        False         False      43m
operator-lifecycle-manager-packageserver   4.11.0-0.okd-2022-12-02-145640   True        False         False      6m7s
service-ca                                 4.11.0-0.okd-2022-12-02-145640   True        False         False      44m
storage                                    4.11.0-0.okd-2022-12-02-145640   True        False         False      45m

must-gather-okd-20221212.txt

titou10titou10 commented 1 year ago

I run the latest version of OKD (4.11.0-0.okd-2022-12-02-145640) on Proxmox v7.3.3 with no problems.

My setup is almost the same as yours (3 masters + 3 workers + 1 bootstrap for the installation). The differences are that my DNS server runs on the Proxmox host itself (bind9), and I have a dedicated VM for load balancing in front of OKD: a minimal AlmaLinux 9.1 install with a simple nginx setup that handles LB/routing of the API and apps to OKD, plus the iPXE bits during installation.

It could be that the problem comes from your load balancer (haproxy) not routing the traffic to OKD correctly.
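For comparison, a UPI load balancer in front of OKD normally needs plain TCP passthrough for the API (6443), the machine config server (22623) and the ingress routers (80/443 towards the workers). A rough haproxy sketch with illustrative hostnames taken from your node names (not your actual config, adjust as needed):

# illustrative haproxy.cfg fragment for an OKD UPI cluster (TCP passthrough only)
frontend api
    bind *:6443
    mode tcp
    default_backend api
backend api
    mode tcp
    balance roundrobin
    server okd-master-1 okd-master-1:6443 check
    server okd-master-2 okd-master-2:6443 check
    server okd-master-3 okd-master-3:6443 check

frontend machine-config
    bind *:22623
    mode tcp
    default_backend machine-config
backend machine-config
    mode tcp
    server okd-master-1 okd-master-1:22623 check
    server okd-master-2 okd-master-2:22623 check
    server okd-master-3 okd-master-3:22623 check

frontend ingress-https
    bind *:443
    mode tcp
    default_backend ingress-https
backend ingress-https
    mode tcp
    server okd-worker-1 okd-worker-1:443 check
    server okd-worker-2 okd-worker-2:443 check
    server okd-worker-3 okd-worker-3:443 check
# plus an equivalent frontend/backend pair on port 80 for ingress-http

If the *.apps traffic on 443/80 never reaches the ingress routers on the workers, the console and oauth routes become unreachable, which would match the symptoms above.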


vrutkovs commented 1 year ago

We're gonna need an archive produced by the must-gather tool, not its console output.
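Something like this, run from a machine with cluster-admin access (paths are just examples):

# collect diagnostics into a local directory, then archive it for upload
oc adm must-gather --dest-dir=./must-gather
tar czf must-gather.tar.gz must-gather/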

KvnOnWeb commented 1 year ago

@titou10titou10 Thanks.

Yes, I checked many times. After each bootstrap completes, I comment out the bootstrap entry in haproxy and stop the bootstrap VM. My home lab works fine with a single Proxmox server.

I have 6 Proxmox servers connected by a private mesh VPN (WireGuard). Each server is on its own 192.168.1X.1/24 network:

Server 1: 192.168.11.1
Server 2: 192.168.12.1
Server X: 192.168.1X.1

Each server runs dnsmasq (it acts as the PXE server, assigns the VM IPs by MAC address and provides the DNS configuration). Example dnsmasq config:

dhcp-range=192.168.11.10,192.168.11.250,12h
dhcp-lease-max=25
dhcp-host=D2:8E:FF:B3:01:73,192.168.11.11,okd-master-1
dhcp-host=12:8E:DE:B3:01:60,192.168.11.100,okd-services
dhcp-option=option:dns-server,192.168.11.100,1.1.1.1
dhcp-boot=pxelinux.0,,192.168.11.100
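For completeness, the records OKD needs can be verified against this resolver, for example (sketch; placeholders for the templated cluster id and domain, 192.168.11.100 is okd-services):

# illustrative checks against the okd-services DNS
dig +short api.<clusterid>.<domain> @192.168.11.100
dig +short api-int.<clusterid>.<domain> @192.168.11.100
dig +short console-openshift-console.apps.<clusterid>.<domain> @192.168.11.100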

I have a problem with the authentication operator (503) and the console cannot contact oauth (timeout). Very similar to https://github.com/okd-project/okd/issues/430
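For reference, the oauth route can also be probed directly through the LB/ingress path, for example (sketch; the hostname is this cluster's apps domain):

# end-to-end check of the oauth route through the load balancer and ingress routers
curl -kv https://oauth-openshift.apps.cloud.soyouweb.fr/healthz
# and the route / pods behind it
oc get route oauth-openshift -n openshift-authentication
oc get pods -n openshift-authentication -o wide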

My bad. The archive : must-gather.local.1194632374733407674.tar.gz

KvnOnWeb commented 1 year ago

Tried with networkType: OVNKubernetes. Same result.
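For reference, the networking stanza with that change would look like this (sketch; assumes the rest of the install-config above is unchanged):

networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16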

must-gather archive : must-gather.local.7911081472470111102.tar.gz

vrutkovs commented 1 year ago

Get \"https://oauth-openshift.apps.cloud.soyouweb.fr/healthz\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\nOAuthServerServiceEndpointAccessibleControllerDegraded: Get \"https://172.30.138.239:443/healthz\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

and

"WellKnownReadyControllerProgressing: kube-apiserver oauth endpoint https://192.168.12.12:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)"

in the authentication operator, as kube-apiserver never rolled out:

"NodeInstallerProgressing: 1 nodes are at revision 0; 1 nodes are at revision 3; 1 nodes are at revision 8"

installer pods cannot be created:

"Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-8-okd-master-1_openshift-kube-apiserver_62b703a9-0b89-4801-92dc-a6af80a6b611_0(c09814543ff0b731abf40775238958a55b55b10d5c1059d5b47b070163b5d453): error adding pod openshift-kube-apiserver_installer-8-okd-master-1 to CNI network \"multus-cni-network\": plugin type=\"multus\" name=\"multus-cni-network\" failed (add): [openshift-kube-apiserver/installer-8-okd-master-1/62b703a9-0b89-4801-92dc-a6af80a6b611:openshift-sdn]: error adding container to network \"openshift-sdn\": CNI request failed with status 400: 'the server was unable to return a response in the time allotted, but may still be processing the request (get pods installer-8-okd-master-1)\n'"

That looks like a weird networking issue.
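Possible checks on the CNI side (sketch; openshift-sdn applies to the OpenShiftSDN attempt, openshift-ovn-kubernetes to the OVN one):

# are the CNI daemons healthy on every node?
oc get pods -n openshift-sdn -o wide
oc get pods -n openshift-multus -o wide
# recent logs from the SDN daemonset
oc logs -n openshift-sdn ds/sdn -c sdn --tail=100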

Also, the must-gather is incomplete; it seems to be using some user cert and didn't fetch a lot of data from the openshift-* namespaces. Could you check whether it works using the installer-generated kubeconfig? Also try export KUBECONFIG=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost.kubeconfig on the masters.
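i.e. roughly this (sketch; the install-dir path is illustrative and the oc binary is assumed to be available wherever it runs):

# option 1: from okd-services, with the installer-generated admin kubeconfig
export KUBECONFIG=<install-dir>/auth/kubeconfig
oc adm must-gather --dest-dir=/tmp/must-gather

# option 2: directly on a master, with the node-local kubeconfig
sudo -i
export KUBECONFIG=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost.kubeconfig
oc get clusteroperators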

KvnOnWeb commented 1 year ago

Thanks. I think so too, but I haven't found the problem.

This must-gather should be complete: https://drive.google.com/file/d/1LMTX6oV4Mz7n5h1rQ5RSIog8G4ccNP_Q/view?usp=share_link

melledouwsma commented 1 year ago

I have 6 Proxmox servers connected by a private mesh VPN (WireGuard). Each server is on its own 192.168.1X.1/24 network.

Are you sure the private mesh VPN between the hypervisors isn't interfering here? I'm not 100% sure, but the must-gather does indicate to me some kind of peculiar networking issue outside of the cluster.
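One thing that may be worth checking (a guess on my part): WireGuard lowers the usable MTU on the inter-hypervisor links, and the cluster overlay adds its own encapsulation on top, so large packets between VMs on different hypervisors could be silently dropped. A quick path-MTU test between two hypervisors, for example:

# ping across the WireGuard mesh with "don't fragment" set;
# lower -s until it succeeds to find the usable MTU
ping -M do -s 1400 -c 3 192.168.12.1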

KvnOnWeb commented 1 year ago

I installed a new cluster on a single Proxmox server and it works. I am abandoning the installation across multiple Proxmox servers with the private mesh network. Thanks for your help.