stratokumulus / proxmox-openshift-setup


Missing Details #1

Closed. RickBankers closed this issue 1 year ago.

RickBankers commented 1 year ago

I also posted this on the reddit thread and wasn't sure if this was a better place.

Thanks so much for making this. I am also working on openshift on proxmox. I have cloned the repo and I'm a little confused on the following:

It would be AWESOME if you could do a step-by-step video on getting the pre-work set up for this and then the actual deployment. I'm sure I and others would find it very useful, and it would help me identify the missing pieces.

RickBankers commented 1 year ago

The playbook ran perfectly! The control nodes came up and provisioned correctly. Then I ran the openshift-install --dir=install_dir/ wait-for bootstrap-complete --log-level=info and that's where it failed. I also looked at the install-config.yaml and was wondering if this is correct?

networking:
  machineNetwork:
  - cidr: 192.168.2.0/24    # May not be necessary, but it's a leftover from my tests ... will
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16

Error after running: openshift-install --dir=install_dir/ wait-for bootstrap-complete --log-level=info

ERROR Cluster operator authentication Degraded is True with APIServerDeployment_PreconditionNotFulfilled::IngressStateEndpoints_MissingSubsets::OAuthAPIServerConfigObservation_Error::OAuthServerServiceEndpointAccessibleController_SyncError::OAuthServerServiceEndpointsEndpointAccessibleController_SyncError: APIServerDeploymentDegraded: waiting for observed configuration to have mandatory apiServerArguments.etcd-servers

ERROR APIServerDeploymentDegraded:

ERROR IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server

ERROR OAuthAPIServerConfigObservationDegraded: configmaps openshift-etcd/etcd-endpoints: no etcd endpoint addresses found

ERROR OAuthServerServiceEndpointAccessibleControllerDegraded: Get "https://172.30.188.113:443/healthz": dial tcp 172.30.188.113:443: connect: connection refused

ERROR OAuthServerServiceEndpointsEndpointAccessibleControllerDegraded: oauth service endpoints are not ready

ERROR Cluster operator authentication Available is False with APIServerDeployment_PreconditionNotFulfilled::APIServices_PreconditionNotReady::OAuthServerServiceEndpointAccessibleController_EndpointUnavailable::ReadyIngressNodes_NoReadyIngressNodes: APIServicesAvailable: PreconditionNotReady

ERROR OAuthServerServiceEndpointAccessibleControllerAvailable: Get "https://172.30.188.113:443/healthz": dial tcp 172.30.188.113:443: connect: connection refused

ERROR ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).

INFO Cluster operator baremetal Disabled is False with :

INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected

INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected

INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected

INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected

ERROR Cluster operator dns Available is False with DNSUnavailable: DNS "default" is unavailable.

INFO Cluster operator dns Progressing is Unknown with DNSDoesNotReportProgressingStatus: DNS "default" is not reporting a Progressing status condition

INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required

ERROR Cluster operator etcd Degraded is True with EtcdEndpoints_ErrorUpdatingEtcdEndpoints::GuardController_SyncError: EtcdEndpointsDegraded: no etcd members are present

ERROR GuardControllerDegraded: [Missing operand on node master1, Missing operand on node master2]

INFO Cluster operator etcd Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 2

ERROR Cluster operator etcd Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 2

ERROR Cluster operator ingress Available is Unknown with IngressDoesNotHaveAvailableCondition: The "default" ingress controller is not reporting an Available status condition.

INFO Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.

ERROR Cluster operator ingress Degraded is Unknown with IngressDoesNotHaveDegradedCondition: The "default" ingress controller is not reporting a Degraded status condition.

INFO Cluster operator ingress EvaluationConditionsDetected is False with AsExpected:

ERROR Cluster operator kube-apiserver Degraded is True with ConfigObservation_Error::GuardController_SyncError: ConfigObservationDegraded: configmaps openshift-etcd/etcd-endpoints: no etcd endpoint addresses found

ERROR GuardControllerDegraded: [Missing operand on node master0, Missing operand on node master1, Missing operand on node master2]

INFO Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 4

ERROR Cluster operator kube-apiserver Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 4

ERROR Cluster operator kube-controller-manager Degraded is True with GarbageCollector_Error::GuardController_SyncError::StaticPods_Error: GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.0.10:53: read udp 10.128.0.5:53969->172.30.0.10:53: read: connection refused

ERROR GuardControllerDegraded: [Missing operand on node master2, Missing operand on node master0]

ERROR StaticPodsDegraded: pod/kube-controller-manager-master1 container "cluster-policy-controller" is waiting: ContainerCreating:

ERROR StaticPodsDegraded: pod/kube-controller-manager-master1 container "kube-controller-manager" is waiting: ContainerCreating:

ERROR StaticPodsDegraded: pod/kube-controller-manager-master1 container "kube-controller-manager-cert-syncer" is waiting: ContainerCreating:

ERROR StaticPodsDegraded: pod/kube-controller-manager-master1 container "kube-controller-manager-recovery-controller" is waiting: ContainerCreating:

INFO Cluster operator kube-controller-manager Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 6

ERROR Cluster operator kube-controller-manager Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 6

ERROR Cluster operator kube-scheduler Degraded is True with GuardController_SyncError: GuardControllerDegraded: [Missing operand on node master2, Missing operand on node master1]

INFO Cluster operator kube-scheduler Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 5

ERROR Cluster operator kube-scheduler Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 5

INFO Cluster operator machine-config Progressing is True with : Working towards 4.12.0-0.okd-2023-02-18-033438

INFO Cluster operator network ManagementStateDegraded is False with :

INFO Cluster operator network Progressing is True with Deploying: DaemonSet "/openshift-multus/network-metrics-daemon" is waiting for other operators to become ready

INFO Deployment "/openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready

ERROR Cluster operator node-tuning Available is False with TunedUnavailable: DaemonSet "tuned" has no available Pod(s)

INFO Cluster operator node-tuning Progressing is True with ProfileProgressing: Waiting for 3/3 Profiles to be applied

ERROR Cluster operator openshift-apiserver Degraded is True with APIServerDeployment_PreconditionNotFulfilled::ConfigObservation_Error: APIServerDeploymentDegraded: waiting for observed configuration to have mandatory StorageConfig.URLs

ERROR APIServerDeploymentDegraded:

ERROR ConfigObservationDegraded: configmaps openshift-etcd/etcd-endpoints: no etcd endpoint addresses found

ERROR Cluster operator openshift-apiserver Available is False with APIServerDeployment_PreconditionNotFulfilled::APIServices_PreconditionNotReady: APIServicesAvailable: PreconditionNotReady

INFO Cluster operator openshift-controller-manager Progressing is True with _DesiredStateNotYetAchieved: Progressing: deployment/controller-manager: observed generation is 3, desired generation is 5.

INFO Progressing: deployment/controller-manager: available replicas is 0, desired available replicas > 1

INFO Progressing: deployment/controller-manager: updated replicas is 1, desired replicas is 3

INFO Progressing: deployment/route-controller-manager: observed generation is 2, desired generation is 4.

INFO Progressing: deployment/route-controller-manager: available replicas is 0, desired available replicas > 1

INFO Progressing: deployment/route-controller-manager: updated replicas is 1, desired replicas is 3

ERROR Cluster operator openshift-controller-manager Available is False with _NoPodsAvailable: Available: no pods available on any node.

ERROR Cluster operator operator-lifecycle-manager-packageserver Available is False with :

INFO Cluster operator operator-lifecycle-manager-packageserver Progressing is True with : Working toward 0.19.0

INFO Use the following commands to gather logs from the cluster

INFO openshift-install gather bootstrap --help

ERROR Bootstrap failed to complete: timed out waiting for the condition

ERROR Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane. 
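For reference, the gather step the installer points to expands to something like this; a sketch, where the node IPs are placeholders for the bootstrap and master addresses in your environment:

```sh
# Sketch: pull bootstrap and control plane logs for debugging (IPs are placeholders)
openshift-install gather bootstrap --dir=install_dir/ \
  --bootstrap 192.168.2.200 \
  --master 192.168.2.201 --master 192.168.2.202 --master 192.168.2.203
```
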
stratokumulus commented 1 year ago

As commented in the install config file, the optional machineNetwork setting works for me, but I'm not sure whether it was actually required in my case or just ignored entirely by the bootstrapping. I'll re-run my install scripts, but I won't be able to do that before the second half of April.

You may want to either comment out that line or change it to the subnet used by your installation. The VLAN where my OKD install is deployed is 192.168.2.0/24 ...
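For example, if your nodes boot in a different subnet, the entry would change to something like this (the CIDR below is a placeholder):

```yaml
networking:
  machineNetwork:
  - cidr: 192.168.10.0/24   # placeholder: the subnet your nodes actually live in
```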

Also, my very first attempts were on a machine with very slow disks, and I never managed to finalize the install (NTSET: Never The Same Error Twice). I now have a Dell R630 with HW RAID, and that's what made the difference in my case.
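As a side note, a rough way to check whether disks are fast enough for etcd is the fsync latency test from the etcd docs; a sketch, with the target directory as a placeholder:

```sh
# Approximate etcd disk check: write 22 MiB in 2300-byte blocks, fdatasync after each write
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd-bench \
    --size=22m --bs=2300 --name=etcd-fsync-test
```

The commonly cited guidance is that the 99th percentile fdatasync latency should stay below roughly 10 ms for etcd to be happy.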

RickBankers commented 1 year ago

Getting a little further. Had to modify the playbook a bit to fix a few issues. Now when I boot the master nodes and run openshift-install --dir=install_dir/ wait-for bootstrap-complete --log-level=info, the master nodes seem to start provisioning and things look good. They reboot a couple of times and look fine, but the OpenShift install fails. I logged into one of the master nodes after it looked like it had been provisioned, and there was no network at all. Somehow, during provisioning, OpenShift doesn't configure networking on the masters correctly. Any ideas on how to fix this? Is this a QEMU/Proxmox issue? Should I not be using virtio for the network cards, etc.?
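When a node comes up with no network, a quick check from the Proxmox console might look like this; a sketch, assuming FCOS's NetworkManager is managing the NICs:

```sh
# On the affected master, via the Proxmox noVNC console
nmcli device status                                    # is the NIC recognized and managed?
ip addr show                                           # did any interface get an address?
journalctl -u NetworkManager --no-pager | tail -n 50   # recent DHCP attempts and errors
```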

stratokumulus commented 1 year ago

What you're seeing seems highly familiar ... Maybe try spinning up a pfSense instance, just to create the DHCP reservations for the nodes' MAC addresses (that's what I ended up doing, but my code isn't updated yet, as I don't have a way to do so right now), and remove the DHCP installation and config sections from the playbook.

I noticed in some of my tests that isc-dhcp-server failed to give the reserved addresses back to the servers, usually after a few reboots. I sniffed the packets, to no avail :/ I have no clue as to why, but pfSense solved it (until I can drill down into the issue and find the root cause of this behaviour).
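For reference, a static reservation in isc-dhcp-server's dhcpd.conf looks roughly like this (hostname, MAC, and address are placeholders):

```conf
# Hypothetical reservation for one control plane node
host master0 {
  hardware ethernet aa:bb:cc:dd:ee:01;  # MAC of the VM's NIC in Proxmox
  fixed-address 192.168.2.201;          # address the node should always receive
}
```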

RickBankers commented 1 year ago

I do have a pfSense box running as a bridge between vlan2 and my primary network. It sets the DHCP reservation for the initial boot of the okd-services VM. Once the okd-services VM boots and the Ansible playbook runs, I disable that DHCP, since okd-services should be handing out addresses. Things seem to work great, and the master nodes provision and reboot twice. The last reboot seems to be the issue: they come up with no networking set. Are the Ignition files supposed to set the networking on the master nodes using the 172.x address range? Are they supposed to be static? The okd-services VM isn't set up for the 172.x network.

RickBankers commented 1 year ago

Everything is running now. I had to do a bunch of things to get it working. I will provide updates after I've tested and validated my changes. Thanks again for the work on this.