okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

Hypershift cluster fails when combining OKD and FCOS #1767

Open itmwiw opened 8 months ago

itmwiw commented 8 months ago

I am attempting to use Hypershift to provision an OKD cluster. The OKD hosted cluster installs successfully up to a certain point: the control plane's pods are 'Running', the node pool's VMs are provisioned, and the Nodes are marked as 'Ready'. Furthermore, when I run 'oc get co', all operators are displayed as 'AVAILABLE':

NAME                                       VERSION                          AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
console                                    4.13.0-0.okd-2023-09-30-084937   True        False         False      16s
csi-snapshot-controller                    4.13.0-0.okd-2023-09-30-084937   True        False         False      50m
dns                                        4.13.0-0.okd-2023-09-30-084937   True        False         False      2m12s
image-registry                             4.13.0-0.okd-2023-09-30-084937   True        False         False      2m25s
ingress                                    4.13.0-0.okd-2023-09-30-084937   True        False         False      2m7s
insights                                   4.13.0-0.okd-2023-09-30-084937   True        False         False      3m20s
kube-apiserver                             4.13.0-0.okd-2023-09-30-084937   True        False         False      50m
kube-controller-manager                    4.13.0-0.okd-2023-09-30-084937   True        False         False      50m
kube-scheduler                             4.13.0-0.okd-2023-09-30-084937   True        False         False      50m
kube-storage-version-migrator              4.13.0-0.okd-2023-09-30-084937   True        False         False      2m43s
monitoring                                 4.13.0-0.okd-2023-09-30-084937   True        False         False      66s
network                                    4.13.0-0.okd-2023-09-30-084937   True        False         False      3m39s
node-tuning                                4.13.0-0.okd-2023-09-30-084937   True        False         False      6m48s
openshift-apiserver                        4.13.0-0.okd-2023-09-30-084937   True        False         False      50m
openshift-controller-manager               4.13.0-0.okd-2023-09-30-084937   True        False         False      50m
openshift-samples                          4.13.0-0.okd-2023-09-30-084937   True        False         False      111s
operator-lifecycle-manager                 4.13.0-0.okd-2023-09-30-084937   True        False         False      50m
operator-lifecycle-manager-catalog         4.13.0-0.okd-2023-09-30-084937   True        False         False      50m
operator-lifecycle-manager-packageserver   4.13.0-0.okd-2023-09-30-084937   True        False         False      50m
service-ca                                 4.13.0-0.okd-2023-09-30-084937   True        False         False      3m18s
storage                                    4.13.0-0.okd-2023-09-30-084937   True        False         False      50m

However, I still encounter an error related to the '99-okd-master-disable-mitigations' machineconfig. The exact error is as follows:

E1020 15:06:53.905739       1 sync_worker.go:652] unable to synchronize image (waiting 2m36.124787484s): Multiple errors are preventing progress:
* Could not update machineconfig "99-okd-master-disable-mitigations" (418 of 584): the server does not recognize this resource, check extension API servers
* Could not update machineconfig "99-okd-master-disable-mitigations" (451 of 584): the server does not recognize this resource, check extension API servers

This seems to be specific to OKD FCOS; it doesn't happen on OKD SCOS.
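
For anyone triaging a similar log, here is a minimal sketch that pulls the failing resources out of the "Could not update" bullets quoted above. The regular expression is inferred only from the two lines in this report, so treat the line format as an assumption; `parse_failures` and `unique_resources` are hypothetical helper names, not part of any OKD tooling.

```python
import re

# Matches bullet lines of the form seen in the log above:
#   * Could not update <kind> "<name>" (<index> of <total>): <reason>
# The format is inferred from this report only and may differ elsewhere.
FAILURE_RE = re.compile(
    r'^\*\s+Could not update (?P<kind>\S+) "(?P<name>[^"]+)" '
    r'\((?P<index>\d+) of (?P<total>\d+)\): (?P<reason>.+)$'
)


def parse_failures(log_text):
    """Return one dict per 'Could not update' bullet found in log_text."""
    failures = []
    for line in log_text.splitlines():
        match = FAILURE_RE.match(line.strip())
        if match:
            entry = match.groupdict()
            entry["index"] = int(entry["index"])
            entry["total"] = int(entry["total"])
            failures.append(entry)
    return failures


def unique_resources(failures):
    """Collapse repeated failures into a sorted list of (kind, name) pairs."""
    return sorted({(f["kind"], f["name"]) for f in failures})


# The two bullets from the error quoted in this issue.
SAMPLE = '''
* Could not update machineconfig "99-okd-master-disable-mitigations" (418 of 584): the server does not recognize this resource, check extension API servers
* Could not update machineconfig "99-okd-master-disable-mitigations" (451 of 584): the server does not recognize this resource, check extension API servers
'''

if __name__ == "__main__":
    for kind, name in unique_resources(parse_failures(SAMPLE)):
        print(kind, name)
```

In this report both bullets collapse to a single resource, which suggests the same machineconfig name appears at two positions (418 and 451) among the 584 payload manifests.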

keremceliker commented 6 months ago

Hey there,

Sometimes, issues with machineconfigs can be related to kernel modules or configurations. Ensure that the necessary kernel modules are loaded on your FCOS nodes. Also check whether the modules required for machineconfig synchronization (such as bridge-related modules) are available. Lastly, please review the kernel configuration to ensure it supports the required features.

There are many checks to do; please let us know if there is any update.

Kerem ÇELİKER Head of Cloud Architecture linkedin.com/in/keremceliker/

vrutkovs commented 6 months ago

It's not related to the machineconfig contents; just the fact that the payload contains MachineConfigs breaks several Hypershift assumptions. OKD FCOS should move towards setting initial kargs via other means instead of "update kernel arguments on pivot".
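
To check whether a given release payload ships MachineConfig manifests at all, something like the following rough sketch may help. It assumes the payload manifests have already been extracted to a local directory (for example with `oc adm release extract`), that they are YAML files, and that `kind:` appears at the top level of each document. `find_machineconfig_manifests` is a hypothetical helper name, and JSON manifests or objects nested inside a List are not covered.

```python
import re
from pathlib import Path

# Top-level "kind: MachineConfig" line. The end-of-line anchor keeps
# "MachineConfigPool" and similar kinds from matching.
KIND_RE = re.compile(r'^kind:\s*["\']?MachineConfig["\']?\s*$', re.MULTILINE)


def find_machineconfig_manifests(manifest_dir):
    """Return sorted file names under manifest_dir that declare a MachineConfig.

    Only .yaml/.yml files are inspected, and the match is purely textual,
    so this is a heuristic rather than a real YAML parse.
    """
    hits = []
    for path in sorted(Path(manifest_dir).rglob("*")):
        if not path.is_file() or path.suffix not in {".yaml", ".yml"}:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        if KIND_RE.search(text):
            hits.append(path.name)
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python find_mc.py <directory-with-extracted-manifests>
    for name in find_machineconfig_manifests(sys.argv[1]):
        print(name)
```

An empty result would be consistent with a payload that does not rely on MachineConfigs; any hit is a candidate for the kind of manifest described above.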

itmwiw commented 1 month ago

Hello, any news regarding this issue? Thanks a lot.

itmwiw commented 1 week ago

OKD-SCOS was a workaround to make OKD work with Hypershift. Unlike OKD-FCOS, its installation didn't require MachineConfigs and thus didn't break Hypershift's assumptions. With OKD-SCOS currently paused, it appears there is no way to use Hypershift with OKD at the moment.