Open umohnani8 opened 3 weeks ago
@cheesesashimi @inesqyx PTAL.
I am not completely sure if I am on the right track, this is where it seems we check whether OCL is enabled via the feature gate.
Here are some ideas to consider:
The osrelease
wrapper works by looking at the /etc/os-release
file that is present on the filesystem. In the MCD, this is fine because the MCD chroots into the nodes' filesystem and executes within that context. However, running it within the operator would read the /etc/os-release
file that's present within the MCO's container image and not the one on the underlying host OS.
Each corev1.Node
object has some information in its status field that we can use to figure out what the underlying node OS is. We could potentially extend our osrelease
wrapper to infer all of this information (or at least as much as possible) from the corev1.Node
objects in a given MachineConfigPool. I'm not sure what this looks like for a RHEL node, but here's what it looks like for an RHCOS node (we'd also need to consider OKD FCOS and OKD SCOS as well):
$ oc get node ip-10-0-27-18.ec2.internal -o yaml | yq '.status.nodeInfo'
architecture: amd64
bootID: 63e74981-e918-46fc-b9c4-99374a5a620e
containerRuntimeVersion: cri-o://1.29.4-6.rhaos4.16.git0e93ae2.el9
kernelVersion: 5.14.0-427.20.1.el9_4.x86_64
kubeProxyVersion: v1.30.1+9798e19
kubeletVersion: v1.30.1+9798e19
machineID: ec28d02b7ad61b67b62f3681741924b9
operatingSystem: linux
osImage: Red Hat Enterprise Linux CoreOS 417.94.202406050030-0
systemUUID: ec28d02b-7ad6-1b67-b62f-3681741924b9
Then, within the MCO, whenever we get a new MachineOSConfig, we can get a list of all of the nodes for the targeted MachineConfigPool, pass them into our modified osrelease
wrapper which examines that field on each node, and degrade if we find a non-CoreOS node.
Thanks @cheesesashimi, I have updated the PR based on your review to grab the osImage info from corev1.Node
an do a string comparison to check if it is a coreos based OS or not. PTAL
@sergiordlr can you please help verify we error out when adding a non-coreos based node.
@umohnani8: This pull request references Jira Issue OCPBUGS-19537, which is invalid:
Comment /jira refresh
to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.
The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@umohnani8: This pull request references Jira Issue OCPBUGS-19537, which is valid. The bug has been moved to the POST state.
Requesting review from QA contact: /cc @sergiordlr
/retest
@umohnani8: all tests passed!
Full PR test history. Your PR dashboard.
This is ready, @sergiordlr PTAL
Pre-merge verified -
$ oc get clusterversions.config.openshift.io
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.17.0-0.ci.test-2024-07-15-054707-ci-ln-6p4dlvt-latest True False 9m11s Cluster version is 4.17.0-0.ci.test-2024-07-15-054707-ci-ln-6p4dlvt-latest
Added RHEL nodes using ocp4-scale-runner flexy job
$ oc get node
NAME STATUS ROLES AGE VERSION
ptalgulk-1507-tc-6xndc-master-0 Ready control-plane,master 47m v1.30.1+9798e19
ptalgulk-1507-tc-6xndc-master-1 Ready control-plane,master 48m v1.30.1+9798e19
ptalgulk-1507-tc-6xndc-master-2 Ready control-plane,master 48m v1.30.1+9798e19
ptalgulk-1507-tc-6xndc-w-a-l-rhel-0 Ready worker 6m23s v1.30.2
ptalgulk-1507-tc-6xndc-w-a-l-rhel-1 Ready worker 6m17s v1.30.2
ptalgulk-1507-tc-6xndc-worker-a-8k6gq Ready worker 34m v1.30.1+9798e19
ptalgulk-1507-tc-6xndc-worker-b-tbpcw Ready worker 36m v1.30.1+9798e19
ptalgulk-1507-tc-6xndc-worker-c-h8mdb Ready worker 35m v1.30.1+9798e19
Enabled and applied OCB on custom MCP rhel node
oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: MachineOSConfig
metadata:
name: check
spec:
buildInputs:
baseImagePullSecret:
name: global-pull-secret-copy
containerFile:
- containerfileArch: noarch
content: RUN echo 'test' > /etc/test-image.txt
imageBuilder:
imageBuilderType: PodImageBuilder
renderedImagePushSecret:
name: $(oc get -n openshift-machine-config-operator sa builder -o jsonpath='{.secrets[0].name}')
renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest"
buildOutputs:
currentImagePullSecret:
name: global-pull-secret-copy
machineConfigPool:
name: check
EOF
We are able to see error logs message in operator instead of RHEL node logs and MCP is not in degraded state
$ oc get co machine-config
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
machine-config 4.17.0-0.ci.test-2024-07-15-054707-ci-ln-6p4dlvt-latest True False True 79m Failed to resync 4.17.0-0.ci.test-2024-07-15-054707-ci-ln-6p4dlvt-latest because: on-cluster layering is only supported on CoreOS-based nodes: non-CoreOS OS "Red Hat Enterprise Linux 8.9 (Ootpa)" detected on node "ptalgulk-1507-tc-6xndc-w-a-l-rhel-0" in MachineConfigPool(s) ["check" "worker"]
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
check rendered-check-679dc8cb31540add3d08dca713fcc6e6 True False False 1 1 1 0 8m7s
master rendered-master-f77f9da088d8b1313cc504456aa675fd True False False 3 3 3 0 81m
worker rendered-worker-679dc8cb31540add3d08dca713fcc6e6 True False False 4 4 4 0 81m
adding qe-approved label
/label qe-approved
Thankyou!
@umohnani8: This pull request references Jira Issue OCPBUGS-19537, which is valid.
No GitHub users were found matching the public email listed for the QA contact in Jira (ptalgulk@redhat.com), skipping review request.
@cheesesashimi PTAL at the test results above. This is ready to be merged!
Nice work!
/lgtm /approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: cheesesashimi, umohnani8
The full list of commands accepted by this bot can be found here.
The pull request process is described here
@yuqi-zhang @cheesesashimi any idea how to get the label needed to merge this - it wants the acknowledge-critical-fixes-only
label.
We can apply this label if we want this bug to merge this week, anyone should have the ability. The label was added as a requirement, since we wanted to be more stable for shift week.
The requirement should drop off by next week and this can merge normally as well.
If a non-coreos based node, such as RHEL8, is added to a cluster and the user tries to enable on cluster builds, throw an error as that is not supported on non-coreos based nodes.
Fixes https://issues.redhat.com/browse/OCPBUGS-19537
- What I did Added a check for when OCB is enabled to verify which OS the nodes have and to fail if they are non-coreos based.
- How to verify it Start a cluster, add a RHEL node to it and then try enabling OCB
- Description for the changelog
Throw an error if cluster has non-coreos node when trying to enable OCB