openshift / machine-config-operator

Apache License 2.0

OCPBUGS-19537: OCB should fail if node is not coreos based #4442

Open umohnani8 opened 3 weeks ago

umohnani8 commented 3 weeks ago

If a non-CoreOS-based node, such as a RHEL8 node, is added to a cluster and the user tries to enable on-cluster builds, throw an error, as that is not supported on non-CoreOS-based nodes.

Fixes https://issues.redhat.com/browse/OCPBUGS-19537

- What I did: Added a check for when OCB is enabled to verify which OS the nodes have and to fail if they are non-CoreOS based.

- How to verify it: Start a cluster, add a RHEL node to it, and then try enabling OCB.

- Description for the changelog

Throw an error if the cluster has a non-CoreOS node when trying to enable OCB.

umohnani8 commented 3 weeks ago

@cheesesashimi @inesqyx PTAL.

I am not completely sure I am on the right track; this seems to be where we check whether OCL is enabled via the feature gate.

cheesesashimi commented 3 weeks ago

Here are some ideas to consider:

The osrelease wrapper works by looking at the /etc/os-release file that is present on the filesystem. In the MCD, this is fine because the MCD chroots into the node's filesystem and executes within that context. However, running it within the operator would read the /etc/os-release file that's present within the MCO's container image and not the one on the underlying host OS.

Each corev1.Node object has some information in its status field that we can use to figure out what the underlying node OS is. We could potentially extend our osrelease wrapper to infer all of this information (or at least as much as possible) from the corev1.Node objects in a given MachineConfigPool. I'm not sure what this looks like for a RHEL node, but here's what it looks like for an RHCOS node (we'd also need to consider OKD FCOS and OKD SCOS):

$ oc get node ip-10-0-27-18.ec2.internal -o yaml | yq '.status.nodeInfo'
architecture: amd64
bootID: 63e74981-e918-46fc-b9c4-99374a5a620e
containerRuntimeVersion: cri-o://1.29.4-6.rhaos4.16.git0e93ae2.el9
kernelVersion: 5.14.0-427.20.1.el9_4.x86_64
kubeProxyVersion: v1.30.1+9798e19
kubeletVersion: v1.30.1+9798e19
machineID: ec28d02b7ad61b67b62f3681741924b9
operatingSystem: linux
osImage: Red Hat Enterprise Linux CoreOS 417.94.202406050030-0
systemUUID: ec28d02b-7ad6-1b67-b62f-3681741924b9

Then, within the MCO, whenever we get a new MachineOSConfig, we can get a list of all of the nodes for the targeted MachineConfigPool, pass them into our modified osrelease wrapper which examines that field on each node, and degrade if we find a non-CoreOS node.

umohnani8 commented 3 weeks ago

Thanks @cheesesashimi, I have updated the PR based on your review to grab the osImage info from corev1.Node and do a string comparison to check whether it is a CoreOS-based OS or not. PTAL

umohnani8 commented 2 weeks ago

@sergiordlr can you please help verify that we error out when adding a non-CoreOS-based node?

openshift-ci-robot commented 2 weeks ago

@umohnani8: This pull request references Jira Issue OCPBUGS-19537, which is invalid:

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/machine-config-operator/pull/4442). Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmachine-config-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
umohnani8 commented 2 weeks ago

/jira refresh

openshift-ci-robot commented 2 weeks ago

@umohnani8: This pull request references Jira Issue OCPBUGS-19537, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug:
* bug is open, matching expected state (open)
* bug target version (4.17.0) matches configured target version for branch (4.17.0)
* bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @sergiordlr

In response to [this](https://github.com/openshift/machine-config-operator/pull/4442#issuecomment-2203929068): >/jira refresh
umohnani8 commented 2 weeks ago

/retest

openshift-ci[bot] commented 2 weeks ago

@umohnani8: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
umohnani8 commented 1 week ago

This is ready, @sergiordlr PTAL

ptalgulk01 commented 6 days ago

Pre-merge verified:

$ oc get clusterversions.config.openshift.io 
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.0-0.ci.test-2024-07-15-054707-ci-ln-6p4dlvt-latest   True        False         9m11s   Cluster version is 4.17.0-0.ci.test-2024-07-15-054707-ci-ln-6p4dlvt-latest

Added RHEL nodes using the ocp4-scale-runner flexy job.

$ oc get node
NAME                                    STATUS   ROLES                  AGE     VERSION
ptalgulk-1507-tc-6xndc-master-0         Ready    control-plane,master   47m     v1.30.1+9798e19
ptalgulk-1507-tc-6xndc-master-1         Ready    control-plane,master   48m     v1.30.1+9798e19
ptalgulk-1507-tc-6xndc-master-2         Ready    control-plane,master   48m     v1.30.1+9798e19
ptalgulk-1507-tc-6xndc-w-a-l-rhel-0     Ready    worker                 6m23s   v1.30.2
ptalgulk-1507-tc-6xndc-w-a-l-rhel-1     Ready    worker                 6m17s   v1.30.2
ptalgulk-1507-tc-6xndc-worker-a-8k6gq   Ready    worker                 34m     v1.30.1+9798e19
ptalgulk-1507-tc-6xndc-worker-b-tbpcw   Ready    worker                 36m     v1.30.1+9798e19
ptalgulk-1507-tc-6xndc-worker-c-h8mdb   Ready    worker                 35m     v1.30.1+9798e19

Enabled and applied OCB on a custom MCP containing the RHEL node:

oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: MachineOSConfig
metadata:
  name: check
spec:
  buildInputs:
    baseImagePullSecret:
      name: global-pull-secret-copy
    containerFile:
    - containerfileArch: noarch
      content: RUN echo 'test' > /etc/test-image.txt
    imageBuilder:
      imageBuilderType: PodImageBuilder
    renderedImagePushSecret:
      name: $(oc get -n openshift-machine-config-operator sa builder -o jsonpath='{.secrets[0].name}')
    renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest"
  buildOutputs:
    currentImagePullSecret:
      name: global-pull-secret-copy
  machineConfigPool:
    name: check
EOF

The error message surfaces on the machine-config operator (rather than in the RHEL node logs), and the MCP itself is not in a degraded state:

$  oc get co machine-config     
NAME             VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.17.0-0.ci.test-2024-07-15-054707-ci-ln-6p4dlvt-latest   True        False         True       79m     Failed to resync 4.17.0-0.ci.test-2024-07-15-054707-ci-ln-6p4dlvt-latest because: on-cluster layering is only supported on CoreOS-based nodes: non-CoreOS OS "Red Hat Enterprise Linux 8.9 (Ootpa)" detected on node "ptalgulk-1507-tc-6xndc-w-a-l-rhel-0" in MachineConfigPool(s) ["check" "worker"]

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
check    rendered-check-679dc8cb31540add3d08dca713fcc6e6    True      False      False      1              1                   1                     0                      8m7s
master   rendered-master-f77f9da088d8b1313cc504456aa675fd   True      False      False      3              3                   3                     0                      81m
worker   rendered-worker-679dc8cb31540add3d08dca713fcc6e6   True      False      False      4              4                   4                     0                      81m

Adding the qe-approved label.

/label qe-approved

Thank you!

openshift-ci-robot commented 6 days ago

@umohnani8: This pull request references Jira Issue OCPBUGS-19537, which is valid.

3 validation(s) were run on this bug:
* bug is open, matching expected state (open)
* bug target version (4.17.0) matches configured target version for branch (4.17.0)
* bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (ptalgulk@redhat.com), skipping review request.

In response to [this](https://github.com/openshift/machine-config-operator/pull/4442).
umohnani8 commented 6 days ago

@cheesesashimi PTAL at the test results above. This is ready to be merged!

cheesesashimi commented 5 days ago

Nice work!

/lgtm
/approve

openshift-ci[bot] commented 5 days ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheesesashimi, umohnani8

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/openshift/machine-config-operator/blob/master/OWNERS)~~ [cheesesashimi]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.
umohnani8 commented 2 days ago

@yuqi-zhang @cheesesashimi any idea how to get the label needed to merge this? It wants the acknowledge-critical-fixes-only label.

yuqi-zhang commented 2 days ago

We can apply this label if we want this bug to merge this week; anyone should have the ability to add it. The label was added as a requirement since we wanted to be more stable for shift week.

The requirement should drop off by next week, after which this can merge normally.