openshift / cluster-node-tuning-operator

Manage node-level tuning by orchestrating the tuned daemon.
Apache License 2.0
102 stars 105 forks source link

[release-4.17] OCPBUGS-43664: Add vendor and architecture specific tuning options #1191

Open MarSik opened 1 month ago

MarSik commented 1 month ago

OCPBUGS-43666: Fix kernel arguments ordering on Intel

An upgrade from previous version causes one extra reboot due to differently ordered kernel arguments. This is a side effect of platform specific tuned profile split we merged in https://github.com/openshift/cluster-node-tuning-operator/pull/1083

This fix updates the Intel specific tuned profile to follow the same ordering that was used in the past.

It does so by exploiting a specific tuned behavior of the bootloader plugin. It orders the kernel argument cmdline_suffix keys based on the order of first appearance. Any additional appearance just changes the value, but not the ordering.

The change is only needed for Intel, because we have never supported other platforms before and so upgrade is not an issue.

openshift-ci-robot commented 1 month ago

@MarSik: This pull request references CNF-14090 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.17." or "openshift-4.17.", but it targets "openshift-4.18" instead.

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/1191): >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Add vendor and architecture specific tuning options >- Performance tuning support for 3 platforms (amd/x86,arm/aarch64,intel/x86) is added in this change >- When a valid platform is detected the additional platform specific tuning will be imported alongside the default tuning >- This makes use of a new helper function added to tuned to detect the system name and architecture >- Update unit tests to account for the various changes >- Add new unit tests to cover the platform specific tuning > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Re-sync e2e test yaml for tuning changes > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Use variable composition for idle_poll >- idle=poll is only supported on x86 >- Update tests to account for changes >- Add explaination comments to empty values in openshift-node-performance > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Re-sync e2e test yaml for tuning changes > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Fix active/passive pstates > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Re-sync e2e test yaml for tuning changes Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
openshift-ci[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MarSik

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openshift/cluster-node-tuning-operator/blob/release-4.17/OWNERS)~~ [MarSik] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
MarSik commented 1 month ago

/jira cherry-pick CNF-14090

openshift-ci-robot commented 1 month ago

@MarSik: Ignoring requests to cherry-pick non-bug issues: CNF-14090

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/1191#issuecomment-2428746584): >/jira cherry-pick [CNF-14090](https://issues.redhat.com//browse/CNF-14090) Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
MarSik commented 1 month ago

/jira cherry-pick OCPBUGS-43660

openshift-ci-robot commented 1 month ago

@MarSik: Jira Issue OCPBUGS-43660 has been cloned as Jira Issue OCPBUGS-43664. Will retitle bug to link to clone. /retitle OCPBUGS-43664: CNF-14090: Add vendor and architecture specific tuning options (#1083)

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/1191#issuecomment-2428755833): >/jira cherry-pick OCPBUGS-43660 Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
openshift-ci-robot commented 1 month ago

@MarSik: This pull request references Jira Issue OCPBUGS-43664, which is invalid:

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/1191): >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Add vendor and architecture specific tuning options >- Performance tuning support for 3 platforms (amd/x86,arm/aarch64,intel/x86) is added in this change >- When a valid platform is detected the additional platform specific tuning will be imported alongside the default tuning >- This makes use of a new helper function added to tuned to detect the system name and architecture >- Update unit tests to account for the various changes >- Add new unit tests to cover the platform specific tuning > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Re-sync e2e test yaml for tuning changes > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Use variable composition for idle_poll >- idle=poll is only supported on x86 >- Update tests to account for changes >- Add explaination comments to empty values in openshift-node-performance > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Re-sync e2e test yaml for tuning changes > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Fix active/passive pstates > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Re-sync e2e test yaml for tuning changes Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
MarSik commented 1 month ago

/jira cherry-pick OCPBUGS-43665

openshift-ci-robot commented 1 month ago

@MarSik: Jira Issue OCPBUGS-43665 has been cloned as Jira Issue OCPBUGS-43666. Will retitle bug to link to clone. /retitle OCPBUGS-43666: OCPBUGS-43664: CNF-14090: Add vendor and architecture specific tuning options (#1083)

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/1191#issuecomment-2428760596): >/jira cherry-pick OCPBUGS-43665 Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
openshift-ci-robot commented 1 month ago

@MarSik: This pull request references Jira Issue OCPBUGS-43666, which is invalid:

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/1191): >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Add vendor and architecture specific tuning options >- Performance tuning support for 3 platforms (amd/x86,arm/aarch64,intel/x86) is added in this change >- When a valid platform is detected the additional platform specific tuning will be imported alongside the default tuning >- This makes use of a new helper function added to tuned to detect the system name and architecture >- Update unit tests to account for the various changes >- Add new unit tests to cover the platform specific tuning > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Re-sync e2e test yaml for tuning changes > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Use variable composition for idle_poll >- idle=poll is only supported on x86 >- Update tests to account for changes >- Add explaination comments to empty values in openshift-node-performance > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Re-sync e2e test yaml for tuning changes > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Fix active/passive pstates > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Re-sync e2e test yaml for tuning changes Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
MarSik commented 1 month ago

/jira refresh

openshift-ci-robot commented 1 month ago

@MarSik: This pull request references Jira Issue OCPBUGS-43666, which is invalid:

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/1191#issuecomment-2428767269): >/jira refresh Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
openshift-ci-robot commented 1 month ago

@MarSik: This pull request references Jira Issue OCPBUGS-43664, which is invalid:

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/1191): >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Add vendor and architecture specific tuning options >- Performance tuning support for 3 platforms (amd/x86,arm/aarch64,intel/x86) is added in this change >- When a valid platform is detected the additional platform specific tuning will be imported alongside the default tuning >- This makes use of a new helper function added to tuned to detect the system name and architecture >- Update unit tests to account for the various changes >- Add new unit tests to cover the platform specific tuning > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Re-sync e2e test yaml for tuning changes > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Use variable composition for idle_poll >- idle=poll is only supported on x86 >- Update tests to account for changes >- Add explaination comments to empty values in openshift-node-performance > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Re-sync e2e test yaml for tuning changes > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Fix active/passive pstates > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Re-sync e2e test yaml for tuning changes Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
MarSik commented 1 month ago

/jira refresh

openshift-ci-robot commented 1 month ago

@MarSik: This pull request references Jira Issue OCPBUGS-43664, which is invalid:

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/1191#issuecomment-2428776854): >/jira refresh Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
MarSik commented 1 month ago

/jira refresh

openshift-ci-robot commented 1 month ago

@MarSik: This pull request references Jira Issue OCPBUGS-43664, which is invalid:

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/1191#issuecomment-2428780570): >/jira refresh Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
bartwensley commented 1 month ago

@MarSik - are you sure you want to cherry-pick this now? I think it would make sense to wait for at least some degree of QE to happen for this on OCP 4.18 before the cherry-pick - given that it changes kernel parameters and such, there is potential for significant breakage if something is wrong.

MarSik commented 1 month ago

@bartwensley QE approval of 4.18 is needed to merge this indeed. But I want to be ready.

MarSik commented 1 month ago

@bartwensley Btw, the idea is that this does not change any kernel arguments for Intel. Just for the other new platforms. And it is easier to verify with the PR posted (clusterbot can build a PR, yesterday it gave me an AMD VM though..)

fontivan commented 1 month ago

/retest

fontivan commented 1 month ago

4.17 still needs backports to fix CI failures, will need to be rebased once both of these PRs are merged

fontivan commented 1 month ago

/retitle [release-4.17] OCPBUGS-43664: Add vendor and architecture specific tuning options

MarSik commented 3 weeks ago

/retest-required

MarSik commented 3 weeks ago

/label backport-risk-assessed

MarSik commented 3 weeks ago

/retest-required

openshift-ci[bot] commented 3 weeks ago

@MarSik: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
openshift-ci-robot commented 2 weeks ago

@MarSik: This pull request references Jira Issue OCPBUGS-43664, which is invalid:

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/1191): >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Add vendor and architecture specific tuning options >- Performance tuning support for 3 platforms (amd/x86,arm/aarch64,intel/x86) is added in this change >- When a valid platform is detected the additional platform specific tuning will be imported alongside the default tuning >- This makes use of a new helper function added to tuned to detect the system name and architecture >- Update unit tests to account for the various changes >- Add new unit tests to cover the platform specific tuning > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Re-sync e2e test yaml for tuning changes > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Use variable composition for idle_poll >- idle=poll is only supported on x86 >- Update tests to account for changes >- Add explaination comments to empty values in openshift-node-performance > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Re-sync e2e test yaml for tuning changes > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Fix active/passive pstates > >* [CNF-14090](https://issues.redhat.com//browse/CNF-14090): Re-sync e2e test yaml for tuning changes > >* OCPBUGS-43665: Drop amd_iommu=on from amd tuning >- "=on" is not a valid value for amd_iommu >- amd_iommu is enabled by default unless you specify "amd_iommu=off", unlike intel >- See kernel docs for more information (https://docs.kernel.org/admin-guide/kernel-parameters.html) > >OCPBUGS-43666: Fix kernel arguments ordering on Intel > >An upgrade from previous version causes one extra reboot >due to differently ordered kernel arguments. This is >a side effect of platform specific tuned profile split >we merged in https://github.com/openshift/cluster-node-tuning-operator/pull/1083 > >This fix updates the Intel specific tuned profile to >follow the same ordering that was used in the past. > >It does so by exploiting a specific tuned behavior >of the bootloader plugin. It orders the kernel argument >cmdline_suffix keys based on the order of first appearance. >Any additional appearance just changes the value, but >not the ordering. > >The change is only needed for Intel, because we have >never supported other platforms before and so upgrade >is not an issue. Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.