openshift / cluster-node-tuning-operator

Manage node-level tuning by orchestrating the tuned daemon.
Apache License 2.0
99 stars 104 forks source link

OCPBUGS-36431: Fix generated cpu mask for 512+ cpus #1131

Closed MarSik closed 2 weeks ago

MarSik commented 1 month ago

The padding size was hardcoded to support max 64 nibbles which is equivalent to 256 cpus (%064x). Any mask bigger than this that was not itself 32 cpu aligned broke the chunking algorithm.

openshift-ci[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MarSik

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openshift/cluster-node-tuning-operator/blob/master/OWNERS)~~ [MarSik] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
openshift-ci-robot commented 1 month ago

@MarSik: This pull request references Jira Issue OCPBUGS-36431, which is invalid:

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/1131): >The padding size was hardcoded to support max 64 nibbles which is equivalent to 256 cpus (%064x). Any mask bigger than this that was not itself 32 cpu aligned broke the chunking algorithm. Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
MarSik commented 1 month ago

/jira refresh

openshift-ci-robot commented 1 month ago

@MarSik: This pull request references Jira Issue OCPBUGS-36431, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug * bug is open, matching expected state (open) * bug target version (4.18.0) matches configured target version for branch (4.18.0) * bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @mrniranjan

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/1131#issuecomment-2283666259): >/jira refresh Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
yanirq commented 1 month ago

/lgtm

openshift-ci-robot commented 1 month ago

/retest-required

Remaining retests: 0 against base HEAD 6d625b5a887313ce4db97ce0c393fdca86df7022 and 2 for PR HEAD f20f45872ccc1eb04161962dd7c4fec39f57112b in total

ffromani commented 1 month ago

/test e2e-gcp-pao

ffromani commented 1 month ago

/test e2e-gcp-pao-workloadhints

ffromani commented 1 month ago

/test e2e-gcp-pao-updating-profile

ffromani commented 1 month ago

/retest

openshift-ci-robot commented 1 month ago

/retest-required

Remaining retests: 0 against base HEAD c7823f29ac3b19b28860a4cdf5cabfffcbedbad6 and 1 for PR HEAD f20f45872ccc1eb04161962dd7c4fec39f57112b in total

openshift-ci-robot commented 1 month ago

/retest-required

Remaining retests: 0 against base HEAD 3655f22656d4a3aa9f471099305dcd78a9c80320 and 0 for PR HEAD f20f45872ccc1eb04161962dd7c4fec39f57112b in total

openshift-ci-robot commented 1 month ago

/hold

Revision f20f45872ccc1eb04161962dd7c4fec39f57112b was retested 3 times: holding

Tal-or commented 1 month ago

Seems like a flake

[rfe_id:28761][performance] Updating parameters in performance profile Updating of nodeSelector parameter and node labels [It] [test_id:27484]Verifies that node is reverted to plain worker when the extra labels are removed [tier-2]
/go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go:483
  [FAILED] Expected
      <string>: # irqbalance is a daemon process that distributes interrupts across
      # CPUs on SMP systems.  The default is to rebalance once every 10
      # seconds.  This is the environment file that is specified to systemd via the
      # EnvironmentFile key in the service unit file (or via whatever method the init
      # system you're using has).

      #
      # IRQBALANCE_ONESHOT
      #    After starting, wait for ten seconds, then look at the interrupt
      #    load and balance it once; after balancing exit and do not change
      #    it again.
      #
      #IRQBALANCE_ONESHOT=

      #
      # IRQBALANCE_BANNED_CPUS
      #    64 bit bitmask which allows you to indicate which CPUs should
      #    be skipped when reblancing IRQs.  CPU numbers which have their
      #    corresponding bits set to one in this mask will not have any
      #    IRQs assigned to them on rebalance.
      #
      #IRQBALANCE_BANNED_CPUS=

      #
      # IRQBALANCE_BANNED_CPULIST
      #    The CPUs list which allows you to indicate which CPUs should
      #    be skipped when reblancing IRQs. CPU numbers in CPUs list will
      #    not have any IRQs assigned to them on rebalance.
      #
      #      The format of CPUs list is:
      #        <cpu number>,...,<cpu number>
      #      or a range:
      #        <cpu number>-<cpu number>
      #      or a mixture:
      #        <cpu number>,...,<cpu number>-<cpu number>
      #
      #IRQBALANCE_BANNED_CPULIST=

      #
      # IRQBALANCE_ARGS
      #    Append any args here to the irqbalance daemon as documented in the man
      #    page.
      #
      #IRQBALANCE_ARGS=
      IRQBALANCE_BANNED_CPUS=0
  not to contain substring
      <string>: 1
  In [It] at: /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go:513 @ 08/14/24 11:16:27.69
MarSik commented 1 month ago

/retest

MarSik commented 1 month ago

@Tal-or @mrniranjan The test failure is not a fluke. But it is unrelated to this change.

  not to contain substring
      <string>: 1
  In [It] at: /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go:513 @ 08/19/24 09:14:24.633 

The character "1" is present in one of the comments in /etc/sysconfig/irqbalance:

# irqbalance is a daemon process that distributes interrupts across
# CPUs on SMP systems.  The default is to rebalance once every 10
Tal-or commented 4 weeks ago

@Tal-or @mrniranjan The test failure is not a fluke. But it is unrelated to this change.

  not to contain substring
      <string>: 1
  In [It] at: /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go:513 @ 08/19/24 09:14:24.633 

The character "1" is present in one of the comments in /etc/sysconfig/irqbalance:

# irqbalance is a daemon process that distributes interrupts across
# CPUs on SMP systems.  The default is to rebalance once every 10

Hmm I'm a bit confused though, why we're not seeing it on other PRs as well?

yanirq commented 4 weeks ago

/retest

mrniranjan commented 3 weeks ago

/retest-required

MarSik commented 3 weeks ago

@Tal-or It was not showing on other PRs, because during the cleanup of mine I moved the padding to the right place. And the test is using the function that till now was returning a wider mask than just plain "1".

MarSik commented 3 weeks ago

/hold cancel

openshift-ci-robot commented 2 weeks ago

/retest-required

Remaining retests: 0 against base HEAD 151f6d244d0b276789c66420872c0c2bd1df889b and 2 for PR HEAD f20f45872ccc1eb04161962dd7c4fec39f57112b in total

openshift-ci[bot] commented 2 weeks ago

@MarSik: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
openshift-ci-robot commented 2 weeks ago

@MarSik: Jira Issue OCPBUGS-36431: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-36431 has been moved to the MODIFIED state.

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/1131): >The padding size was hardcoded to support max 64 nibbles which is equivalent to 256 cpus (%064x). Any mask bigger than this that was not itself 32 cpu aligned broke the chunking algorithm. Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
openshift-bot commented 2 weeks ago

[ART PR BUILD NOTIFIER]

Distgit: cluster-node-tuning-operator This PR has been included in build cluster-node-tuning-operator-container-v4.18.0-202408281841.p0.g980e038.assembly.stream.el9. All builds following this will include this PR.

MarSik commented 2 weeks ago

/cherry-pick release-4.17

openshift-cherrypick-robot commented 2 weeks ago

@MarSik: new pull request created: #1147

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/1131#issuecomment-2316841145): >/cherry-pick release-4.17 Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.