openshift / cluster-node-tuning-operator

Manage node-level tuning by orchestrating the tuned daemon.
Apache License 2.0

[release-4.16] OCPBUGS-39379: E2E: Add test to verify cpuset.cpus.exclusive is writeable #1153

Open mrniranjan opened 2 weeks ago

mrniranjan commented 2 weeks ago

Automates OCPBUGS-34812: cgroupsv2: failed to write on cpuset.cpus.exclusive

To reproduce the bug, we create and delete a deployment (guaranteed pods with the CPU load-balancing annotation) in quick succession, without fully waiting for cleanup. The pod being deleted still holds its exclusive CPUs, so the new pod fails because cpuset.cpus.exclusive has not yet been freed.

Because the pre-start hook fails to write to the cpuset.cpus.exclusive file in the pod's cgroup, the pod goes into the RunContainerError state.

This automation PR creates and deletes the deployment in a loop to reproduce the issue, and checks whether the pods fail with a runtime error containing the message "failed to run pre-start hook for container".

Manual backport of #1127
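For context, a minimal sketch of the kind of deployment the test churns. All names, the image, and the runtime class are illustrative placeholders, not taken from the actual test; the key ingredients are equal integer CPU requests and limits (Guaranteed QoS, so the pod gets exclusive CPUs) plus the CRI-O CPU load-balancing annotation:

```yaml
# Hypothetical manifest illustrating the reproduction setup: a guaranteed-QoS
# pod carrying the CPU load-balancing annotation. The test creates and deletes
# such a deployment repeatedly without waiting for cgroup cleanup.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpuset-exclusive-repro        # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cpuset-exclusive-repro
  template:
    metadata:
      labels:
        app: cpuset-exclusive-repro
      annotations:
        # Disables CPU load balancing for the pod's exclusive CPUs (CRI-O).
        cpu-load-balancing.crio.io: "disable"
    spec:
      runtimeClassName: performance-example   # assumed PerformanceProfile runtime class
      containers:
      - name: test
        image: registry.example.com/busybox   # placeholder image
        command: ["sleep", "inf"]
        resources:
          # Equal integer requests and limits => Guaranteed QoS, exclusive CPUs.
          requests:
            cpu: "2"
            memory: "100Mi"
          limits:
            cpu: "2"
            memory: "100Mi"
```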

openshift-ci-robot commented 2 weeks ago

@mrniranjan: This pull request references Jira Issue OCPBUGS-39379, which is valid. The bug has been moved to the POST state.

7 validation(s) were run on this bug:

* bug is open, matching expected state (open)
* bug target version (4.16.z) matches configured target version for branch (4.16.z)
* bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
* release note type set to "Release Note Not Required"
* dependent bug [Jira Issue OCPBUGS-39127](https://issues.redhat.com//browse/OCPBUGS-39127) is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
* dependent [Jira Issue OCPBUGS-39127](https://issues.redhat.com//browse/OCPBUGS-39127) targets the "4.17.0" version, which is one of the valid target versions: 4.17.0
* bug has dependents

Requesting review from QA contact: /cc @mrniranjan

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
openshift-ci[bot] commented 2 weeks ago

@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: mrniranjan.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci-robot commented 2 weeks ago

@mrniranjan: This pull request references Jira Issue OCPBUGS-39379, which is valid.

7 validation(s) were run on this bug:

* bug is open, matching expected state (open)
* bug target version (4.16.z) matches configured target version for branch (4.16.z)
* bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
* release note type set to "Release Note Not Required"
* dependent bug [Jira Issue OCPBUGS-39127](https://issues.redhat.com//browse/OCPBUGS-39127) is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
* dependent [Jira Issue OCPBUGS-39127](https://issues.redhat.com//browse/OCPBUGS-39127) targets the "4.17.0" version, which is one of the valid target versions: 4.17.0
* bug has dependents

Requesting review from QA contact: /cc @mrniranjan

The bug has been updated to refer to the pull request using the external bug tracker.

openshift-ci[bot] commented 2 weeks ago

@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: mrniranjan.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.


openshift-ci[bot] commented 2 weeks ago

@mrniranjan: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
yanirq commented 1 week ago

/approve
/label backport-risk-assessed

openshift-ci[bot] commented 1 week ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrniranjan, yanirq

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/openshift/cluster-node-tuning-operator/blob/release-4.16/OWNERS)~~ [yanirq]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.
mrniranjan commented 2 days ago

/lgtm

openshift-ci[bot] commented 2 days ago

@mrniranjan: you cannot LGTM your own PR.

mrniranjan commented 2 days ago

/label cherry-pick-approved