openshift / cluster-node-tuning-operator

Manage node-level tuning by orchestrating the tuned daemon.
Apache License 2.0

OCPBUGS-43280: CI: unblock #1182

Closed Tal-or closed 1 month ago

Tal-or commented 1 month ago

This PR should unblock CI in the short term. We're doing this out of necessity to relieve the pressure from the queue, and we'll keep investigating the root cause in the background.

issues:

  1. https://issues.redhat.com/browse/OCPBUGS-43280
  2. Terminal configuration has changed and now it provides date and color, which breaks the tests
openshift-ci-robot commented 1 month ago

@Tal-or: This pull request explicitly references no jira issue.

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/1182):

> This PR should unblock CI in the short term. We're doing this out of necessity to relieve the pressure from the queue, and we'll keep investigating the root cause in the background.
>
> issues:
>
> 1. https://issues.redhat.com/browse/OCPBUGS-43280
> 2. Terminal configuration has changed and now it provides date and color, which breaks the tests

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
jmencak commented 1 month ago
 [It] /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/netqueues.go:91

  [FAILED] Unexpected error:
      <*errors.errorString | 0xc000700180>: 
      failed to run command [tuned-adm profile_info openshift-node-performance-performance]: output "Profile name:\nopenshift-node-performance-performance\n\nProfile summary:\nOpenshift node optimized for deterministic performance at the cost of increased power consumption, focused on low latency network performance. Based on Tuned 2.11 and Cluster node tuning (oc 4.5)\n\nProfile description:\n\n"; error "Cannot talk to TuneD daemon via DBus. Is TuneD daemon running?\n"
      {
          s: "failed to run command [tuned-adm profile_info openshift-node-performance-performance]: output \"Profile name:\\nopenshift-node-performance-performance\\n\\nProfile summary:\\nOpenshift node optimized for deterministic performance at the cost of increased power consumption, focused on low latency network performance. Based on Tuned 2.11 and Cluster node tuning (oc 4.5)\\n\\nProfile description:\\n\\n\"; error \"Cannot talk to TuneD daemon via DBus. Is TuneD daemon running?\\n\"",
      }
  occurred

`Cannot talk to TuneD daemon via DBus. Is TuneD daemon running?` is expected. DBus is not supported in the container and never was.

Tal-or commented 1 month ago

/retest

Only because I don't have time to take a deeper look, so let's see if it's a flake or a real persistent issue.

jmencak commented 1 month ago

> a flake or a real persistent issue

It is a real persistent issue, I can reproduce this locally. I believe the check for err should be dropped and we should only check for exit status 0. As I stated above, you'll always get this error message from TuneD. No idea why this wasn't an issue with Tty: true. It looks to me like this never worked -- the error output was never sent to err, and now it finally is?
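For illustration, a minimal Go sketch of the check proposed here, assuming a hypothetical `runOnNode` exec helper (the real functests helper and its signature may differ): only a non-zero exit status from `tuned-adm` is treated as a failure, while the expected DBus warning on stderr is ignored.

```go
package netqueues

import (
	"fmt"
	"strings"
)

// runOnNodeFunc is a stand-in for whatever exec helper the functests use to
// run a command on the node; its signature here is an assumption.
type runOnNodeFunc func(cmd []string) (stdout string, stderr string, exitCode int, err error)

// verifyTunedProfileInfo judges tuned-adm by exit status only: the DBus
// warning that always appears on stderr in the container is not a failure.
func verifyTunedProfileInfo(runOnNode runOnNodeFunc) error {
	cmd := []string{"tuned-adm", "profile_info", "openshift-node-performance-performance"}

	stdout, stderr, exitCode, err := runOnNode(cmd)
	if err != nil {
		// err means the exec itself failed (transport/API error), not the command.
		return fmt.Errorf("failed to exec %v: %w", cmd, err)
	}
	if exitCode != 0 {
		// Only a non-zero exit status is a real failure.
		return fmt.Errorf("command %v exited %d, stderr: %q", cmd, exitCode, stderr)
	}
	if !strings.Contains(stdout, "openshift-node-performance-performance") {
		return fmt.Errorf("unexpected profile_info output: %q", stdout)
	}
	return nil
}
```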

Tal-or commented 1 month ago

> > a flake or a real persistent issue
>
> It is a real persistent issue, I can reproduce this locally. I believe the check for err should be dropped and we should only check for exit status 0. As I stated above, you'll always get this error message from TuneD. No idea why this wasn't an issue with Tty: true. It looks to me like this never worked -- the error output was never sent to err, and now it finally is?

I don't like the way we're handling issues in the CI lately, and surely it will come back at us later on.

I mean, we already gave ourselves some slack, didn't we?

IMO we should at least try to understand where this error comes from. Otherwise I don't see a point in a test that doesn't check for errors.

Also, there might be another way to check that tuned was applied on the node. Maybe we can check the Tuned profile instead.

jmencak commented 1 month ago

> I don't like the way we're handling issues in the CI lately, and surely it will come back at us later on.

Well, there are many things "I don't like". For example, the fact that this e2e test was ever merged, because it checks for the wrong things and doesn't seem to catch real failures.

> IMO we should at least try to understand where this error comes from.

We already know this. This error comes from TuneD and (as I stated) has been present there since the beginning. It is not my fault the e2e tests were not written in a way to catch these "issues" in the past. I'm putting "issues" in double quotes because tuned-adm still returns a valid result and exits 0. That is the key thing to check. There are many programs that write stuff to stderr which should not be considered failures. A properly written tool will exit non-zero on failure. Are the e2e tests checking for this?

What we (I) do not currently understand is why the stderr was not passed with Tty: true in the past.

> Otherwise I don't see a point in a test that doesn't check for errors.

Exactly, the tests should check for errors; we agree on this. But for errors which are real errors, not just warnings thrown on stderr. There are many standard command-line tools which put help messages on stderr. Is that an error? No.

> Also, there might be another way to check that tuned was applied on the node. Maybe we can check the Tuned profile instead.

Yes, that's a possibility and perhaps a better one.

Edit: Actually, we likely do test for a non-0 exit code in ExecCommandOnPod(), but that information is lost by the time the function returns. It is likely combined into stderr, which IMO is just wrong.
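As a rough illustration of the separation being discussed (not the project's actual ExecCommandOnPod()), here is a sketch that keeps stdout, stderr, and the exit code apart instead of folding everything into one error. It uses a local os/exec command as a stand-in for the pod exec.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os/exec"
)

type execResult struct {
	Stdout   string
	Stderr   string
	ExitCode int
}

// runCommand returns stdout, stderr, and the exit code separately; err is
// reserved for cases where the command could not be run at all.
func runCommand(name string, args ...string) (execResult, error) {
	var out, errBuf bytes.Buffer
	cmd := exec.Command(name, args...)
	cmd.Stdout = &out
	cmd.Stderr = &errBuf

	err := cmd.Run()
	res := execResult{Stdout: out.String(), Stderr: errBuf.String()}

	var exitErr *exec.ExitError
	switch {
	case err == nil:
		res.ExitCode = 0
	case errors.As(err, &exitErr):
		// The command ran but exited non-zero; report the code, not an opaque error.
		res.ExitCode = exitErr.ExitCode()
	default:
		// The command could not be started at all.
		return res, err
	}
	return res, nil
}

func main() {
	res, err := runCommand("tuned-adm", "profile_info", "openshift-node-performance-performance")
	if err != nil {
		fmt.Println("exec failed:", err)
		return
	}
	// Callers decide what to do with stderr; only a non-zero exit code is a failure.
	fmt.Printf("exit=%d stderr=%q\n", res.ExitCode, res.Stderr)
}
```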

Tal-or commented 1 month ago

> Well, there are many things "I don't like". For example, the fact that this e2e test was ever merged, because it checks for the wrong things and doesn't seem to catch real failures.

+1, and we have many others, unfortunately.

> What we (I) do not currently understand is why the stderr was not passed with Tty: true in the past.

It might be related to the recent change we're seeing in the shell in general, and it was masked up until now.

With the recent change, the test will check that TuneD has the right profile name. This kind of check is actually meaningful.
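For context, a minimal sketch of the kind of profile-name check mentioned above, parsing the stdout of `tuned-adm active`; the exact output line ("Current active profile: ...") and the runOnNode helper are assumptions for illustration, not the test's actual code.

```go
package netqueues

import (
	"fmt"
	"strings"
)

// checkActiveProfile asserts that the profile TuneD reports as active matches
// the expected one. Both the runOnNode helper and the exact output format of
// `tuned-adm active` are assumed here for illustration.
func checkActiveProfile(runOnNode func(cmd []string) (stdout string, exitCode int, err error), want string) error {
	stdout, exitCode, err := runOnNode([]string{"tuned-adm", "active"})
	if err != nil || exitCode != 0 {
		return fmt.Errorf("tuned-adm active failed: exit=%d err=%v", exitCode, err)
	}
	for _, line := range strings.Split(stdout, "\n") {
		if strings.HasPrefix(line, "Current active profile:") {
			got := strings.TrimSpace(strings.TrimPrefix(line, "Current active profile:"))
			if got != want {
				return fmt.Errorf("active profile is %q, want %q", got, want)
			}
			return nil
		}
	}
	return fmt.Errorf("could not find the active profile in output: %q", stdout)
}
```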

Tal-or commented 1 month ago

/retest

Seems like an infra issue. At least pao-gcp finally passed.

ffromani commented 1 month ago

> > I don't like the way we're handling issues in the CI lately, and surely it will come back at us later on.
>
> Well, there are many things "I don't like". For example, the fact that this e2e test was ever merged, because it checks for the wrong things and doesn't seem to catch real failures.

I concur. The problem is not really merging bad tests. This happens. Sometimes we don't know better, sometimes we learn as we go, sometimes it's just how things are. The real killer is that we rarely, if ever, go back and review our tests to improve them, removing them when necessary (hopefully rarely). We do this exercise pretty much only when stuff breaks, which is not ideal.

> > IMO we should at least try to understand where this error comes from.
>
> We already know this. This error comes from TuneD and (as I stated) has been present there since the beginning. It is not my fault the e2e tests were not written in a way to catch these "issues" in the past. I'm putting "issues" in double quotes because tuned-adm still returns a valid result and exits 0. That is the key thing to check. There are many programs that write stuff to stderr which should not be considered failures. A properly written tool will exit non-zero on failure. Are the e2e tests checking for this? [...] Edit: Actually, we likely do test for a non-0 exit code in ExecCommandOnPod(), but that information is lost by the time the function returns. It is likely combined into stderr, which IMO is just wrong.

Yes, the function should be changed to not interpret output sent to stderr as an error. It is not necessarily an error. Better to address this in another PR, because it can open a pretty big can of worms.

Tal-or commented 1 month ago

/hold
/lgtm

This is not perfect, but it seems like the best option atm to stop the bleeding, and I do think it's heading in the right direction.

Unless there is an actual blocker, I highly suggest merging this PR (ofc if all tests are passing) and iterating again later. The PR queue is already piling up.

ffromani commented 1 month ago

> /hold
> /lgtm
>
> This is not perfect, but it seems like the best option atm to stop the bleeding, and I do think it's heading in the right direction.
>
> Unless there is an actual blocker, I highly suggest merging this PR (ofc if all tests are passing) and iterating again later. The PR queue is already piling up.

I concur, hence the LGTM. Holding to let other ppl chime in.

openshift-ci[bot] commented 1 month ago

@Tal-or: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
Tal-or commented 1 month ago

@jmencak PTAL and lift the hold if you're happy

jmencak commented 1 month ago

Thank you for this PR. Sorry I couldn't review earlier; I'm still technically on PTO.

Let's improve ExecCommandOnPod() in a follow-up PR. We need to unblock the CI ASAP.

/approve
/hold cancel

openshift-ci[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmencak, Tal-or

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/openshift/cluster-node-tuning-operator/blob/master/OWNERS)~~ [jmencak]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.
openshift-bot commented 1 month ago

[ART PR BUILD NOTIFIER]

Distgit: cluster-node-tuning-operator

This PR has been included in build cluster-node-tuning-operator-container-v4.18.0-202410152041.p0.gd8898a5.assembly.stream.el9. All builds following this will include this PR.

jmencak commented 1 month ago

/retitle OCPBUGS-43280: CI: unblock

openshift-ci-robot commented 1 month ago

@Tal-or: Jira Issue OCPBUGS-43280: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-43280 has been moved to the MODIFIED state.

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/1182):

> This PR should unblock CI in the short term. We're doing this out of necessity to relieve the pressure from the queue, and we'll keep investigating the root cause in the background.
>
> issues:
>
> 1. https://issues.redhat.com/browse/OCPBUGS-43280
> 2. Terminal configuration has changed and now it provides date and color, which breaks the tests

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
jmencak commented 1 month ago

I see the same issue as described by OCPBUGS-43280 in the 4.17 lane too.

/cherry-pick release-4.17

openshift-cherrypick-robot commented 1 month ago

@jmencak: new pull request created: #1188

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/1182#issuecomment-2423027185):

> I see the same issue as described by [OCPBUGS-43280](https://github.com/openshift/cluster-node-tuning-operator/pull/1182#top) in the 4.17 lane too.
>
> /cherry-pick release-4.17

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.