Closed: Tal-or closed this pull request 1 month ago.
@Tal-or: This pull request explicitly references no jira issue.
[It] /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/netqueues.go:91
[FAILED] Unexpected error:
<*errors.errorString | 0xc000700180>:
failed to run command [tuned-adm profile_info openshift-node-performance-performance]: output "Profile name:\nopenshift-node-performance-performance\n\nProfile summary:\nOpenshift node optimized for deterministic performance at the cost of increased power consumption, focused on low latency network performance. Based on Tuned 2.11 and Cluster node tuning (oc 4.5)\n\nProfile description:\n\n"; error "Cannot talk to TuneD daemon via DBus. Is TuneD daemon running?\n"
{
s: "failed to run command [tuned-adm profile_info openshift-node-performance-performance]: output \"Profile name:\\nopenshift-node-performance-performance\\n\\nProfile summary:\\nOpenshift node optimized for deterministic performance at the cost of increased power consumption, focused on low latency network performance. Based on Tuned 2.11 and Cluster node tuning (oc 4.5)\\n\\nProfile description:\\n\\n\"; error \"Cannot talk to TuneD daemon via DBus. Is TuneD daemon running?\\n\"",
}
occurred
Cannot talk to TuneD daemon via DBus. Is TuneD daemon running?
This is expected. DBus is not supported in the container and never was.
/retest Only because I don't have time to take a deeper look, so let's see if it's a flake or real persistent issue
a flake or real persistent issue
It is a real persistent issue, I can reproduce this locally. I believe the check for err should be dropped and we should only check for exit status 0. As I stated above, you'll always get this error message from TuneD. No idea why this wasn't an issue with Tty: true. It looks to me this never worked -- the error output wasn't ever sent to err and now it finally is?
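For illustration, a minimal sketch of what such a check could look like, using plain os/exec as a stand-in for the in-pod exec helper used by the functests (the function, profile name and assertion style below are assumptions for the sketch, not the repository's actual test code):

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

// Illustrative only: run tuned-adm and judge success by the exit code and the
// expected stdout, not by anything the tool prints to stderr.
func checkTunedProfileInfo(profile string) error {
	cmd := exec.Command("tuned-adm", "profile_info", profile)

	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr // keep it around for debugging, but do not fail on it

	if err := cmd.Run(); err != nil {
		// Run returns *exec.ExitError on a non-zero exit code: that is a real failure.
		return fmt.Errorf("tuned-adm exited non-zero: %v, stderr: %q", err, stderr.String())
	}

	// "Cannot talk to TuneD daemon via DBus" may appear on stderr; ignore it.
	if !strings.Contains(stdout.String(), profile) {
		return fmt.Errorf("profile %q not found in tuned-adm output: %q", profile, stdout.String())
	}
	return nil
}

func main() {
	if err := checkTunedProfileInfo("openshift-node-performance-performance"); err != nil {
		fmt.Println("check failed:", err)
		return
	}
	fmt.Println("check passed")
}
```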
I don't like the way we're handling issues in the CI lately and surely it will come back at us later on.
I mean, we already gave ourselves some slack, didn't we?
IMO we should at least try to understand where this error comes from. Otherwise I don't see a point in a test that doesn't check for errors.
Also, there might be another way to check that tuned was applied on the node. Maybe we can check the Tuned profile instead.
I don't like the way we're handling issues in the CI lately and surely it will come back at us later on.
Well, there are many things "I don't like". For example, the fact that this e2e test was ever merged, because it checks for the wrong things and doesn't seem to catch real failures.
IMO we should at least try to understand where this error comes from.
We already know this. This error comes from TuneD and (as I stated) was present there since the beginning. It is not my fault the e2e tests were not written in a way to catch these "issues" in the past. Putting "issues" in double quotes, because tuned-adm still returns a valid result and exits 0. That is the key thing to check. There are many programs that write stuff to stderr which should not be considered failures. A properly written tool will exit non-zero on failure. Are the e2e tests checking for this?
What we (I) do not currently understand is why the stderr was not passed with Tty: true in the past.
Otherwise I don't see a point in a test that doesn't check for errors.
Exactly, the tests should check for errors, we agree on this. But for errors which are real errors and not just warnings thrown on stderr. There are many standard command-line tools which put help messages on stderr. Is that an error? No.
Also, there might be another way to check that tuned was applied on the node. Maybe we can check the Tuned profile instead.
Yes, that's a possibility and perhaps a better one.
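For example, something along these lines could query the per-node Profile object instead of exec'ing tuned-adm. This is only a sketch with the dynamic client; the GVR, namespace, placeholder node name, and the status.tunedProfile field path are my assumptions about the NTO API rather than verified details:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

// Sketch: instead of exec'ing tuned-adm in the pod, read the Profile object the
// operator keeps per node and compare the profile it reports as applied.
func appliedTunedProfile(ctx context.Context, client dynamic.Interface, nodeName string) (string, error) {
	gvr := schema.GroupVersionResource{Group: "tuned.openshift.io", Version: "v1", Resource: "profiles"}
	obj, err := client.Resource(gvr).Namespace("openshift-cluster-node-tuning-operator").
		Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	profile, found, err := unstructured.NestedString(obj.Object, "status", "tunedProfile")
	if err != nil || !found {
		return "", fmt.Errorf("status.tunedProfile not found on Profile %s: %v", nodeName, err)
	}
	return profile, nil
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// "worker-0" is a placeholder node name.
	profile, err := appliedTunedProfile(context.TODO(), client, "worker-0")
	if err != nil {
		panic(err)
	}
	fmt.Println("applied profile:", profile)
}
```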
Edit: Actually, we likely do test for a non-zero exit code in ExecCommandOnPod(), but that information is lost in what the function returns. It is likely combined into stderr, which IMO is just wrong.
Well, there are many things "I don't like". For example, the fact that this e2e test was ever merged, because it checks for the wrong things and doesn't seem to catch real failures.
+1, and we have many others, unfortunately.
What we (I) do not currently understand is why the stderr was not passed with Tty: true in the past.
It might be related to the recent change we're seeing in the shell in general, and it was masked up until now.
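If that is the case, the difference would boil down to which streams get wired up when the command is exec'd in the pod. A rough sketch with client-go's remotecommand.StreamOptions (the package name below is just illustrative):

```go
package pods

import (
	"bytes"

	"k8s.io/client-go/tools/remotecommand"
)

// Sketch only: how the Tty flag changes what the caller sees. With a TTY the
// command runs on a single pseudo-terminal stream, so TuneD's DBus warning is
// interleaved into stdout and a client-side stderr buffer stays empty.
// Without a TTY, stdout and stderr arrive as separate streams.
func streamOptions(tty bool, stdout, stderr *bytes.Buffer) remotecommand.StreamOptions {
	if tty {
		return remotecommand.StreamOptions{
			Stdout: stdout, // process stdout and stderr both land here
			Tty:    true,
		}
	}
	return remotecommand.StreamOptions{
		Stdout: stdout,
		Stderr: stderr, // "Cannot talk to TuneD daemon via DBus" shows up here
		Tty:    false,
	}
}
```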
With the recent change the test will check that TuneD has the right profile name. This kind of check is actually meaningful.
/retest Seems like an infra issue. At least pao-gcp finally passed.
I don't like the way we're handling issues in the CI lately and surely it will come back at us later on.
Well, there are many things "I don't like". For example, the fact that this e2e test was ever merged, because it checks for the wrong things and doesn't seem to catch real failures.
I concur. The problem is not really merging bad tests. This happens. Sometimes we don't know better, sometimes we learn as we go, sometimes it's just how things are. The real killer is that we rarely, if ever, go back and review our tests to improve them, removing them when necessary (hopefully rarely). We do this exercise pretty much only when stuff breaks, which is not ideal.
IMO we should at least try to understand where this error comes from.
We already know this. This error comes from TuneD and (as I stated) was present there since the beginning. It is not my fault the e2e tests were not written in a way to catch these "issues" in the past. Putting "issues" in double quotes, because tuned-adm still returns a valid result and exits 0. That is the key thing to check. There are many programs that write stuff to stderr which should not be considered failures. A properly written tool will exit non-zero on failure. Are the e2e tests checking for this? [...] Edit: Actually, we likely do test for a non-zero exit code in ExecCommandOnPod(), but that information is lost in what the function returns. It is likely combined into stderr, which IMO is just wrong.
Yes, the function should be changed to not interpret stuff sent to stderr as an error. It is not necessarily an error. Better to address this in another PR because it can open a pretty big can of worms.
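A possible direction for that follow-up, sketched below: keep stdout, stderr and the exit code apart in the helper's return values. The function name and signature are assumptions, not the existing ExecCommandOnPod(); as far as I can tell, client-go surfaces a non-zero remote exit code as a CodeExitError, but that detail is worth double-checking.

```go
package pods

import (
	"bytes"
	"context"
	"net/url"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
	utilexec "k8s.io/client-go/util/exec"
)

// Sketch of an exec helper that does not fold stderr into the returned error.
// Output on stderr alone is not a failure; a non-zero exit code is, and it is
// reported separately so the caller can decide what counts as an error.
func execOnPodSeparated(cfg *rest.Config, execURL *url.URL) (stdout, stderr string, exitCode int, err error) {
	executor, err := remotecommand.NewSPDYExecutor(cfg, "POST", execURL)
	if err != nil {
		return "", "", -1, err
	}

	var outBuf, errBuf bytes.Buffer
	streamErr := executor.StreamWithContext(context.TODO(), remotecommand.StreamOptions{
		Stdout: &outBuf,
		Stderr: &errBuf,
		Tty:    false, // keep stderr on its own stream
	})

	if streamErr != nil {
		// A non-zero exit code from the remote process is not a transport error;
		// extract it and let the caller judge whether it is a failure.
		if codeErr, ok := streamErr.(utilexec.CodeExitError); ok {
			return outBuf.String(), errBuf.String(), codeErr.Code, nil
		}
		return outBuf.String(), errBuf.String(), -1, streamErr
	}
	return outBuf.String(), errBuf.String(), 0, nil
}
```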
/hold /lgtm
This is not perfect, but it seems the best option atm to stop the bleeding, and I do think it's heading in the right direction
Unless there is an actual blocker, I highly suggest merging this PR (ofc if all tests are passing) and iterating again later. The PR queue is already piling up.
/hold /lgtm This is not perfect, but it seems the best option atm to stop the bleeding, and I do think it's heading in the right direction
Unless there is an actual blocker, I highly suggest merging this PR (ofc if all tests are passing) and iterating again later. The PR queue is already piling up.
I concur, hence the LGTM. Holding to let other ppl chime in.
@Tal-or: all tests passed!
Full PR test history. Your PR dashboard.
@jmencak PTAL and lift the hold if you're happy
Thank you for this PR. Sorry, couldn't review earlier. Still technically PTO.
Let's improve the ExecCommandOnPod() in a follow-up PR. We need to unblock the CI ASAP.
/approve
/hold cancel
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jmencak, Tal-or
The full list of commands accepted by this bot can be found here.
The pull request process is described here
[ART PR BUILD NOTIFIER]
Distgit: cluster-node-tuning-operator This PR has been included in build cluster-node-tuning-operator-container-v4.18.0-202410152041.p0.gd8898a5.assembly.stream.el9. All builds following this will include this PR.
/retitle OCPBUGS-43280: CI: unblock
@Tal-or: Jira Issue OCPBUGS-43280: All pull requests linked via external trackers have merged:
Jira Issue OCPBUGS-43280 has been moved to the MODIFIED state.
I see the same issue as described by OCPBUGS-43280 in the 4.17 lane too.
/cherry-pick release-4.17
@jmencak: new pull request created: #1188
This PR should unblock CI in the short term. We're doing this out of necessity to relieve the pressure on the queue, and we'll keep investigating the root cause in the background.