Closed codingben closed 1 month ago
/cc @akrejcir
Can you add which commit you cherry-picked this from?
Done.
The contents of this PR and #787 are not identical. vm-console-proxy is missing from this PR?
I'd rather we keep the fix also for vm-console-proxy, unless there is a reason not to.
/hold
The contents of this PR and #787 are not identical. vm-console-proxy is missing from this PR?
Yes, because the reported bug didn't mention vm-console-proxy.
I'd rather we keep the fix also for vm-console-proxy, unless there is a reason not to.
I pinged a QE engineer to check whether it's also required there. Why should we add more code to an older version if there is no reported bug about it?
Because we backport complete PRs and I don't think it does any harm in this case.
Sure, I'll add the code changes to vm-console-proxy. Until then, @0xFelix @akrejcir @ksimon1 @jcanocan I'm not sure how to reproduce the failures here. How would you tackle them?
labels[AppKubernetesNameLabel] = name
labels[AppKubernetesComponentLabel] = component.String()
labels[AppKubernetesManagedByLabel] = AppKubernetesManagedByValue
I'll try to run the whole suite of functional tests, but I'm not confident it's practical to reproduce and debug this, given how long they take to run.
I think I found the issue. As I suspected, it is not related to your PR.
This is what I did to give you a hint how you could investigate these failures in the future:
I've looked at the logs of the failing test: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/kubevirt_ssp-operator/1014/pull-ci-kubevirt-ssp-operator-release-v0.18-e2e-functests/1815720491986456576
There I've seen that many tests fail with [FAILED] Timed out waiting for SSP to be in phase Deployed. This error message comes from short-circuit logic that waits for SSP to be deployed; if it times out once, all subsequent tests fail. https://github.com/kubevirt/ssp-operator/blob/d613183406e8eba47ff526537af17d445909c3a3/tests/tests_suite_test.go#L487
Next, I've searched for the first test that failed, and it was:
Tekton Pipelines Operand resource creation when DeployTektonTaskResources is set to true [test_id:TODO] should create role bindings
/go/src/github.com/kubevirt/ssp-operator/tests/tekton-pipelines_test.go:57
[FAILED] in [BeforeEach] - /go/src/github.com/kubevirt/ssp-operator/tests/tekton-pipelines_test.go:30 @ 07/23/24 13:21:46.268
[FAILED] in [AfterEach] - /go/src/github.com/kubevirt/ssp-operator/tests/tests_suite_test.go:487 @ 07/23/24 13:21:46.404
• [FAILED] [600.222 seconds]
Tekton Pipelines Operand resource creation when DeployTektonTaskResources is set to true [BeforeEach] [test_id:TODO] should create role bindings
[BeforeEach] /go/src/github.com/kubevirt/ssp-operator/tests/tekton-pipelines_test.go:20
[It] /go/src/github.com/kubevirt/ssp-operator/tests/tekton-pipelines_test.go:57
[FAILED] Timed out after 600.001s.
Expected
<bool>: false
to be true
In [BeforeEach] at: /go/src/github.com/kubevirt/ssp-operator/tests/tekton-pipelines_test.go:30 @ 07/23/24 13:21:46.268
In the code, it failed here: https://github.com/kubevirt/ssp-operator/blob/d613183406e8eba47ff526537af17d445909c3a3/tests/tekton-pipelines_test.go#L30
I see that the issue was triggered by enabling the tekton feature gate. Let's look at the logs of ssp-operator pod: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/kubevirt_ssp-operator/1014/pull-ci-kubevirt-ssp-operator-release-v0.18-e2e-functests/1815720491986456576/artifacts/e2e-functests/gather-extra/artifacts/pods/kubevirt_ssp-operator-7464dcdfb-kbr8w_manager.log
There are a lot of errors, but the ones related to tekton are like this:
{
"level":"error",
"ts":"2024-07-23T13:17:26Z",
"msg":"Reconciler error",
"controller":"ssp",
"controllerGroup":"ssp.kubevirt.io",
"controllerKind":"SSP",
"SSP":{"name":"test-ssp","namespace":"ssp-operator-functests"},
"namespace":"ssp-operator-functests",
"name":"test-ssp",
"reconcileID":"3d8afad9-b630-4b93-8e83-15a85daa8492",
"error":"Tekton CRD tasks.tekton.dev does not exist",
"stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235"
}
The error message is: Tekton CRD tasks.tekton.dev does not exist. That means that Tekton was not properly installed in the cluster.
In the upstream CI, our scripts install tekton, so let's check them. Here we install tekton: https://github.com/kubevirt/ssp-operator/blob/d613183406e8eba47ff526537af17d445909c3a3/automation/common/deploy-kubevirt-and-cdi.sh#L15-L16
The ${TEKTON_VERSION} is defined here: https://github.com/kubevirt/ssp-operator/blob/d613183406e8eba47ff526537af17d445909c3a3/automation/common/versions.sh#L56-L63
Here we can see a potential problem. The KubeVirt and CDI versions are pinned to a compatible minor version, but Tekton uses the latest release.
Let's check if the CRD exists in the latest released Tekton version: https://github.com/tektoncd/operator/releases/tag/v0.72.0 . This is the file: https://github.com/tektoncd/operator/releases/download/v0.72.0/openshift-release.yaml
And as expected, it does not contain the CRD tasks.tekton.dev.
So the issue was that we were using too new a version of Tekton.
@akrejcir Thanks Andrej, I really appreciate your help! I'll learn from this; I've never seen an issue like this in CI before (in any repository).
I was able to reproduce this issue locally by running tekton-tasks functional tests:
Tekton Tasks Operand resource creation when DeployTektonTaskResources is set to true [test_id:TODO] should create cluster roles
/home/fedora/ssp-operator/tests/tekton-tasks_test.go:88
[FAILED] in [BeforeEach] - /home/fedora/ssp-operator/tests/tests_suite_test.go:487 @ 07/31/24 08:11:42.807
[FAILED] in [AfterEach] - /home/fedora/ssp-operator/tests/tests_suite_test.go:487 @ 07/31/24 08:11:43.294
• [FAILED] [0.814 seconds]
Tekton Tasks Operand resource creation when DeployTektonTaskResources is set to true [BeforeEach] [test_id:TODO] should create cluster roles
[BeforeEach] /home/fedora/ssp-operator/tests/tekton-tasks_test.go:22
[It] /home/fedora/ssp-operator/tests/tekton-tasks_test.go:88
[FAILED] Timed out waiting for SSP to be in phase Deployed.
In [BeforeEach] at: /home/fedora/ssp-operator/tests/tests_suite_test.go:487 @ 07/31/24 08:11:42.807
I'll rebase this PR once https://github.com/kubevirt/ssp-operator/pull/1023 is merged and backported to this release branch.
Issues
- 13 New issues
- 10 Accepted issues
Measures
- 0 Security Hotspots
- No data about Coverage
- 0.6% Duplication on New Code
@0xFelix @akrejcir @ksimon1 Hi, can you please review? If everything is okay, the labels needed to merge it are still missing. CI passed. Thanks! :)
I'd rather we keep the fix also for vm-console-proxy, unless there is a reason not to.
/hold
I've also added the changes to vm-console-proxy.
/unhold
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: 0xFelix
The full list of commands accepted by this bot can be found here.
The pull request process is described here
This is a manual cherry-pick of #787.
What this PR does / why we need it:
Add required labels (managed-by, part-of, version, component) to template-validator and vm-console-proxy pods.
Jira-Url: https://issues.redhat.com/browse/CNV-44518
Release note: