Closed codingben closed 1 month ago
/cc @akrejcir
Can you add which commit you cherry-picked this from?
Done.
The contents of this PR and #787 are not identical. vm-console-proxy is missing from this PR?
I'd rather we keep the fix also for vm-console-proxy, unless there is a reason not to.
/hold
The contents of this PR and #787 are not identical. vm-console-proxy is missing from this PR?
Yes, because the reported bug didn't mention vm-console-proxy.
I'd rather we keep the fix also for vm-console-proxy, unless there is a reason not to.
I pinged a QE engineer to check whether it's also required there. Why should we add more code to an older version if there is no reported bug about it?
Because we backport complete PRs and I don't think it does any harm in this case.
Sure, I'll add the code changes to vm-console-proxy. Until then, @0xFelix @akrejcir @ksimon1 @jcanocan I'm not sure how to reproduce the failures here. How would you tackle them?
labels[AppKubernetesNameLabel] = name
labels[AppKubernetesComponentLabel] = component.String()
labels[AppKubernetesManagedByLabel] = AppKubernetesManagedByValue
I'll try to run the whole suite of functional tests, but I'm not confident it's practical to reproduce and debug this, given how long they take to run.
I think I found the issue. As I suspected, it is not related to your PR.
This is what I did to give you a hint how you could investigate these failures in the future:
I've looked at the logs of the failing test: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/kubevirt_ssp-operator/1014/pull-ci-kubevirt-ssp-operator-release-v0.18-e2e-functests/1815720491986456576
There I've seen that many tests fail with [FAILED] Timed out waiting for SSP to be in phase Deployed. This error message comes from short-circuit logic that waits for SSP to be deployed; if it times out once, all subsequent tests fail. https://github.com/kubevirt/ssp-operator/blob/d613183406e8eba47ff526537af17d445909c3a3/tests/tests_suite_test.go#L487
Next, I've searched for the first test that failed, and it was:
Tekton Pipelines Operand resource creation when DeployTektonTaskResources is set to true [test_id:TODO] should create role bindings
/go/src/github.com/kubevirt/ssp-operator/tests/tekton-pipelines_test.go:57
[FAILED] in [BeforeEach] - /go/src/github.com/kubevirt/ssp-operator/tests/tekton-pipelines_test.go:30 @ 07/23/24 13:21:46.268
[FAILED] in [AfterEach] - /go/src/github.com/kubevirt/ssp-operator/tests/tests_suite_test.go:487 @ 07/23/24 13:21:46.404
• [FAILED] [600.222 seconds]
Tekton Pipelines Operand resource creation when DeployTektonTaskResources is set to true [BeforeEach] [test_id:TODO] should create role bindings
[BeforeEach] /go/src/github.com/kubevirt/ssp-operator/tests/tekton-pipelines_test.go:20
[It] /go/src/github.com/kubevirt/ssp-operator/tests/tekton-pipelines_test.go:57
[FAILED] Timed out after 600.001s.
Expected
<bool>: false
to be true
In [BeforeEach] at: /go/src/github.com/kubevirt/ssp-operator/tests/tekton-pipelines_test.go:30 @ 07/23/24 13:21:46.268
In the code, it failed here: https://github.com/kubevirt/ssp-operator/blob/d613183406e8eba47ff526537af17d445909c3a3/tests/tekton-pipelines_test.go#L30
I see that the issue was triggered by enabling the tekton feature gate. Let's look at the logs of ssp-operator pod: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/kubevirt_ssp-operator/1014/pull-ci-kubevirt-ssp-operator-release-v0.18-e2e-functests/1815720491986456576/artifacts/e2e-functests/gather-extra/artifacts/pods/kubevirt_ssp-operator-7464dcdfb-kbr8w_manager.log
There are a lot of errors, but the ones related to tekton are like this:
{
"level":"error",
"ts":"2024-07-23T13:17:26Z",
"msg":"Reconciler error",
"controller":"ssp",
"controllerGroup":"ssp.kubevirt.io",
"controllerKind":"SSP",
"SSP":{"name":"test-ssp","namespace":"ssp-operator-functests"},
"namespace":"ssp-operator-functests",
"name":"test-ssp",
"reconcileID":"3d8afad9-b630-4b93-8e83-15a85daa8492",
"error":"Tekton CRD tasks.tekton.dev does not exist",
"stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235"
}
The error message is: Tekton CRD tasks.tekton.dev does not exist. That means that Tekton was not properly installed in the cluster.
In the upstream CI, our scripts install tekton, so let's check them. Here we install tekton: https://github.com/kubevirt/ssp-operator/blob/d613183406e8eba47ff526537af17d445909c3a3/automation/common/deploy-kubevirt-and-cdi.sh#L15-L16
The ${TEKTON_VERSION} is defined here: https://github.com/kubevirt/ssp-operator/blob/d613183406e8eba47ff526537af17d445909c3a3/automation/common/versions.sh#L56-L63
Here we can see a potential problem. The KubeVirt and CDI versions are pinned to a compatible minor version, but Tekton uses the latest release.
Let's check if the CRD exists in the latest released Tekton version: https://github.com/tektoncd/operator/releases/tag/v0.72.0 . This is the file: https://github.com/tektoncd/operator/releases/download/v0.72.0/openshift-release.yaml
And as expected, it does not contain the CRD tasks.tekton.dev.
So the issue was that we were using too new a version of Tekton.
@akrejcir Thanks Andrej, I really appreciate your help! I'll learn from this; I've never seen an issue like this in CI before (in any repository).
I was able to reproduce this issue locally by running tekton-tasks functional tests:
Tekton Tasks Operand resource creation when DeployTektonTaskResources is set to true [test_id:TODO] should create cluster roles
/home/fedora/ssp-operator/tests/tekton-tasks_test.go:88
[FAILED] in [BeforeEach] - /home/fedora/ssp-operator/tests/tests_suite_test.go:487 @ 07/31/24 08:11:42.807
[FAILED] in [AfterEach] - /home/fedora/ssp-operator/tests/tests_suite_test.go:487 @ 07/31/24 08:11:43.294
• [FAILED] [0.814 seconds]
Tekton Tasks Operand resource creation when DeployTektonTaskResources is set to true [BeforeEach] [test_id:TODO] should create cluster roles
[BeforeEach] /home/fedora/ssp-operator/tests/tekton-tasks_test.go:22
[It] /home/fedora/ssp-operator/tests/tekton-tasks_test.go:88
[FAILED] Timed out waiting for SSP to be in phase Deployed.
In [BeforeEach] at: /home/fedora/ssp-operator/tests/tests_suite_test.go:487 @ 07/31/24 08:11:42.807
I'll rebase this PR once https://github.com/kubevirt/ssp-operator/pull/1023 is merged and backported to this release branch.
Issues
- 13 New issues
- 10 Accepted issues
Measures
- 0 Security Hotspots
- No data about Coverage
- 0.6% Duplication on New Code
@0xFelix @akrejcir @ksimon1 Hi, can you please review? If everything is okay, the labels needed to merge it are still missing. CI passed. Thanks! :)
I'd rather we keep the fix also for vm-console-proxy, unless there is a reason not to.
/hold
I've also added the changes to vm-console-proxy.
/unhold
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: 0xFelix
The full list of commands accepted by this bot can be found here.
The pull request process is described here
This is a manual cherry-pick of #787.
What this PR does / why we need it:
Add required labels (managed-by, part-of, version, component) to template-validator and vm-console-proxy pods.
Jira-Url: https://issues.redhat.com/browse/CNV-44518
Release note: