kubernetes-sigs / e2e-framework

A Go framework for end-to-end testing of components running in Kubernetes clusters.
Apache License 2.0

`Flux` Integration test with `kyverno` is flaking #440

Closed harshanarayana closed 4 months ago

harshanarayana commented 4 months ago

What happened?

The tests have been flaking heavily for the past few days.

https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-e2e-framework-test

--- FAIL: TestFluxRepoWorkflow (20.08s)
    --- FAIL: TestFluxRepoWorkflow/Check_creation_of_tenant_resources_under_kyverno_cluster_policies (10.04s)
        --- FAIL: TestFluxRepoWorkflow/Check_creation_of_tenant_resources_under_kyverno_cluster_policies/ensure_privileged_containers_can't_be_deployed_in_the_cluster (10.04s)
            flux_test.go:45: <nil>
            flux_test.go:47: <nil>

I did some debugging and found the following.

I0712 23:33:04.996051   19948 warning_handler.go:65] "would violate PodSecurity \"restricted:latest\": privileged (container \"nginx\" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container \"nginx\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container \"nginx\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or container \"nginx\" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container \"nginx\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")" logger="KubeAPIWarningLogger"
I0712 23:33:10.004013   19948 conditions.go:228] "Checking for condition match" resource="/, Kind= [flux-c1467/nginx-1]" state="True" conditionType="Available" cond=[{"type":"Available","status":"False","lastUpdateTime":"2024-07-12T18:03:04Z","lastTransitionTime":"2024-07-12T18:03:04Z","reason":"MinimumReplicasUnavailable","message":"Deployment does not have minimum availability."},{"type":"Progressing","status":"True","lastUpdateTime":"2024-07-12T18:03:04Z","lastTransitionTime":"2024-07-12T18:03:04Z","reason":"ReplicaSetUpdated","message":"ReplicaSet \"nginx-1-6dbc76f964\" is progressing."}]
I0712 23:33:15.009544   19948 conditions.go:228] "Checking for condition match" resource="/, Kind= [flux-c1467/nginx-1]" state="True" conditionType="Available" cond=[{"type":"Available","status":"True","lastUpdateTime":"2024-07-12T18:03:11Z","lastTransitionTime":"2024-07-12T18:03:11Z","reason":"MinimumReplicasAvailable","message":"Deployment has minimum availability."},{"type":"Progressing","status":"True","lastUpdateTime":"2024-07-12T18:03:11Z","lastTransitionTime":"2024-07-12T18:03:04Z","reason":"NewReplicaSetAvailable","message":"ReplicaSet \"nginx-1-6dbc76f964\" has successfully progressed."}]
I0712 23:33:15.037476   19948 warning_handler.go:65] "would violate PodSecurity \"restricted:latest\": privileged (container \"nginx\" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container \"nginx\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container \"nginx\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or container \"nginx\" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container \"nginx\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")" logger="KubeAPIWarningLogger"
I0712 23:33:20.045901   19948 conditions.go:228] "Checking for condition match" resource="/, Kind= [flux-c1467/nginx-2]" state="True" conditionType="Available" cond=[{"type":"Available","status":"False","lastUpdateTime":"2024-07-12T18:03:14Z","lastTransitionTime":"2024-07-12T18:03:14Z","reason":"MinimumReplicasUnavailable","message":"Deployment does not have minimum availability."},{"type":"Progressing","status":"True","lastUpdateTime":"2024-07-12T18:03:14Z","lastTransitionTime":"2024-07-12T18:03:14Z","reason":"ReplicaSetUpdated","message":"ReplicaSet \"nginx-2-7db5b7ffd5\" is progressing."}]
I0712 23:33:25.044663   19948 conditions.go:228] "Checking for condition match" resource="/, Kind= [flux-c1467/nginx-2]" state="True" conditionType="Available" cond=[{"type":"Available","status":"True","lastUpdateTime":"2024-07-12T18:03:22Z","lastTransitionTime":"2024-07-12T18:03:22Z","reason":"MinimumReplicasAvailable","message":"Deployment has minimum availability."},{"type":"Progressing","status":"True","lastUpdateTime":"2024-07-12T18:03:22Z","lastTransitionTime":"2024-07-12T18:03:14Z","reason":"NewReplicaSetAvailable","message":"ReplicaSet \"nginx-2-7db5b7ffd5\" has successfully progressed."}]

The flakiness appears to come from the kyverno policy not always being enforced. According to the config, the pod from the nginx-1 deployment should never start up, but it does. That leads to the tests failing.
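To make the log reading above concrete, here is a minimal sketch of the kind of check the "Checking for condition match" lines describe: decode the logged `cond` list and look for `Available` with status `True`. It is not the framework's implementation; the `condition` type and `conditionMatches` function are hypothetical names, and the JSON inputs are trimmed copies of the two nginx-1 entries.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition mirrors only the fields needed from the logged "cond" JSON.
// This type is an illustration, not a type from e2e-framework.
type condition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
	Reason string `json:"reason"`
}

// conditionMatches reports whether the JSON-encoded condition list contains
// a condition with the given type and status.
func conditionMatches(rawConds, condType, status string) (bool, error) {
	var conds []condition
	if err := json.Unmarshal([]byte(rawConds), &conds); err != nil {
		return false, fmt.Errorf("decoding conditions: %w", err)
	}
	for _, c := range conds {
		if c.Type == condType && c.Status == status {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Trimmed copies of the nginx-1 entries logged at 23:33:10 and 23:33:15.
	first := `[{"type":"Available","status":"False","reason":"MinimumReplicasUnavailable"},` +
		`{"type":"Progressing","status":"True","reason":"ReplicaSetUpdated"}]`
	second := `[{"type":"Available","status":"True","reason":"MinimumReplicasAvailable"},` +
		`{"type":"Progressing","status":"True","reason":"NewReplicaSetAvailable"}]`

	for _, raw := range []string{first, second} {
		matched, err := conditionMatches(raw, "Available", "True")
		fmt.Println(matched, err)
	}
}
```

The second entry matches, which is exactly the state the test expects never to see for a privileged nginx-1 pod.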

I0712 23:33:04.996051 19948 warning_handler.go:65] "would violate PodSecurity \"restricted:latest\": privileged (container \"nginx\" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container \"nginx\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container \"nginx\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or container \"nginx\" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container \"nginx\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")" logger="KubeAPIWarningLogger"

This warning is generated, but the deployment does not seem to be rejected.

What did you expect to happen?

The tests should not flake.

How can we reproduce it (as minimally and precisely as possible)?

Running the following under examples/third_party_integration/flux/kyverno easily reproduces the flaky tests:

for x in $(seq 1 15); do go test -count=1 ./...; done

Anything else we need to know?

No response

E2E Provider Used

kind

e2e-framework Version

HEAD

OS version

```console
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
```

prit342 commented 4 months ago

@harshanarayana Thank you for the pointers. As part of the PR https://github.com/kubernetes-sigs/e2e-framework/pull/438, I have added logic that checks that all the kyverno-related deployments are available before we run the tests. I have also run the test multiple times using the following shell script:

#!/bin/bash
set -euo pipefail
for x in $(seq 1 15); do
  echo "Running test ${x}"
  set -x
  go test -v ./... -count=1
  set +x
  sleep 5
  kind get clusters
done

The test has not failed since then.
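For readers who want the shape of that readiness gate, below is a minimal sketch under stated assumptions, not the code from the PR. `availabilityCheck`, `waitForAll`, and the deployment names are all hypothetical; in a real test the check would query the cluster rather than a counter.

```go
package main

import (
	"fmt"
	"time"
)

// availabilityCheck is a hypothetical hook that reports whether the named
// deployment is currently available. A real test would query the cluster.
type availabilityCheck func(name string) (bool, error)

// waitForAll polls until every named deployment is reported available in the
// same poll, or returns an error after maxAttempts polls.
func waitForAll(names []string, check availabilityCheck, maxAttempts int, interval time.Duration) error {
	var notReady []string
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		notReady = notReady[:0]
		for _, name := range names {
			ok, err := check(name)
			if err != nil {
				return fmt.Errorf("checking %q: %w", name, err)
			}
			if !ok {
				notReady = append(notReady, name)
			}
		}
		if len(notReady) == 0 {
			return nil
		}
		// Sleep between polls, but not after the final one.
		if attempt < maxAttempts {
			time.Sleep(interval)
		}
	}
	return fmt.Errorf("not all deployments available after %d attempts (pending: %v)", maxAttempts, notReady)
}

func main() {
	// Placeholder names; the actual kyverno deployment names are not given here.
	names := []string{"example-controller-a", "example-controller-b"}

	// Fake check: every deployment becomes available from the third poll on.
	polls := map[string]int{}
	check := func(name string) (bool, error) {
		polls[name]++
		return polls[name] >= 3, nil
	}

	if err := waitForAll(names, check, 5, 10*time.Millisecond); err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("all deployments available; safe to start the tests")
}
```

Gating the test on this kind of check means the policy engine is up before the privileged nginx deployment is applied, which is the ordering the flaky runs were missing.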

cc @vladimirvivien @cpanato