kubernetes / minikube

Run Kubernetes locally
https://minikube.sigs.k8s.io/
Apache License 2.0

`TestAddons/parallel/Registry` failing in all test environments #19714

Open spowelljr opened 19 hours ago

spowelljr commented 19 hours ago

Timeline

Root Cause

After the migration to AR, when the GCP-Auth addon is enabled with the mock credentials we use for testing, attempts to pull images from gcr.io/k8s-minikube fail with `unauthorized: authentication failed` errors.

Reproduction

$ export GOOGLE_APPLICATION_CREDENTIALS="/Users/<user>/repo/minikube/test/integration/testdata/gcp-creds.json"
$ export GOOGLE_CLOUD_PROJECT="this_is_fake"
$ export MOCK_GOOGLE_TOKEN="true"

$ minikube start --addons gcp-auth
😄  minikube v1.34.0 on Darwin 14.7 (arm64)
✨  Automatically selected the docker driver. Other choices: qemu2, ssh, vfkit (experimental)
📌  Using Docker Desktop driver with root privileges
👍  Starting "minikube" primary control-plane node in "minikube" cluster
🚜  Pulling base image v0.0.45-1727108449-19696 ...
🔥  Creating docker container (CPUs=2, Memory=4000MB) ...
🐳  Preparing Kubernetes v1.31.1 on Docker 27.3.1 ...
    ▪ Generating certificates and keys ...
    ▪ Booting up control plane ...
    ▪ Configuring RBAC rules ...
🔗  Configuring bridge CNI (Container Networking Interface) ...
🔎  Verifying Kubernetes components...
    ▪ Using image gcr.io/k8s-minikube/storage-provisioner:v5
    ▪ Using image registry.k8s.io/ingress-nginx/kube-webhook-certgen:v1.4.3
    ▪ Using image gcr.io/k8s-minikube/gcp-auth-webhook:v0.1.2
🔎  Verifying gcp-auth addon...
📌  Your GCP credentials will now be mounted into every pod created in the minikube cluster.
📌  If you don't want your credentials mounted into a specific pod, add a label with the `gcp-auth-skip-secret` key to your pod configuration.
📌  If you want existing pods to be mounted with credentials, either recreate them or rerun addons enable with --refresh.
🌟  Enabled addons: storage-provisioner, default-storageclass, gcp-auth
🏄  Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default

$ kubectl run --rm registry-test --restart=Never --image=gcr.io/k8s-minikube/busybox -it -- sh -c "wget --spider -S http://registry.kube-system.svc.cluster.local"
pod "registry-test" deleted
error: timed out waiting for the condition

$ kubectl get pods -A
NAMESPACE     NAME                               READY   STATUS         RESTARTS   AGE
default       registry-test                      0/1     ErrImagePull   0          5s

$ kubectl describe pods registry-test
...
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  15s               default-scheduler  Successfully assigned default/registry-test to minikube
  Normal   BackOff    13s               kubelet            Back-off pulling image "gcr.io/k8s-minikube/busybox"
  Warning  Failed     13s               kubelet            Error: ImagePullBackOff
  Normal   Pulling    3s (x2 over 15s)  kubelet            Pulling image "gcr.io/k8s-minikube/busybox"
  Warning  Failed     2s (x2 over 14s)  kubelet            Failed to pull image "gcr.io/k8s-minikube/busybox": Error response from daemon: Head "https://gcr.io/v2/k8s-minikube/busybox/manifests/latest": unauthorized: authentication failed
  Warning  Failed     2s (x2 over 14s)  kubelet            Error: ErrImagePull

How does this affect the registry test?

Two factors combine. First, as mentioned in the timeline above, the GCP-Auth test was moved ahead of the rest of the tests a month before the migration to AR. Second, the GCP-Auth test also tries to pull a busybox image from gcr.io/k8s-minikube. That pull fails because of the issue described above, which leads to a call to t.Fatal, so the test exits before it disables the GCP-Auth addon. When the registry test runs, the GCP-Auth addon is therefore still enabled with the mock credentials, and the command the registry test executes, which also pulls a busybox image, fails.
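The effect of t.Fatal on teardown can be sketched without a cluster. The following is a minimal simulation, not the minikube test code: `addon`, `fatal`, `run`, `leakyTest`, and `safeTest` are hypothetical names, and the assumption is that the real test disables the addon as an ordinary statement near the end of its body. `t.Fatal` stops the test goroutine through `runtime.Goexit`, which still runs deferred calls but skips every statement after the failure.

```go
package main

import (
	"fmt"
	"runtime"
)

// addon is a hypothetical stand-in for the cluster's GCP-Auth addon state.
type addon struct{ enabled bool }

// fatal mimics t.Fatal: runtime.Goexit ends the goroutine, runs its
// deferred calls, and skips the rest of the function body.
func fatal() { runtime.Goexit() }

// run executes body on its own goroutine, as the testing package does
// for each test, and waits until that goroutine has exited.
func run(body func()) {
	done := make(chan struct{})
	go func() {
		defer close(done)
		body()
	}()
	<-done
}

// leakyTest disables the addon as the last statement of the body, so a
// fatal failure skips it. It reports whether the addon is still enabled.
func leakyTest() bool {
	a := &addon{}
	run(func() {
		a.enabled = true
		fatal()           // the image pull failed
		a.enabled = false // never reached
	})
	return a.enabled
}

// safeTest registers the disable step with defer, so it runs even after
// a fatal failure. It reports whether the addon is still enabled.
func safeTest() bool {
	a := &addon{}
	run(func() {
		a.enabled = true
		defer func() { a.enabled = false }()
		fatal() // the image pull failed
	})
	return a.enabled
}

func main() {
	fmt.Println("still enabled, disable at end of body:", leakyTest())
	fmt.Println("still enabled, disable deferred:", safeTest())
}
```

In a real test, `defer` or `t.Cleanup` would play the role of the deferred call here. Whether that is the right fix for the minikube test is a separate question; the sketch only shows why a fatal failure can leave the addon enabled for later tests.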

I don't see the GCP-Auth test failing though

https://storage.googleapis.com/minikube-builds/logs/master/35974/Docker_Linux.html

Correct. The gopogh output above shows no failures in GCP-Auth, but the test is actually failing and the failure is being suppressed. Looking at the raw JSON logs, I found the following:

{"Time":"2024-08-27T23:15:10.547073291Z","Action":"output","Test":"TestAddons/serial/GCPAuth","Output":"    addons_test.go:704: (dbg) TestAddons/serial/GCPAuth: waiting 8m0s for pods matching \"integration-test=busybox\" in namespace \"default\" ...\n"}
{"Time":"2024-08-27T23:15:10.550185311Z","Action":"output","Test":"TestAddons/serial/GCPAuth","Output":"    helpers_test.go:344: \"busybox\" [3c0f1b89-73c9-47ff-b180-16b49b9cb882] Pending / Ready:ContainersNotReady (containers with unready status: [busybox]) / ContainersReady:ContainersNotReady (containers with unready status: [busybox])\n"}
{"Time":"2024-08-27T23:23:10.5476429Z","Action":"output","Test":"TestAddons/serial/GCPAuth","Output":"    helpers_test.go:329: TestAddons/serial/GCPAuth: WARNING: pod list for \"default\" \"integration-test=busybox\" returned: client rate limiter Wait returned an error: context deadline exceeded\n"}
{"Time":"2024-08-27T23:23:10.54769183Z","Action":"output","Test":"TestAddons/serial/GCPAuth","Output":"    addons_test.go:704: ***** TestAddons/serial/GCPAuth: pod \"integration-test=busybox\" failed to start within 8m0s: context deadline exceeded ****\n"}
{"Time":"2024-08-27T23:23:10.547702276Z","Action":"output","Test":"TestAddons/serial/GCPAuth","Output":"    addons_test.go:704: (dbg) Run:  out/minikube-linux-amd64 status --format={{.APIServer}} -p addons-029048 -n addons-029048\n"}
{"Time":"2024-08-27T23:23:10.84226332Z","Action":"output","Test":"TestAddons/serial/GCPAuth","Output":"    addons_test.go:704: TestAddons/serial/GCPAuth: showing logs for failed pods as of 2024-08-27 23:23:10.842143419 +0000 UTC m=+743.194308226\n"}
{"Time":"2024-08-27T23:23:10.842291126Z","Action":"output","Test":"TestAddons/serial/GCPAuth","Output":"    addons_test.go:704: (dbg) Run:  kubectl --context addons-029048 describe po busybox -n default\n"}
{"Time":"2024-08-27T23:23:10.909691302Z","Action":"output","Test":"TestAddons/serial/GCPAuth","Output":"    addons_test.go:704: (dbg) kubectl --context addons-029048 describe po busybox -n default:\n"}

This is due to how gopogh handles parent tests: if a test has any child tests, its own result is suppressed. This prevents a single child test failure from showing a failure for every parent in the chain and cluttering the output. For example, if two tests failed and each had four parents, gopogh shows only two failures instead of 10. The GCP-Auth test has a child test, Namespaces, so any failure in the GCP-Auth test itself is suppressed.
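The suppression rule can be illustrated with a small filter. This is a sketch of the behavior described above, not gopogh's actual code: `result`, `hasChild`, `shownFailures`, `chain`, and `gcpAuthCase` are hypothetical names, and the full child test name `TestAddons/serial/GCPAuth/Namespaces` is assumed from the usual Go subtest naming.

```go
package main

import (
	"fmt"
	"strings"
)

// result is a hypothetical, simplified record of one test outcome.
type result struct {
	name   string
	failed bool
}

// hasChild reports whether any test in all is nested under name.
func hasChild(name string, all []result) bool {
	for _, r := range all {
		if strings.HasPrefix(r.name, name+"/") {
			return true
		}
	}
	return false
}

// shownFailures returns the failed tests that have no children; the
// results of tests with children are suppressed.
func shownFailures(all []result) []string {
	var out []string
	for _, r := range all {
		if r.failed && !hasChild(r.name, all) {
			out = append(out, r.name)
		}
	}
	return out
}

// chain builds a failed result for every level of a nested test name.
func chain(parts ...string) []result {
	var out []result
	for i := range parts {
		out = append(out, result{name: strings.Join(parts[:i+1], "/"), failed: true})
	}
	return out
}

// gcpAuthCase models the situation in this issue: the GCP-Auth test
// itself fails while its child test passes.
func gcpAuthCase() []result {
	return []result{
		{name: "TestAddons", failed: true},
		{name: "TestAddons/serial", failed: true},
		{name: "TestAddons/serial/GCPAuth", failed: true},
		{name: "TestAddons/serial/GCPAuth/Namespaces", failed: false}, // assumed name
	}
}

func main() {
	// Two failed leaf tests, each with four failed parents: 10 failed entries.
	all := append(chain("A", "B", "C", "D", "leaf1"), chain("E", "F", "G", "H", "leaf2")...)
	fmt.Println(len(all), "failed entries,", len(shownFailures(all)), "shown")

	fmt.Println("GCP-Auth case, failures shown:", len(shownFailures(gcpAuthCase())))
}
```

Under this rule the only tests that can appear as failed are leaves, so a parent that fails in its own body while all of its children pass produces no visible failure.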

Action Items