kubernetes / test-infra

Test infrastructure for the Kubernetes project.
Apache License 2.0

Update k8s-staging-test-infra GCR images as needed #32863

Closed. k8s-infra-ci-robot closed this 2 days ago

k8s-infra-ci-robot commented 5 days ago

No gcr.io/k8s-testimages/ changes.

Multiple distinct gcr.io/k8s-staging-test-infra changes:

Commits | Dates | Images
--- | --- | ---
https://github.com/kubernetes/test-infra/compare/69ac5748ba...6dd397d329 | 2024‑02‑05 → 2024‑06‑27 | bigquery
https://github.com/kubernetes/test-infra/compare/3b134c2624...6dd397d329 | 2024‑03‑08 → 2024‑06‑27 | bootstrap
https://github.com/kubernetes/test-infra/compare/597c402033...1dde27f6a9 | 2024‑06‑11 → 2024‑06‑25 | kubekins-e2e(1.29), kubekins-e2e(master)
https://github.com/kubernetes/test-infra/compare/1dde27f6a9...6dd397d329 | 2024‑06‑25 → 2024‑06‑27 | krte(1.27), krte(1.28), krte(1.29), krte(1.30), krte(experimental), krte(master)

No us-central1-docker.pkg.dev/k8s-staging-test-infra/images changes.

No gcr.io/k8s-staging-apisnoop/ changes.

/cc @nathanperkins

k8s-ci-robot commented 5 days ago

@k8s-infra-ci-robot: GitHub didn't allow me to request PR reviews from the following users: nathanperkins.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to [this](https://github.com/kubernetes/test-infra/pull/32863):

> No gcr.io/k8s-testimages/ changes.
>
> Multiple distinct gcr.io/k8s-staging-test-infra changes:
>
> Commits | Dates | Images
> --- | --- | ---
> https://github.com/kubernetes/test-infra/compare/597c402033...1dde27f6a9 | 2024‑06‑11 → 2024‑06‑25 | kubekins-e2e(master)
>
> No us-central1-docker.pkg.dev/k8s-staging-test-infra/images changes.
>
> No gcr.io/k8s-staging-apisnoop/ changes.
>
> No gcr.io/k8s-staging-apisnoop/ changes.
>
> /cc @nathanperkins

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

k8s-ci-robot commented 5 days ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: k8s-infra-ci-robot

Once this PR has been reviewed and has the lgtm label, please assign bentheelder for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[config/OWNERS](https://github.com/kubernetes/test-infra/blob/master/config/OWNERS)**
- **[config/jobs/kubernetes/sig-k8s-infra/trusted/OWNERS](https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-k8s-infra/trusted/OWNERS)**
- **[images/kubekins-e2e/OWNERS](https://github.com/kubernetes/test-infra/blob/master/images/kubekins-e2e/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

k8s-ci-robot commented 2 days ago

@k8s-infra-ci-robot: Updated the job-config configmap in namespace default at cluster test-infra-trusted using the following files:

BenTheElder commented 1 day ago

@dims this broke kind.

EDIT: this is a useless comment without further context and verification. In the future I will refrain from jumping to that assumption so quickly, apologies.

BenTheElder commented 1 day ago

... confirming here https://github.com/kubernetes-sigs/kind/pull/648#issuecomment-2201166628

integration tests hitting the same issue as https://kubernetes.slack.com/archives/CEKK1KTN2/p1719537867758879 by the looks of it.

BenTheElder commented 1 day ago

... pretty sure anyhow, appears to be a problematic docker upgrade breaking IPv6 stuff.

cc @aojea we're going to have to dig into this. As soon as I get a failure on the no-op PR I will revert this, and then we can look into rollback.

looks like major changes to ipv6 in docker.

BenTheElder commented 1 day ago

... ok, now that is weird https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kind/648/pull-kind-test/1807900248492740608

it ... passed?

.... but none of the changes in https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kind/3676/pull-kind-test/1807897149162131456 should affect this ...

BenTheElder commented 1 day ago

I think maybe it became racy, with something else having modprobed on the node, but I'm still suspicious of a docker upgrade because:

Let me pull one of these images and confirm what docker version it has ...

The other thought is that it is mis-attributed to this change and is instead the build cluster.

BenTheElder commented 1 day ago

Yeah, we picked up docker 27.x:

    $ docker run --rm --entrypoint=docker gcr.io/k8s-staging-test-infra/krte:v20240627-6dd397d329-master version
    Unable to find image 'gcr.io/k8s-staging-test-infra/krte:v20240627-6dd397d329-master' locally
    v20240627-6dd397d329-master: Pulling from k8s-staging-test-infra/krte
    fea1432adf09: Pull complete
    910334bc68b1: Pull complete
    af497ffc85e2: Pull complete
    Digest: sha256:a9b0127377d84aadbf9729fc4a5c7bf5f9aadb07e3893f4512157c96ce47f78c
    Status: Downloaded newer image for gcr.io/k8s-staging-test-infra/krte:v20240627-6dd397d329-master
    Client: Docker Engine - Community
     Version:           27.0.2
     API version:       1.46
     Go version:        go1.21.11
     Git commit:        912c1dd
     Built:             Wed Jun 26 18:47:36 2024
     OS/Arch:           linux/amd64
     Context:           default
    Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

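
The version check above could be scripted along these lines. This is only a minimal sketch: the helper names `docker_major` and `is_docker_27_or_newer` are made up for illustration, and the version string is hard-coded from the output above rather than read from a live daemon.

```shell
# Hypothetical helper: print the major component of a dotted version
# string, e.g. "27.0.2" -> "27".
docker_major() {
  printf '%s\n' "${1%%.*}"
}

# Hypothetical helper: succeed (exit 0) when the major component is 27 or higher.
is_docker_27_or_newer() {
  [ "$(docker_major "$1")" -ge 27 ]
}

# Version taken from the output above. On a host with a docker CLI the string
# could instead come from: docker version --format '{{.Client.Version}}'
if is_docker_27_or_newer "27.0.2"; then
  echo "docker 27.x or newer detected"
fi
```

A check like this only flags the version bump; it says nothing about whether the IPv6 behavior change actually affects a given job.
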
BenTheElder commented 1 day ago

Tracking the kind side of this at https://github.com/kubernetes-sigs/kind/issues/3677

dims commented 1 day ago

> Yeah, we picked up docker 27.x:

Ugh! Do we want to revert?

BenTheElder commented 1 day ago

Trying to figure out if it's causing issues in other CI, but I had to go out. It's only a light flake in kind, but I'm concerned that we're going to find more issues: it's causing network creation to flake, and there is some behavior change in IPv6 networking.

Let's leave it for the moment.

aojea commented 1 day ago

OK, what seems to be happening is that docker now REQUIRES ip6tables.

We had a knob to enable this

https://github.com/kubernetes/test-infra/blob/0f25764b482278fa61a6ed9ccdee4569947dba80/images/bootstrap/runner.sh#L46-L47

that installed the required module

https://github.com/kubernetes/test-infra/blob/0f25764b482278fa61a6ed9ccdee4569947dba80/images/bootstrap/runner.sh#L61-L62

and now some jobs do not seem to have it

network_integration_test.go:63: "Error response from daemon: Failed to Setup IP tables: Unable to enable NAT rule:  (iptables failed: ip6tables --wait -t nat -I POSTROUTING -s fc00:3051:9942:af9f::/64 ! -o br-4e53c7863d0d -j MASQUERADE: modprobe: FATAL: Module ip6_tables not found in directory /lib/modules/5.15.0-1054-gke\nip6tables v1.8.9 (legacy): can't initialize ip6tables table `nat': Table does not exist (do you need to insmod?)\nPerhaps ip6tables or your kernel needs to be upgraded.\n (exit status 3))\n"
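
To triage job logs for this failure, one could grep for its two signatures. The sketch below assumes the log text arrives on stdin; the function name `has_ip6tables_module_error` is hypothetical, and the match strings are copied from the error above.

```shell
# Hypothetical helper: succeed when the text on stdin contains either
# signature of the failure quoted above.
has_ip6tables_module_error() {
  grep -q -e 'Module ip6_tables not found' -e "can't initialize ip6tables table"
}

# Example input: a fragment copied from the log line above.
sample='modprobe: FATAL: Module ip6_tables not found in directory /lib/modules/5.15.0-1054-gke'
if printf '%s\n' "$sample" | has_ip6tables_module_error; then
  echo "missing ip6_tables module: check modprobe and the /lib/modules mount"
fi
```
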

@BenTheElder we are back in 2019 😄 , my memory may fail, but I think that some images didn't have that module?

dims commented 1 day ago

Is this a GKE cluster/pool issue? Based on `FATAL: Module ip6_tables not found in directory /lib/modules/5.15.0-1054-gke\nip6tables v1.8.9 (legacy)`, we could try the EKS prow cluster then.

BenTheElder commented 1 day ago

> is this a GKE cluster/pool issue? based on FATAL: Module ip6_tables not found in directory /lib/modules/5.15.0-1054-gke\nip6tables v1.8.9 (legacy) - we could try the eks prow cluster then

On the Ubuntu nodes the module is available, it's just not loaded by default, since the host nodes are IPv4.

Previously we ran the modprobe only when we intended to use IPv6 and then enabled it in docker, but now IPv6 is always enabled

https://github.com/kubernetes/test-infra/pull/32890

BenTheElder commented 20 hours ago

Let's use https://github.com/kubernetes-sigs/kind/issues/3677 to track. Technically this affects docker networks in all jobs, but we do not have evidence yet that other jobs are creating networks (I could see this causing issues for the default bridge, but we have no proof yet).

BenTheElder commented 17 hours ago

I think this is resolved now, if not fully cleanly. I'll keep an eye out for any other issues.

We're ensuring we load the ipv6 NAT module when setting up dind, and having all dind jobs mount /lib/modules.
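
A rough sketch of the module-loading half of that fix follows. It is not the actual runner.sh change. The module name `ip6table_nat` is an assumption inferred from the `nat` table named in the error earlier in this thread, and both the `MODPROBE` override and the function name `load_ip6_nat_module` are hypothetical, added only so the function can be dry-run without root.

```shell
# MODPROBE is a hypothetical override so the function can be exercised
# without root; by default it calls the real modprobe.
MODPROBE="${MODPROBE:-modprobe}"

# Try to load the IPv6 NAT module and report failure early, instead of
# letting dockerd hit it later during network creation.
load_ip6_nat_module() {
  # Assumption: ip6table_nat is the module backing the ip6tables "nat" table.
  if "$MODPROBE" ip6table_nat; then
    echo "loaded ip6table_nat"
  else
    echo "could not load ip6table_nat; is the host's /lib/modules mounted?" >&2
    return 1
  fi
}

# Intended call site (not run here): before starting the docker daemon
# in the dind setup.
# load_ip6_nat_module || exit 1
```

The modprobe can only succeed inside the job container if the host's /lib/modules is visible there, which is why the mount goes hand in hand with the module load.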

We can do something more clever in the future.

There may be other issues from the v27 changes, but I'm not seeing them yet.