k8s-infra-ci-robot closed this PR 2 days ago
@k8s-infra-ci-robot: GitHub didn't allow me to request PR reviews from the following users: nathanperkins.
Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: k8s-infra-ci-robot
Once this PR has been reviewed and has the lgtm label, please assign bentheelder for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
@k8s-infra-ci-robot: Updated the job-config configmap in namespace default at cluster test-infra-trusted using the following files:
- cloud-provider-kind-periodic.yaml using file config/jobs/kubernetes-sigs/cloud-provider-kind/cloud-provider-kind-periodic.yaml
- cloud-provider-kind-presubmits.yaml using file config/jobs/kubernetes-sigs/cloud-provider-kind/cloud-provider-kind-presubmits.yaml
- cluster-api-provider-azure-presubmits-main.yaml using file config/jobs/kubernetes-sigs/cluster-api-provider-azure/cluster-api-provider-azure-presubmits-main.yaml
- kind-presubmits.yaml using file config/jobs/kubernetes-sigs/kind/kind-presubmits.yaml
- kind-release-blocking.yaml using file config/jobs/kubernetes-sigs/kind/kind-release-blocking.yaml
- kind.yaml using file config/jobs/kubernetes-sigs/kind/kind.yaml
- hnc-e2e.yaml using file config/jobs/kubernetes-sigs/wg-multi-tenancy/hnc-e2e.yaml
- mtb-presubmit.yaml using file config/jobs/kubernetes-sigs/wg-multi-tenancy/mtb-presubmit.yaml
- conformance-audit.yaml using file config/jobs/kubernetes/sig-arch/conformance-audit.yaml
- sig-instrumentation-kind-periodics.yaml using file config/jobs/kubernetes/sig-instrumentation/sig-instrumentation-kind-periodics.yaml
- sig-instrumentation-presubmit.yaml using file config/jobs/kubernetes/sig-instrumentation/sig-instrumentation-presubmit.yaml
- sig-k8s-infra-test-infra.yaml using file config/jobs/kubernetes/sig-k8s-infra/trusted/sig-k8s-infra-test-infra.yaml
- sig-test-infra.yaml using file config/jobs/kubernetes/sig-k8s-infra/trusted/sig-test-infra.yaml
- sig-network-kind.yaml using file config/jobs/kubernetes/sig-network/sig-network-kind.yaml
- sig-node-presubmit.yaml using file config/jobs/kubernetes/sig-node/sig-node-presubmit.yaml
- 1.27.yaml using file config/jobs/kubernetes/sig-release/release-branch-jobs/1.27.yaml
- 1.28.yaml using file config/jobs/kubernetes/sig-release/release-branch-jobs/1.28.yaml
- 1.29.yaml using file config/jobs/kubernetes/sig-release/release-branch-jobs/1.29.yaml
- 1.30.yaml using file config/jobs/kubernetes/sig-release/release-branch-jobs/1.30.yaml
- sig-scheduling-config.yaml using file config/jobs/kubernetes/sig-scheduling/sig-scheduling-config.yaml
- sig-storage-kind.yaml using file config/jobs/kubernetes/sig-storage/sig-storage-kind.yaml
- conformance-e2e.yaml using file config/jobs/kubernetes/sig-testing/conformance-e2e.yaml
- kubernetes-kind.yaml using file config/jobs/kubernetes/sig-testing/kubernetes-kind.yaml
@dims this broke kind.
EDIT: this was a useless comment without further context and verification; in the future I will refrain from jumping to that assumption so quickly, apologies.
... confirming here https://github.com/kubernetes-sigs/kind/pull/648#issuecomment-2201166628
integration tests hitting the same issue as https://kubernetes.slack.com/archives/CEKK1KTN2/p1719537867758879 by the looks of it.
... pretty sure anyhow; it appears to be a problematic docker upgrade breaking IPv6 stuff.
cc @aojea we're going to have to dig into this; as soon as I get a failure on the no-op PR I will revert this, and then we can look into a rollback
looks like major changes to ipv6 in docker.
... ok, now that is weird https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kind/648/pull-kind-test/1807900248492740608
it ... passed?
.... but none of the changes in https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kind/3676/pull-kind-test/1807897149162131456 should affect this ...
I think maybe it became racy, with something else having modprobed on the node, but I'm still suspicious of a docker upgrade.
Let me pull one of these images and confirm what docker version it has ...
The other thought is that it is mis-attributed to this change and is instead the build cluster.
Yeah, we picked up docker 27.x:
$ docker run --rm --entrypoint=docker gcr.io/k8s-staging-test-infra/krte:v20240627-6dd397d329-master version
Unable to find image 'gcr.io/k8s-staging-test-infra/krte:v20240627-6dd397d329-master' locally
v20240627-6dd397d329-master: Pulling from k8s-staging-test-infra/krte
fea1432adf09: Pull complete
910334bc68b1: Pull complete
af497ffc85e2: Pull complete
Digest: sha256:a9b0127377d84aadbf9729fc4a5c7bf5f9aadb07e3893f4512157c96ce47f78c
Status: Downloaded newer image for gcr.io/k8s-staging-test-infra/krte:v20240627-6dd397d329-master
Client: Docker Engine - Community
 Version:        27.0.2
 API version:    1.46
 Go version:     go1.21.11
 Git commit:     912c1dd
 Built:          Wed Jun 26 18:47:36 2024
 OS/Arch:        linux/amd64
 Context:        default
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Tracking the kind side of this at https://github.com/kubernetes-sigs/kind/issues/3677
> Yeah, we picked up docker 27.x:
Ugh! do we want to revert?
Trying to figure out if it's causing issues in other CI, but I had to go out. It's only a light flake in kind, but I'm concerned that we're going to find more issues: it's causing network creation to flake and some behavior changes in IPv6 networking.
Let's leave it for the moment.
Ok, what seems to be happening is that docker now REQUIRES ip6tables.
We had a knob to enable this that installed the required module, and now some jobs do not seem to have it:
network_integration_test.go:63: "Error response from daemon: Failed to Setup IP tables: Unable to enable NAT rule: (iptables failed: ip6tables --wait -t nat -I POSTROUTING -s fc00:3051:9942:af9f::/64 ! -o br-4e53c7863d0d -j MASQUERADE: modprobe: FATAL: Module ip6_tables not found in directory /lib/modules/5.15.0-1054-gke\nip6tables v1.8.9 (legacy): can't initialize ip6tables table `nat': Table does not exist (do you need to insmod?)\nPerhaps ip6tables or your kernel needs to be upgraded.\n (exit status 3))\n"
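That failure comes from docker setting up ip6tables NAT while creating an IPv6-enabled network. A minimal reproduction sketch (assuming a dind environment where the ip6_tables module is not loaded and /lib/modules isn't mounted; the network name and subnet are arbitrary) would be:

```sh
# Reproduction sketch: create an IPv6-enabled network in a docker-in-docker
# environment where the ip6_tables kernel module is not loaded and cannot be
# modprobed (no /lib/modules from the host).
docker network create --ipv6 --subnet fc00:db8:1::/64 repro-net
# On docker 27.x this should fail with the same "Unable to enable NAT rule ...
# Module ip6_tables not found" error, since the daemon now programs ip6tables
# for IPv6-enabled networks.
```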
@BenTheElder we are back in 2019 😄, my memory may fail me, but I think that some images didn't have that module?
Is this a GKE cluster/pool issue? Based on FATAL: Module ip6_tables not found in directory /lib/modules/5.15.0-1054-gke\nip6tables v1.8.9 (legacy)
- we could try the EKS prow cluster then
> Is this a GKE cluster/pool issue? Based on FATAL: Module ip6_tables not found in directory /lib/modules/5.15.0-1054-gke\nip6tables v1.8.9 (legacy) - we could try the EKS prow cluster then
On the ubuntu nodes it's just not loaded by default but the module is available, since the host nodes are IPv4.
Previously we added the modprobe only when we intended to use IPv6 and then enabled it in docker, but now IPv6 is always enabled.
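Roughly, that opt-in looked like the sketch below (the env var name and exact module list are assumptions for illustration, not the actual test-infra wrapper code):

```sh
# Illustrative sketch of the old opt-in: only jobs that asked for IPv6 loaded
# the modules before starting dockerd. The env var name is an assumption.
if [ "${DOCKER_IN_DOCKER_IPV6_ENABLED:-false}" = "true" ]; then
  # Needs the host's /lib/modules to be visible inside the container.
  modprobe ip6_tables
  modprobe ip6table_nat
fi
# With docker 27.x the daemon touches ip6tables regardless of this knob, so
# jobs that never opted in now hit the missing module.
```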
let's use https://github.com/kubernetes-sigs/kind/issues/3677 to track, even though technically this affects docker networks in all jobs, we do not have evidence yet that other jobs are creating networks (I could see this causing issues for the default bridge but we have no proof yet).
I think this is resolved now, if not fully cleanly; I'll keep an eye out for any other issues.
We're ensuring we load the ipv6 NAT module when setting up dind, and having all dind jobs mount /lib/modules.
We can do something more clever in the future.
There may be other issues from the v27 changes, but I'm not seeing them yet.
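For reference, the shape of that fix (module names and the exact hook point are my reading of the comment above, not the verbatim change) is roughly:

```sh
# Sketch of the fix: the prow job pod mounts the host's /lib/modules so
# modprobe can find modules for the running kernel, and the dind setup loads
# the IPv6 NAT module before starting dockerd.
modprobe ip6table_nat   # pulls in ip6_tables as a dependency

# Sanity check: the ip6tables nat table should now exist.
ip6tables -t nat -L >/dev/null && echo "ip6tables nat table available"
```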
No gcr.io/k8s-testimages/ changes.
Multiple distinct gcr.io/k8s-staging-test-infra changes:
No us-central1-docker.pkg.dev/k8s-staging-test-infra/images changes.
No gcr.io/k8s-staging-apisnoop/ changes.
/cc @nathanperkins