kubernetes / test-infra

Test infrastructure for the Kubernetes project.

Test failures starting 22 September due to network timeouts #23741

Closed tam7t closed 3 years ago

tam7t commented 3 years ago

What happened:

On or around 22 September we started seeing CI failures in prow:

Example log from error:

Step 4/8 : RUN apk add --no-cache curl &&     curl -LO https://storage.googleapis.com/kubernetes-release/release/${KUBE_VERSION}/bin/linux/${ARCH}/kubectl &&     chmod +x kubectl
 ---> Running in d7e388707d87
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/main/x86_64/APKINDEX.tar.gz
{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 25m0s timeout","severity":"error","time":"2021-09-23T21:50:51Z"}

Ref run: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_secrets-s[…]e-csi-driver-image-scan/1441072726373568512/build-log.txt

What you expected to happen:

Successful builds.

How to reproduce it (as minimally and precisely as possible):

This happens consistently on all of our builds. We've tried reverting commits in PRs and cannot find anything related to the test case that would cause this. The same tests running on the k8s-infra-prow-build cluster succeed.

We have also seen it fail on apt:

  Connection timed out [IP: 151.101.194.132 80]
E: Failed to fetch http://deb.debian.org/debian/pool/main/u/util-linux/libblkid1_2.36.1-8_amd64.deb  Connection timed out [IP: 199.232.126.132 80]
E: Failed to fetch http://security.debian.org/debian-security/pool/updates/main/o/openssl/libssl1.1_1.1.1k-1%2bdeb11u1_amd64.deb  Connection timed out [IP: 151.101.2.132 80]
E: Failed to fetch http://security.debian.org/debian-security/pool/updates/main/o/openssl/openssl_1.1.1k-1%2bdeb11u1_amd64.deb  Connection timed out [IP: 151.101.194.132 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?

We also tried setting and increasing CPU/memory requests in https://github.com/kubernetes/test-infra/pull/23723 and https://github.com/kubernetes/test-infra/pull/23725 but were unsuccessful.

Please provide links to example occurrences, if any:

Anything else we need to know?:

Discussion thread in slack: https://kubernetes.slack.com/archives/C09QZ4DQB/p1632414457031600

randomvariable commented 3 years ago

@alvaroaleman reports that setting TCP MSS clamping works around the issue, i.e. executing this once in the pod:

iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

This suggests an issue with nested Docker networking. Have there been any host or CNI changes on the cluster?

If not, it might make sense to set the clamping at the host level?
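
If host-level clamping turns out to be the way to go, here is a rough sketch of applying the same rule on each node (the gcloud/SSH loop and the $ZONE handling are illustrative assumptions; a privileged DaemonSet would be the more Kubernetes-native way to roll it out):

# Sketch only: apply the clamp on every node of a single-zone cluster via SSH.
for node in $(kubectl get nodes -o name); do
  # kubectl returns names as "node/<name>"; strip the prefix for gcloud
  gcloud compute ssh "${node#node/}" --zone "${ZONE:?set ZONE first}" \
    --command='sudo iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu'
done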

/sig k8s-infra

alvaroaleman commented 3 years ago

@alvaroaleman reports that setting TCP MSS clamping works around the issue, i.e. executing this once in the pod:

That is an educated but unverified guess based on prior failures with similar symptoms, fwiw.

andyzhangx commented 3 years ago

Shall we add the workaround iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu into the runner.sh script directly? I worked out a PR here: https://github.com/kubernetes/test-infra/pull/23757
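
For context, a minimal sketch of what such a guard in runner.sh might look like (the env var name and the surrounding logic are assumptions for illustration; the actual change lives in the PR above):

# Illustrative sketch only; see PR #23757 for the real change.
if [[ "${DOCKER_IN_DOCKER_ENABLED:-false}" == "true" ]]; then
  # Clamp TCP MSS to the path MTU so TCP connections to CDNs that drop ICMP
  # (e.g. Fastly) don't hang on oversized segments from the nested Docker network.
  iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
fi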

randomvariable commented 3 years ago

Shall we add the workaround iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu into the runner.sh script directly?

I think that's preferable to every subproject owner doing it themselves. Centralising the networking config also means that if it ever needs to change, it can be done in one place.

I mostly didn't know enough about test-infra to know where to do that, so thanks for the PR.

Stono commented 3 years ago

Hi all. We've been debugging this on our cluster too (unrelated to the Kubernetes project), with the same failures since the 24th. The interesting part is that all of the failures we experienced were to services hosted by Fastly.

We mitigated this by setting --mtu on dind containers to 1440.
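
Roughly, that just means starting the inner Docker daemon with a lower MTU, e.g. (the entrypoint and exact wiring below are a sketch and may differ depending on your dind image):

# docker:dind entrypoint with a lower MTU for the inner daemon
dockerd-entrypoint.sh --mtu=1440

# or persist it via the daemon config inside the dind container
cat > /etc/docker/daemon.json <<'EOF'
{ "mtu": 1440 }
EOF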

Your host looks to be fastly too:

; <<>> DiG 9.10.6 <<>> dl-cdn.alpinelinux.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45124
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;dl-cdn.alpinelinux.org.        IN  A

;; ANSWER SECTION:
dl-cdn.alpinelinux.org. 3600    IN  CNAME   dualstack.d.sni.global.fastly.net.
dualstack.d.sni.global.fastly.net. 29 IN A  199.232.54.133

Stono commented 3 years ago

Might have spoken too soon: our lower MTU seemed to make things better, but we just got some more failures, this time for repo.maven.apache.org, which is again Fastly. Will attempt the iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu fix now.

alvaroaleman commented 3 years ago

Yeah, it is all Fastly, and it happens because they block ICMP, so path MTU discovery can't work. Reducing the MTU might help if you reduce it low enough, but that is always a guess. Using the clamping is more reliable, but it obviously only works for TCP (which I think is sufficient here). The oldest report of this issue I am aware of is https://github.com/gliderlabs/docker-alpine/issues/307

From an academic perspective I am curious why this suddenly started happening: are all of these GKE clusters? Was there maybe some change in the GKE networking stack?
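
For anyone wanting to confirm the diagnosis from inside an affected pod, a hedged sketch (the hostname is just the one from the dig output above; the sizes assume a 1500-byte interface MTU):

# Probe the path with the don't-fragment bit set; 1472 bytes of payload = 1500 minus 28 bytes of IP/ICMP headers.
# Depending on where the smaller MTU sits, this either reports "Frag needed" or, if the ICMP replies
# are dropped along the way, simply times out.
ping -c 3 -M do -s 1472 dl-cdn.alpinelinux.org
ping -c 3 -M do -s 1400 dl-cdn.alpinelinux.org   # a smaller payload should get through either way

# If clamping to the PMTU isn't an option, a fixed MSS is the cruder fallback (the value here is a guess):
iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1400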

Stono commented 3 years ago

@alvaroaleman that was my question too; we're on GKE. We observed it in one zone on Friday (europe-west4-b) and excluded that zone from our egress.

Over the weekend it started in west4-a and west4-c as well, so it certainly feels like a change was being rolled out per zone.

We have an open ticket with Google but haven't had any confirmation yet (not convinced we will get any either, to be honest). Like you, whilst we have a workaround, I'd really like to understand why now.

Saykar commented 3 years ago

Google has confirmed the issue in the support case: "After reviewing the information you provided, we believe that you may be affected by a known issue: We have identified a Networking connecting issue impacting the GKE Docker workload. This is a high priority issue that we're working to resolve as soon as possible. Some customers may be experiencing a connection failure in Docker workflow to Fastly destinations and may receive a timeout error."

CecileRobertMichon commented 3 years ago

It was added to https://status.cloud.google.com/ as well

We have identified a Networking connecting issue impacting the GKE Docker workload

mikesparr commented 3 years ago

I'm told there is a fix scheduled for Thursday.

The current suggested workaround is to add an initContainer to the Docker-in-Docker workload like this:

initContainers:
  - name: workaround
    image: k8s.gcr.io/build-image/debian-iptables-amd64:buster-v1.6.7
    command:
    - sh
    - -c
    - "iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu"
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
      privileged: true

spiffxp commented 3 years ago

/reopen

Holding this open until the Google Cloud incident is resolved.

Current status:

- https://status.cloud.google.com/incidents/QSirAFiyN5yMeeE6GNxq latest update claims Thursday for an ETA on a fix
- https://github.com/kubernetes/test-infra/pull/23757 put a workaround into `runner.sh` which eventually found its way into the `gcr.io/k8s-staging-test-infra/bootstrap` and `gcr.io/k8s-staging-test-infra/kubekins` images this morning (see https://github.com/kubernetes/test-infra/pull/23757#issuecomment-929490515)

k8s-ci-robot commented 3 years ago

@spiffxp: Reopened this issue.

In response to [this](https://github.com/kubernetes/test-infra/issues/23741#issuecomment-929722226):

> /reopen
> Holding this open until the Google Cloud incident is resolved.
>
> Current status:
> - https://status.cloud.google.com/incidents/QSirAFiyN5yMeeE6GNxq latest update claims Thursday for an ETA on a fix
> - https://github.com/kubernetes/test-infra/pull/23757 put a workaround into `runner.sh` which eventually found its way into the `gcr.io/k8s-staging-test-infra/bootstrap` and `gcr.io/k8s-staging-test-infra/kubekins` images this morning (see https://github.com/kubernetes/test-infra/pull/23757#issuecomment-929490515)

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
spiffxp commented 3 years ago

/milestone v1.23
/remove-sig k8s-infra
/sig testing

This is impacting Kubernetes testing and happens on more than just infra managed by k8s-infra.

/priority important-soon

Would have set to critical-urgent earlier given the number of subprojects that were blocked on this, but now that we have a workaround in place...

spiffxp commented 3 years ago

https://status.cloud.google.com/incidents/QSirAFiyN5yMeeE6GNxq is listed as resolved

randomvariable commented 3 years ago

It looks ok on the Cluster API AWS end, so happy to close this out.

randomvariable commented 3 years ago

this is impacting kubernetes testing and happens on more than just infra managed by k8s-infra

I hadn't fully appreciated this. Thought most things were managed by k8s-infra.

spiffxp commented 3 years ago

I would like for us to roll this change back, but I'm wary that maybe we'll need it again someday.

So as to avoid a protracted rollout cycle, I'm planning on doing the following:

- disable the workaround for all jobs via preset
- disable the workaround by default in the image
- once that lands in kubekins, remove the explicit setting from the preset

randomvariable commented 3 years ago

That sounds good to me.

spiffxp commented 3 years ago

https://github.com/kubernetes/test-infra/pull/23918 - workaround was disabled for all jobs around 2021-10-06 6pm PDT

CecileRobertMichon commented 3 years ago

can confirm our previously failing jobs are still green this morning 👍

spiffxp commented 3 years ago

https://github.com/kubernetes/test-infra/pull/23955 - will disable the workaround by default in the image; once this ends up in kubekins we can remove the explicit setting via preset

spiffxp commented 3 years ago

https://github.com/kubernetes/test-infra/pull/24007 - will remove disable from preset

spiffxp commented 3 years ago

/close

Workaround has been disabled but is available to re-enable via config change if needed.

k8s-ci-robot commented 3 years ago

@spiffxp: Closing this issue.

In response to [this](https://github.com/kubernetes/test-infra/issues/23741#issuecomment-960211068):

> /close
> Workaround has been disabled but is available to re-enable via config change if needed

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.