@alvaroaleman reports that setting TCP MSS clamping works around the issue, i.e. by executing once in the pod:
iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
This suggests an issue with nested Docker networking. Have there been any host or CNI changes on the cluster?
If not, it might make sense to set the clamping at the host level?
/sig k8s-infra
> @alvaroaleman reports that setting TCP MSS clamping works around the issue, i.e. by executing once in the pod:

That is an educated but unverified guess based on prior failures with similar symptoms, fwiw.
Shall we add the workaround iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu into the runner.sh script directly? I worked out a PR here: https://github.com/kubernetes/test-infra/pull/23757
> Shall we add the workaround iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu into the runner.sh script directly?
I think that's preferable to every subproject owner doing it themselves, and centralising the networking config, so that if it ever needs to change it can be changed in one place, is good.
I mostly didn't know enough about test-infra to know where to do that, so thanks for the PR.
Hi all. We've been debugging this on our cluster too (unrelated to the Kubernetes project); same failures since the 24th. The interesting part is that all the failures we experienced were to services hosted by fastly.
We mitigated this by setting --mtu on dind containers to 1440.
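For reference, a minimal sketch of how that kind of MTU override might be applied to a Docker-in-Docker daemon (illustrative only; exactly where the flag goes depends on how your dind entrypoint starts dockerd):

# Start the inner Docker daemon with a reduced MTU so its packets fit the outer network path.
# 1440 is the value we used; it is a guess, not a universally correct number.
dockerd --mtu=1440 &

# Roughly equivalent daemon.json, if the entrypoint reads /etc/docker/daemon.json instead:
# { "mtu": 1440 }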
Your host looks to be fastly too:
; <<>> DiG 9.10.6 <<>> dl-cdn.alpinelinux.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45124
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;dl-cdn.alpinelinux.org. IN A
;; ANSWER SECTION:
dl-cdn.alpinelinux.org. 3600 IN CNAME dualstack.d.sni.global.fastly.net.
dualstack.d.sni.global.fastly.net. 29 IN A 199.232.54.133
Might have spoken too soon: our lower MTU seemed to make things better, but we just got some more failures, this time for repo.maven.apache.org which is again fastly. Will attempt the iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu fix now.
Yeah, it is all fastly, and it happens because they block ICMP, so path MTU discovery can't work. Reducing the MTU might help if you reduce it low enough, but that is always a guess. Using the clamping is more reliable, but it obviously only works for TCP (which I think is sufficient here). The oldest report of this issue I am aware of is https://github.com/gliderlabs/docker-alpine/issues/307
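If you want to confirm the PMTUD blackhole from inside an affected pod, one quick check (a sketch, assuming a standard 1500-byte Ethernet MTU and an iputils ping that supports -M do) is:

# Send full-size packets with the Don't Fragment bit set; if these time out while the
# smaller ones succeed, something on the path is dropping the ICMP "fragmentation needed" replies.
ping -c 3 -M do -s 1472 dl-cdn.alpinelinux.org   # 1472 bytes payload + 28 bytes headers = 1500
ping -c 3 -M do -s 1372 dl-cdn.alpinelinux.org   # should succeed if an undersized path MTU is the issue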
From an academic perspective I am curious why this suddenly happens, are all of these GKE clusters? Was there some change in the GKE networking stack maybe?
@alvaroaleman that was my question too, we're on GKE. We observed it in one zone on Friday (europe-west4-b), and excluded that zone from our egress.
Over the weekend it started in west4-a and west4-c as well so certainly feels like a change was being rolled out per-zone.
We have an open ticket with Google but haven't had any confirmation yet (not convinced we will get any either, to be honest). Like you, whilst we have a workaround, I'd really like to understand why this is happening now.
Google has confirmed the issue in the support case:
> After reviewing the information you provided, we believe that you may be affected by a known issue: We have identified a Networking connecting issue impacting the GKE Docker workload. This is a high priority issue that we're working to resolve as soon as possible. Some customers may be experiencing a connection failure in Docker workflow to Fastly destinations and may receive a timeout error.
It was added to https://status.cloud.google.com/ as well
> We have identified a Networking connecting issue impacting the GKE Docker workload
I'm told there is a fix scheduled for Thursday.
The current suggested workaround is to add an initContainer to the Docker-in-Docker workload like this:
initContainers:
- name: workaround
  image: k8s.gcr.io/build-image/debian-iptables-amd64:buster-v1.6.7
  command:
  - sh
  - -c
  - "iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu"
  securityContext:
    capabilities:
      add:
      - NET_ADMIN
    privileged: true
/reopen
Holding this open until the Google Cloud incident is resolved.
Current status: the workaround is in runner.sh, which eventually found its way into the gcr.io/k8s-staging-test-infra/bootstrap and gcr.io/k8s-staging-test-infra/kubekins images this morning (see https://github.com/kubernetes/test-infra/pull/23757#issuecomment-929490515).
@spiffxp: Reopened this issue.
/milestone v1.23
/remove-sig k8s-infra
/sig testing
this is impacting kubernetes testing and happens on more than just infra managed by k8s-infra
/priority important-soon
Would have set to critical-urgent earlier given the number of subprojects that were blocked on this, but now that we have a workaround in place...
https://status.cloud.google.com/incidents/QSirAFiyN5yMeeE6GNxq is listed as resolved
It looks ok on the Cluster API AWS end, so happy to close this out.
> this is impacting kubernetes testing and happens on more than just infra managed by k8s-infra

I hadn't fully appreciated this; I thought most things were managed by k8s-infra.
I would like for us to roll this change back, but I'm wary that maybe we'll need it again someday.
So as to avoid a protracted roll out cycle, I'm planning on doing the following:
- gate the workaround behind a BOOTSTRAP_MTU_WORKAROUND env var, defaulted in-image to true... this will let us roll out changes selectively or globally by job config changes (1 PR) instead of propagating to kubekins (bootstrap change PR + bootstrap bump PR + kubekins bump PR)
- add env stanzas setting BOOTSTRAP_MTU_WORKAROUND=false to the preset-dind-enabled preset
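For illustration, a minimal sketch of what such an env-var gate could look like in a bootstrap-style shell script (hypothetical; the actual runner.sh change lives in the test-infra PRs referenced in this thread):

# Hypothetical gate for the MSS clamp (not the real runner.sh code).
# BOOTSTRAP_MTU_WORKAROUND defaults to true in-image; a job or preset can export it as "false" to opt out.
BOOTSTRAP_MTU_WORKAROUND="${BOOTSTRAP_MTU_WORKAROUND:-true}"
if [ "${BOOTSTRAP_MTU_WORKAROUND}" = "true" ]; then
  iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
fi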
That sounds good to me.
https://github.com/kubernetes/test-infra/pull/23918 - workaround was disabled for all jobs around 2021-10-06 6pm PDT
can confirm our previously failing jobs are still green this morning 👍
https://github.com/kubernetes/test-infra/pull/23955 - will disable the workaround by default in the image; once this ends up in kubekins we can remove the explicit setting via the preset
https://github.com/kubernetes/test-infra/pull/24007 - will remove the disable from the preset
/close
Workaround has been disabled but is available to re-enable via config change if needed
@spiffxp: Closing this issue.
What happened:
On or around 22 September we started seeing CI failures in prow:
Example log from error:
Ref run: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_secrets-s[…]e-csi-driver-image-scan/1441072726373568512/build-log.txt
What you expected to happen:
Successful builds.
How to reproduce it (as minimally and precisely as possible):
This happens consistently on all of our builds. We've tried reverting commits in PRs and cannot find anything related to the test case that would cause this. The same tests running on the k8s-infra-prow-build cluster succeed. We have also seen it fail on apt. We also tried setting and increasing CPU/Memory requests in https://github.com/kubernetes/test-infra/pull/23723 and https://github.com/kubernetes/test-infra/pull/23725 but were unsuccessful.
Please provide links to example occurrences, if any:
Anything else we need to know?:
Discussion thread in slack: https://kubernetes.slack.com/archives/C09QZ4DQB/p1632414457031600