Add workaround for spurious retransmits leading to connection resets

aaronlehmann commented 8 years ago

There is a longstanding issue over at https://github.com/docker/distribution/issues/785 where users reported connection resets trying to push to an AWS-hosted registry from inside the AWS network. After months, we've finally narrowed this down to a bad interaction between spurious TCP retransmits and the NAT rules that Docker sets up for bridge networking.

Here is a summary of what happens:

1. For some reason, when an AWS EC2 machine connects to itself using its external-facing IP address, there are occasional packets with sequence numbers and timestamps that are far behind the rest.
2. Normally these packets would be ignored as spurious retransmits. However, because the packets fall outside the TCP window, Linux's conntrack module marks them invalid, and their destination addresses do not get rewritten by DNAT.
3. The packets are eventually interpreted as packets destined to the actual address/port in the IP/TCP headers. Since there is no flow matching these, the host sends a RST.
4. The RST terminates the actual NAT'd connection, since its source address and port match the NAT'd connection.

I think it would be hugely helpful for libnetwork to include a workaround for this. It has affected a lot of users trying to use the registry in AWS, and it presumably affects other Dockerized applications as well. While I'll reach out to AWS to point out the spurious retransmits, I don't know if they'll be able to fix them, and there may also be other environments with similar issues.

I've found two possible workarounds:

1. Turn on conntrack's "be liberal" flag: echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal. This causes conntrack/NAT to treat packets outside the TCP window as part of the flow being tracked, instead of marking them invalid and causing them to be handled by the host.
2. Add a rule to drop invalid packets instead of allowing them to trigger RSTs: iptables -I INPUT -m conntrack --ctstate INVALID -j DROP

Both of these can potentially affect non-Docker traffic. The former causes NAT to forward packets that it would otherwise err on the side of not forwarding, which seems relatively harmless, but it's a system-level setting, so it's not limited to Docker flows. The latter would drop any packets that conntrack deems invalid, system-wide, unless we added specific destination filters for the addresses/ports that Docker set up NAT rules for, which could add overhead.
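For illustration, the second workaround can be applied idempotently, so that running it twice does not stack duplicate rules. This is only a sketch, not what libnetwork does: ensure_invalid_drop and the IPTABLES override variable are hypothetical names, and it assumes an iptables that supports -C (check).

```shell
#!/bin/sh
# Sketch: insert the INVALID-drop rule only if it is not already present.
# IPTABLES is a hypothetical override so the function can be exercised without root.
IPTABLES="${IPTABLES:-iptables}"

ensure_invalid_drop() {
    # -C exits 0 when an identical rule already exists in the chain.
    if "$IPTABLES" -C INPUT -m conntrack --ctstate INVALID -j DROP 2>/dev/null; then
        echo "rule already present"
    else
        # -I inserts at the top of INPUT, ahead of any existing rules.
        "$IPTABLES" -I INPUT -m conntrack --ctstate INVALID -j DROP &&
            echo "rule inserted"
    fi
}

# Usage (as root): ensure_invalid_drop
```

The rule is still system-wide, so the caveat above about non-Docker traffic applies unchanged.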

It may be too late to hope for a workaround to be included in Docker 1.11, but anything we can do on this front will really improve the lives of Docker users on AWS.

thaJeztah commented 8 years ago

@aaronlehmann I saw that the linked issue turned out to be an issue with AWS. Is there still something that needs to be done in libnetwork?

aaronlehmann commented 8 years ago

@thaJeztah: This issue is a suggestion to work around problems like this in libnetwork. The problem came from a combination of invalid packets generated somewhere in AWS' infrastructure, and the NAT setup used by libnetwork reacting to those invalid packets by tearing down the connection. This means the invalid packets cause problems for Dockerized applications but they are harmless for most other setups. docker/docker#19532 revealed that this problem was also seen on a residential internet connection. I think there is value in finding a workaround.

jrabbit commented 8 years ago

I'm being bitten by this in production. What more information could I provide?

middleagedman commented 8 years ago

Same here. A simple Docker container build on an Arch Linux system on a residential connection, just trying to do a git clone from an HTTPS git site (Bitbucket).

GnuTLS recv error (-54): Error in the pull function.
Closing connection 1
error: RPC failed; result=56, HTTP code = 200
fatal: The remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed

BenSjoberg commented 7 years ago

Just ran into this on my office's internal network. Thankfully I found this page or all my hair would be ripped out by morning.

The iptables workaround did the trick for me, thanks very much for providing that. If it helps, I'm running Docker 1.11.2 on Ubuntu 16.04. Let me know if there's any more information I can give that would be useful.

GordonTheTurtle commented 6 years ago

@aaronlehmann It has been detected that this issue has not received any activity in over 6 months. Can you please let us know if it is still relevant:

For a bug: do you still experience the issue with the latest version?
For a feature request: was your request appropriately answered in a later version?

Thank you! This issue will be automatically closed in 1 week unless it is commented on. For more information please refer to https://github.com/docker/libnetwork/issues/1926

aaronlehmann commented 6 years ago

A fix was implemented in AWS. I don't think a workaround is necessary anymore.

mitchcapper commented 6 years ago

I will comment that this does happen on networks outside of AWS. The iptables fix does fix it; however, you first have to find this issue to learn that. The errors are very generic, so if implementing the fix in Docker is not a big deal, it would probably save some people many hours of research :)

vduglued commented 6 years ago

Any solution to this problem on a macOS host?

p53 commented 6 years ago

We have a similar problem: downloading a file into our Docker image from Nexus throws "connection reset by peer". Adding the iptables rule fixes it.

guillon commented 5 years ago

As has been reported multiple times (@middleagedman, @BenSjoberg, @mitchcapper, @p53), the iptables fix resolves the issue ('connection reset by peer', or a RST packet sent at the TCP level). Quick fix (ref @aaronlehmann): iptables -I INPUT -m conntrack --ctstate INVALID -j DROP

The issue actually occurs in any container running in the default bridge network. How frequently it occurs depends on a lot of factors (bandwidth, latency, host load), but it does occur at some point. Most of the time this issue is probably not understood and is wrongly attributed to a transient network partition, which it is not. It is a bug in the NAT setup installed by Docker.

We face this issue with a perfectly valid TCP client-server transfer (for instance a curl from a container downloading a large file through HTTP from an external server at high throughput). Do the very same download from the host directly and all is fine. Do it from a container on the same host and it breaks.

The problem, as already mentioned by @aaronlehmann, is that benign "invalid" packets to the SNAT'ed container (caused for instance by TCP window overflow due to high throughput but a slow client) are assigned to the host interface and incorrectly considered martians, which causes a connection reset. This is a limitation of conntrack, which does not differentiate perfectly legal packets causing TCP window overflow from actually malformed packets (all get treated as INVALID). Hence the need to drop any conntrack INVALID packet seen when installing SNAT'ed virtual networks.

This problem is referenced in several places, and it stems from this netfilter/conntrack limitation:
https://serverfault.com/a/312687
https://www.spinics.net/lists/netfilter/msg51409.html
Quoting the last link, from the netfilter mailing list:

If NAT is enabled, never ever let packets with INVALID state pass through, because NAT will skip them. Best regards, Jozsef

The source NAT rules in iptables are installed by Docker for its bridge network support, and they are incomplete. It should therefore be Docker's responsibility to set this up correctly. Apparently this was never fixed, hence my request to re-open this issue.

I can attempt to make a pull request if it helps, or I can open a new issue if needed. Just tell me.

Note that the abandoned pull request attempt #1129 does not fix the issue because the inserted rule does not drop the packets. There should be no filter on the destination because at that point the destination is not yet NAT'ed. Any conntrack INVALID packets in the filter INPUT chain have to be dropped, as in: iptables -I INPUT -m conntrack --ctstate INVALID -j DROP.

dcui commented 5 years ago

FYI: "/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal" has been gone since 2016-08-13 (see "netfilter: remove ip_conntrack* sysctl compat code" https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=adf0516845bcd0e626323c858ece28ee58c74455)

Now I think we should use "/proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal" instead.
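Since which file exists depends on the kernel version, a script can pick whichever path is present. A minimal sketch under that assumption; be_liberal_path, enable_be_liberal and the PROC_ROOT override are hypothetical names used only for illustration:

```shell
#!/bin/sh
# Sketch: enable conntrack's "be liberal" flag via whichever sysctl file exists.
# PROC_ROOT is a hypothetical override for testing; it defaults to /proc.
be_liberal_path() {
    root="${PROC_ROOT:-/proc}"
    # Prefer the current nf_conntrack name, fall back to the old compat name.
    for p in \
        "$root/sys/net/netfilter/nf_conntrack_tcp_be_liberal" \
        "$root/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal"
    do
        if [ -e "$p" ]; then
            echo "$p"
            return 0
        fi
    done
    return 1
}

enable_be_liberal() {
    path=$(be_liberal_path) || {
        echo "no be_liberal sysctl file found" >&2
        return 1
    }
    echo 1 > "$path"
}

# Usage (as root): enable_be_liberal
```

A setting written this way does not survive a reboot, so it would have to be reapplied at boot.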

johannesboon commented 5 years ago

FYI: This is also an issue for kubernetes that they are trying to solve with similar strategies:

https://github.com/kubernetes/kubernetes/pull/74840

guillon commented 5 years ago

Hi @aaronlehmann, I think that the issue was closed but never fixed. Could you consider re-opening it? Note that PR #2275 solves the issue.

unilynx commented 4 years ago

I'm using neither AWS nor Kubernetes, and I see the issue too between our office network (where our CI runners run) and external resources at DigitalOcean or maxmind.com. It generally manifests itself as

curl: (56) OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104

With tcpdump I see lost but then reappearing packets (they reappear after about 90ms or 200KB of data) triggering a RST. I'm not sure where the actual problem is; I'm assuming our ISP is doing something funky or a link aggregation is messing up packets. It happens mostly during quiet hours, and the actual network issue is something we probably have to live with, but a 90ms packet delay shouldn't terminate connections.

The liberal sysctl fixes our issue (and firewalling the RST probably would too), but as the issue is not AWS-specific (or even K8s-specific), I too think this issue should be reopened.
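A rough way to check whether conntrack is classifying such late packets as INVALID is to compare its invalid counter before and after reproducing the reset. This is a sketch that assumes conntrack-tools is installed and that conntrack -S prints per-CPU lines containing invalid=N fields; sum_invalid is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: sum the invalid=N fields from `conntrack -S` style output on stdin.
# The input format (space-separated key=value pairs per CPU) is an assumption.
sum_invalid() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^invalid=/) {
                split($i, kv, "=")
                total += kv[2]
            }
    } END { print total + 0 }'
}

# Usage (as root):
#   conntrack -S | sum_invalid    # before reproducing
#   conntrack -S | sum_invalid    # after reproducing
```

If the counter grows while the resets happen, that is consistent with the mechanism described in this issue, though it does not prove it on its own.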

rwkarg commented 4 years ago

This is impacting us as well just using docker. Should this issue be reopened?

leakingtapan commented 3 years ago

Had the same issue on GCP when downloading a large file from inside a container using curl. The iptables rule solves the problem for me. Another workaround was to use wget instead of curl, though this workaround might not be generally applicable.

ssup2 commented 3 years ago

Hello. To solve this problem, I developed a Kubernetes controller called network-node-manager. By simply deploying and configuring network-node-manager, you can set the iptables -I INPUT -m conntrack --ctstate INVALID -j DROP rule on all nodes of the cluster. Please try this and give me feedback. Thanks.

https://github.com/kakao/network-node-manager

karunchennuri commented 3 years ago

This issue pretty much exists in the non-AWS, non-GCP world as well. We run our clusters on-prem and were able to reproduce this issue, especially with outbound requests carrying larger payloads. Getting into details...

Problem: An app team reported an issue with their app's behavior. This app calls an external service with payloads of certain sizes. In cURL terms, it's nothing but passing JSON payloads in --data-raw. What was weird was that the requests went through fine with smaller payloads, but when the payload size reached a certain number of KB, the request went outbound through the firewall and was executed on the external service, but the response never reached the container. We thought it was an intermittent issue, but no: we could reproduce it 100% of the time with a certain request payload size.

Steps we took to narrow down:

1. To rule out bad behavior in the app itself due to a coding issue, we used the simplest possible client, i.e. running cURL directly from within the container instance over SSH.
2. We ran cURL with a smaller payload from the worker node where the container is hosted. This worked.
3. We ran cURL with a larger payload from the worker node. This worked.
4. We then ran cURL with a smaller payload from within the container (app instance). This worked.
5. We ran cURL with a larger payload from the container. This failed (intermittently at times).
6. We took packet captures on the container's virtual interface (overlay networking) and on the default eth0 interface.
7. Packet captures on the virtual interface showed no abnormal behavior, but the pcap on eth0 showed RSTs from the worker node to the external service within a second or two of the request being initiated.
8. We took captures on the external endpoint as well as on the firewall. All of them showed symptoms of the problem but not the root cause.
9. We tried running the same cURL in another clustered environment based on Kubernetes. We could reproduce this issue on every Docker runtime. Although Cloud Foundry uses Garden, it still delegates the job to runC, which is the container runtime based on Docker code.
10. For us, running echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal on the worker node did the trick! Thanks @aaronlehmann for taking the time to draft this issue. Had we not stumbled on it, I am not sure how many man-hours we would have spent troubleshooting.
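The RSTs seen in the eth0 capture can be isolated with a pcap filter on the TCP flags. A minimal sketch, assuming tcpdump is available on the worker node; rst_filter is a hypothetical helper and 203.0.113.10 is a documentation placeholder, not an address from this thread:

```shell
#!/bin/sh
# Sketch: build a pcap filter that matches only TCP segments with RST set.
# $1 (optional): remote host to narrow the capture to a single peer.
rst_filter() {
    f='(tcp[tcpflags] & tcp-rst != 0)'
    if [ -n "${1:-}" ]; then
        f="host $1 and $f"
    fi
    echo "$f"
}

# Usage (as root), capturing on the default interface:
#   tcpdump -ni eth0 "$(rst_filter 203.0.113.10)"
```

Running the same filter on both the container's virtual interface and eth0 makes it easier to see on which side the RST originates.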

Since this is a system-level setting that impacts not just Docker traffic, we are still looking for the best action for our environment. I am not inclined to recommend a resolution step; I just wanted to share our experience with this issue and how it took several man-hours of effort to identify the root cause. Reading through the responses above, I am curious to know how this was fixed in AWS, and whether a fix exists in any of the Docker releases (considering this issue showed up 4 years ago). If this is not yet fixed, what's the best way forward to reopen this issue?

akerouanton commented 11 months ago

The fact that AWS implemented a fix doesn't mean this issue disappeared. As mentioned by users above, this can still happen in some cases. I'll reopen it and I'll backport the PR submitted by @guillon into github.com/moby/moby in the upcoming weeks.

moby / libnetwork

Add workaround for spurious retransmits leading to connection resets #1090