moby / vpnkit

A toolkit for embedding VPN capabilities in your application
Apache License 2.0
1.09k stars 182 forks

Random outbound connection timeouts based on server load #587

Open mavci opened 2 years ago

mavci commented 2 years ago

Hello,

We've been experiencing random outbound connection timeouts based on server load for a very long time. After restarting the server, the problems go away, but after a while the timeouts start again. After some research I found these issues related to this topic:

https://github.com/docker/for-win/issues/8861
https://github.com/docker/for-mac/issues/3448
https://github.com/docker/for-mac/issues/6086
https://github.com/docker/for-win/issues/12671
https://github.com/docker/for-win/issues/12761

I didn't put the technical logs here because these issues contain the relevant logs and results that I already have. Is anyone else having similar issues, and does anyone have suggestions for a fix?

Thank you.

rossinineto commented 2 years ago

I reported in https://github.com/docker/for-win/issues/8861

Spenhouet commented 2 years ago

@djs55 Contrary to the title of this issue, this is not actually random at all, is very reproducible and has nothing to do with server load. Because of this our application no longer works with any version above docker desktop 4.5.0 on Mac and 4.5.1 on Windows. We now have to force all customers to downgrade. This is rather serious for us. We will share a reproduction soon.

EDIT: We might open a new issue since this one is so unspecific and off from the actual issue.

jdeitrick80 commented 1 year ago

I have also seen this issue in versions >4.5.1 including the latest version, but have found that it can also be triggered with low amounts of traffic. The following is how I have been able to reproduce the issue.

docker run --name session-test -it -v /mnt/c/Users/jde/sessions/test:/test python:buster bash
root@17c8b33e70e6:/# pip install --quiet requests
root@17c8b33e70e6:/# cd test/
root@17c8b33e70e6:/test# python sessions.py
10:52:18: request 1
Request complete, sleep 30
10:52:50: request 2
Request complete, sleep 120
10:54:51: request 3
Request complete, sleep 420
11:01:51: request 4
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

sessions.py

import time
from datetime import datetime

import requests

# A Session keeps connections alive in a pool and reuses them between requests.
s = requests.Session()

# Seconds to sleep after each request.
steps = [30, 120, 420, 420]
for step, delay in enumerate(steps, start=1):
    print(datetime.now().strftime("%H:%M:%S") + ": request " + str(step))
    s.get('https://wttr.in')
    print("Request complete, sleep " + str(delay))
    time.sleep(delay)

As has been mentioned before, if I look at a trace from the container's point of view, I only see TCP SYNs being sent out during the 4th attempt, after waiting 420s since the last request. Also, if I kill vpnkit while it is still trying the 4th attempt, then when vpnkit starts back up the 4th request is able to complete successfully.

I have also noticed some things that I do not think were previously mentioned. If I look at a trace from the host, I see the TCP SYNs going out and TCP SYN-ACKs coming back from the server, but these are not passed on to the container. If I start up another container while the first is unsuccessfully trying the 4th attempt, it also cannot reach the same destination, but it can reach other destinations.

docker run -it python:buster bash
root@14437db6e250:/# curl https://wttr.in
curl: (7) Failed to connect to wttr.in port 443: Connection timed out
root@14437db6e250:/# curl https://google.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="https://www.google.com/">here</A>.
</BODY></HTML>
root@14437db6e250:/# curl https://wttr.in
curl: (7) Failed to connect to wttr.in port 443: Connection timed out
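
The same comparison can be scripted rather than run by hand with curl. This is only a rough sketch: can_connect is a hypothetical helper name, the 5-second timeout is an arbitrary choice, and it only checks whether the TCP handshake completes, not TLS or HTTP.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to (host, port) completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS failures alike.
        return False
```

Run from a second container while the first one is stuck, can_connect("wttr.in") returning False while can_connect("google.com") returns True would match the curl results above.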

The cause of the issue seems to be related to using sessions with a client-side keep-alive interval of >=60s. If I change to a 30s client keep-alive interval, I do not run into the issue.

docker run --name session-test -it -v /mnt/c/Users/jde/sessions/test:/test python:buster bash
root@425db0e6590a:/# pip install --quiet requests
root@425db0e6590a:/# cd test/
root@425db0e6590a:/test# python sessions-ka30.py
11:41:35: request 1
Request complete, sleep 30
11:42:06: request 2
Request complete, sleep 120
11:44:06: request 3
Request complete, sleep 420
11:51:06: request 4
Request complete, sleep 420
root@425db0e6590a:/test#

sessions-ka30.py

import requests
from datetime import datetime
import time

import socket
from requests.adapters import HTTPAdapter

class HTTPAdapterWithSocketOptions(HTTPAdapter):
    def __init__(self, *args, **kwargs):
        self.socket_options = kwargs.pop("socket_options", None)
        super(HTTPAdapterWithSocketOptions, self).__init__(*args, **kwargs)

    def init_poolmanager(self, *args, **kwargs):
        if self.socket_options is not None:
            kwargs["socket_options"] = self.socket_options
        super(HTTPAdapterWithSocketOptions, self).init_poolmanager(*args, **kwargs)

# Send the first keep-alive probe after 30s of idle time, then one every 30s.
KEEPALIVE_INTERVAL = 30
adapter = HTTPAdapterWithSocketOptions(socket_options=[
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
    (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, KEEPALIVE_INTERVAL),
    (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, KEEPALIVE_INTERVAL),
])
s = requests.Session()
s.mount("http://", adapter)
s.mount("https://", adapter)

# Seconds to sleep after each request.
steps = [30, 120, 420, 420]
for step, delay in enumerate(steps, start=1):
    print(datetime.now().strftime("%H:%M:%S") + ": request " + str(step))
    s.get('https://wttr.in')
    print("Request complete, sleep " + str(delay))
    time.sleep(delay)
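
To check that the platform accepts these keep-alive options at all, they can be applied to a standalone socket and read back. This is a sketch under assumptions: keepalive_settings is a hypothetical helper, it tests a fresh socket rather than the ones requests opens internally, and TCP_KEEPIDLE and TCP_KEEPINTVL are skipped where Python's socket module does not define them.

```python
import socket


def keepalive_settings(interval):
    """Apply the keep-alive options to a fresh TCP socket and read them back."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        result = {
            "keepalive": sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
        }
        # TCP_KEEPIDLE and TCP_KEEPINTVL are not defined on every platform.
        if hasattr(socket, "TCP_KEEPIDLE") and hasattr(socket, "TCP_KEEPINTVL"):
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, interval)
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
            result["idle"] = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)
            result["interval"] = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL)
        return result
    finally:
        sock.close()
```

Calling keepalive_settings(30) inside the container returns a dict of the values the kernel reports, which can be compared with the KEEPALIVE_INTERVAL used in sessions-ka30.py.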

I hope this information helps in resolving the issue or provides a workaround for others experiencing it.

I have also added this information to https://github.com/docker/for-win/issues/8861

rossinineto commented 1 year ago

This issue was opened on 23 June and remains unsolved. The related issues above date back more than a year.

nk9 commented 1 year ago

I'm also running into this on Mac. The problem gets progressively more frequent until the Docker Desktop process (and thus vpnkit) is restarted. I've gone back to Docker Desktop 4.5.0 for the time being. Would really like to see this resolved so we can begin upgrading Docker again.

robertnisipeanu commented 1 year ago

This issue was fixed for me on macOS after editing ~/Library/Group\ Containers/group.com.docker/settings.json and changing vpnKitMaxPortIdleTime from 300 to 0 (a Docker Desktop restart is required afterwards). I made this change over a week ago and have not encountered the issue since.
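
For anyone who prefers to script this edit, here is a minimal sketch. The file path and the vpnKitMaxPortIdleTime key come from the comment above. The helper name set_max_port_idle_time is hypothetical, and the code assumes the settings file is plain JSON with that key at the top level. Back up the file first, and restart Docker Desktop afterwards.

```python
import json
from pathlib import Path

# Path taken from the comment above (macOS); adjust for other platforms.
SETTINGS_PATH = (
    Path.home() / "Library" / "Group Containers" / "group.com.docker" / "settings.json"
)


def set_max_port_idle_time(path, seconds=0):
    """Set vpnKitMaxPortIdleTime in the given settings file; return the old value."""
    path = Path(path)
    settings = json.loads(path.read_text())
    previous = settings.get("vpnKitMaxPortIdleTime")
    settings["vpnKitMaxPortIdleTime"] = seconds
    path.write_text(json.dumps(settings, indent=2))
    return previous
```

Calling set_max_port_idle_time(SETTINGS_PATH) would apply the change described above and return the previous value, so that it can be restored later.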

mavci commented 1 year ago

I was having this issue on a Linux server (not Docker Desktop), and it was fixed by changing the network mode to host. But I see this as a workaround, because we expect it to work normally with vpnkit and bridge network mode as well. I am still waiting for an update and a fix for this issue.

DirkvanWijk commented 1 year ago

I can also confirm that since Docker 4.6 I'm having the same issues.

nk9 commented 1 year ago

Sorry for the unsolicited tag, @djs55 and @avsm, but I was hoping to get some visibility on this. It's affecting lots of Docker Desktop users, who are sticking with 4.5 (February 2022) for now as a workaround. There are repro steps here and in the linked issue (https://github.com/docker/for-win/issues/8861), but I haven't seen any acknowledgement that the VPNKit team is aware of the issue. Apologies if I missed it!

tristanbrown commented 1 year ago

This is still an issue. Worsening timeouts in Docker containers after they've been running for a few days. This bug is affecting countless developers across all fields, and should be prioritized.

tutcugil commented 1 year ago

We have the same issue. Is there any update on this? It is a very critical bug and still isn't resolved.

djs55 commented 1 year ago

I've got an experimental developer build which might help. If you'd like to try it, it's here:

tutcugil commented 1 year ago

Hello @djs55, thank you, I will try it and let you know.

Best Regards

tutcugil commented 1 year ago

@djs55, our networking problem seems to be resolved so far; we are still observing. Will this experimental build be released in Docker soon?

OlegHudyma commented 1 year ago

@djs55 when can we expect a release version of this build?

I see that the latest version, 4.19, does not have this fix.

MrCSharp22 commented 1 year ago

This remains an issue on my end. Running Docker in WSL2 or old-school using Hyper-V on Windows 11. Version 4.19 didn't resolve the problem. Currently on version 4.20.1 (Build 110738), and the frequency of outbound connection timeouts (including DNS queries to various DNS servers) seems to have increased.

Any updates on this would be much appreciated. I have to restart the Docker instance every couple of hours, which is very inconvenient.

Edit:

Adding some more info on fixes I tried so far: