[Closed] squishycat92 closed this issue 2 years ago
@squishycat92 Hello! To debug connection issues, please first enable Libreswan logs in the container. See: https://github.com/hwdsl2/docker-ipsec-vpn-server/blob/master/docs/advanced-usage.md#enable-libreswan-logs
I enabled the logs, but it seems that they don't give any hints as to where the problem is. When the client is connecting to the server (the part that hangs), there is no log output; the connection request is only logged seconds before the client actually connects to the server.
I've also tried restarting my network, restarting the server, and connecting with the generated .mobileconfig
file (I had previously used a custom one with VPN On Demand). Any ideas as to what could be causing this problem?
@squishycat92 If there are no new log lines, the connection attempt did not reach the VPN server (Docker container). I can think of several possibilities:
Alternatively, you can try a different server to see whether the issue is with the VPN server or the client.
That's what I thought as well: the connection isn't reaching the container. However, I'm really unsure what could be causing this latency. The network is reachable, just with high latency; the VPN connection will eventually succeed, but it connects very slowly. I've checked my port forwarding and it works; other services such as SSH work perfectly with no issues. The Docker container also has internet access and is quite responsive.
With that being said, there is one scenario I can think of that might be causing this. I am running this container on a host whose primary DNS server is set to Cloudflare's IPv6 resolver. Would the presence of IPv6 on the host affect anything in this container? Maybe the connection is defaulting to IPv6 before falling back to IPv4 (my hostname has both IPv4 (A) and IPv6 (AAAA) records).
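One quick way to check that hypothesis is to query both record types for the hostname. This is a sketch assuming the `dig` utility is available; `your-hostname.duckdns.org` is a placeholder:

```shell
# If the hostname publishes a AAAA record, dual-stack clients
# may attempt IPv6 first and only fall back to IPv4 on failure.
dig +short A your-hostname.duckdns.org
dig +short AAAA your-hostname.duckdns.org
```

If the AAAA query returns an address but IPv6 connectivity to the server is broken, that would fit the fallback theory.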
@squishycat92 Yes, that is possible. Try temporarily disabling IPv6 on your Docker host and in your domain's DNS, and see if that resolves the issue. Let us know the result.
I recently found that there are a couple of underlying network issues with my mesh Wi-Fi system. Although I doubt that they have anything to do with this issue (my preliminary research shows that no packets are being dropped), I will try to fix them and check the container to see if it has resolved the issue.
@squishycat92 Sounds good. Please also check whether it is an IPv6 issue, as discussed earlier. I will close this issue, but you can continue to reply to this thread if you have new findings.
Hey @hwdsl2, I recently had more free time to investigate possible causes of the problem. It turns out that this issue has nothing to do with IPv6 on the host; rather, something seems to have changed when specifying a hostname. I'm currently connecting through my DuckDNS pointer (identifier.duckdns.org), which seems to be causing the latency when establishing a VPN tunnel. Commenting out that line in the environment file and connecting using my public IP restores the <5 second connection time I had previously been experiencing.
I'm not sure whether this issue is on the container's side or on DuckDNS's side. However, I have noticed that connecting to an SSH session using my DuckDNS pointer works just fine, so this may be a problem with the way the container handles dynamic DNS pointers? Thanks!
@squishycat92 Thanks for the update. Try opening a Bash shell inside the container, then look up your DuckDNS pointer and see whether there is any latency there. The DNS lookup should use your container's configured DNS servers. If there is no latency, the issue could be on DuckDNS's side. https://github.com/hwdsl2/docker-ipsec-vpn-server/blob/master/docs/advanced-usage.md#bash-shell-inside-container
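As a rough sketch of that check (container name `vpn` as used elsewhere in this thread; the hostname is a placeholder):

```shell
# Open a Bash shell inside the running container:
docker exec -it vpn bash

# Inside the container, time a lookup of the DuckDNS hostname.
# getent is part of the C library, so it should be present even
# if nslookup/dig are not installed in the image:
time getent hosts your-hostname.duckdns.org
```

Compare the timing against the same lookup run on the Docker host.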
The container has no problem reaching either the DNS server (~12 ms average ping) or my DuckDNS pointer (~1 ms average ping). In that case, I believe it would be an underlying problem with DuckDNS, correct?
@squishycat92 Yes, it could be an issue on DuckDNS's side.
I see, thank you so much for helping me!
@hwdsl2, unfortunately, this issue might need to be reopened. I recently got one of my friends to do a little more testing for me (as I don't own non-Apple devices), and this is what we found: when connecting over IKEv2 with a hostname specified, only recent Apple devices run into the latency issue (average 25-second connect time). However, non-Apple devices don't seem to run into this issue at all (average 2-second connect time). This happens only when a DuckDNS hostname is specified; connecting using the public IP doesn't cause these latency spikes. In our testing, the issue occurred on macOS 12.4, iPadOS 15.5, and iOS 14.0. It did not occur on Windows 10 21H2, Android 12, or iOS 12.5.5. Before this issue, the latency was not present on macOS 12.4 and iPadOS 15.5, suggesting that Apple is not to blame here.
Could you provide some insight into what is happening here? I don't think that Apple, DuckDNS, or the container is individually to blame; it could be an incompatibility between some of them.
@squishycat92 Hello! Thank you for doing the additional testing. I suspect that this is related to IPv6. If your VPN client tries to connect to the VPN server using its IPv6 (AAAA) DNS record, and IPv6 is not working on the client, then there will be a timeout before IPv4 is attempted.
Not sure what you mean in this part:
Before this issue, the latency was not present on macOS 12.4 and iPadOS 15.5, suggesting that Apple is not to blame here.
Unfortunately, I do not have additional insights to share for this issue. This is most likely a client side problem (or issue with DuckDNS) rather than an issue with the VPN server itself.
I'm not quite sure it's related to IPv6 at this point. Even with IPv6 disabled at the network level (the Docker host has no global IPv6 address), the issue still persists. Unsetting the IPv6 pointer on the DuckDNS web console doesn't fix this issue either.
Not sure what you mean in this part:
I mean that since the latency was previously not present on macOS 12.4 and iPadOS 15.5, Apple is probably not the root cause of the problem. However, the problem seems to affect only recent Apple devices; older devices, such as those on iOS 12.5.5, don't run into the issue.
This issue might be more complicated than I originally thought. It seems that some networks don't have this latency issue when connecting to the VPN container. However, on most networks, the issue is present.
@hwdsl2, I recently had time to further investigate this issue. I was able to find this serverfault thread, which gave me a basic idea of where to start debugging.
First, I had a look at Console (the system-wide log viewer on macOS). Specifically, I filtered the logs for networkextension, which returned results from many network processes. The process relevant to this issue is NEIKEv2Provider, the system process that is invoked whenever a request to connect to an IKEv2 VPN is processed.
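The same filtering can also be done from Terminal with the built-in `log` utility instead of Console.app; this is an untested sketch using Apple's unified-logging predicate syntax:

```shell
# Stream live messages from the IKEv2 provider while reproducing
# a slow connection attempt:
log stream --predicate 'process == "NEIKEv2Provider"' --info

# Or pull recent history after a failed attempt:
log show --last 10m --predicate 'process == "NEIKEv2Provider"'
```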
Using Console, I was able to trace what NEIKEv2Provider was actually doing while System Preferences just says "Connecting." I've attached the log messages relevant to this issue below:
default 16:55:56.168589-0700 NEIKEv2Provider <NEIKEv2Provider: Primary Tunnel (ifIndex 12)>: : handleDNSResolution (resolvedEndpoints count 2) (query status Complete)
default 16:55:56.177402-0700 NEIKEv2Provider NEIKEv2Transport: Adding client IKEv2Session[1, 0000000000000000-0000000000000000] with SPI ECB886CE11B4E98C on <NEIKEv2Transport> UDP ::.500 -> 2600:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXX.500
error 16:56:03.376641-0700 NEIKEv2Provider nw_socket_get_input_frames <private> recvmsg(fd 8, 9216 bytes) [61: Connection refused]
default 16:56:27.371690-0700 NEIKEv2Provider ChildSA[1, (null)-(null)] state Connecting -> Disconnected error (null) -> Error Domain=NEIKEv2ErrorDomain Code=3 "PeerDidNotRespond" UserInfo={NSLocalizedDescription=PeerDidNotRespond}
error 16:56:27.371332-0700 NEIKEv2Provider IKEv2Session[1, ECB886CE11B4E98C-0000000000000000] Failed to receive IKE SA Init reply (connect)
default 16:56:27.371484-0700 NEIKEv2Provider IKEv2IKESA[1.1, ECB886CE11B4E98C-0000000000000000] not changing state Disconnected nor error Error Domain=NEIKEv2ErrorDomain Code=3 "PeerDidNotRespond" UserInfo={NSLocalizedDescription=PeerDidNotRespond} -> Error Domain=NEIKEv2ErrorDomain Code=6 "PeerInvalidSyntax: Failed to receive IKE SA Init reply (connect)" UserInfo={NSLocalizedDescription=PeerInvalidSyntax: Failed to receive IKE SA Init reply (connect)}
Based on these messages, I was able to conclude that the VPN authentication request never reached the server, and that the connection specifically failed on an IPv6 address. Sure enough, the logs show that the connection only succeeded once it had fallen back to IPv4:
default 16:56:27.374002-0700 NEIKEv2Provider <NEIKEv2Provider: Primary Tunnel (ifIndex 12)>: : Stopping tunnel before attempting alternate server address
default 16:56:27.379153-0700 NEIKEv2Provider NEIKEv2Transport: Adding client IKEv2Session[2, 0000000000000000-0000000000000000] with SPI C55BE96736EDF2AB on <NEIKEv2Transport> UDP 0.0.0.0:500 -> 24.XXX.XXX.XXX:500
default 16:56:27.381026-0700 NEIKEv2Provider IKEv2Session[2, C55BE96736EDF2AB-0000000000000000] Initiating IKEv2 connection
default 16:56:28.293401-0700 NEIKEv2Provider IKEv2Session[2, C55BE96736EDF2AB-D5FC1EEA1FFECFA0] Completed connection (connect)
default 16:56:28.293487-0700 NEIKEv2Provider IKEv2IKESA[2.2, C55BE96736EDF2AB-D5FC1EEA1FFECFA0] state Connecting -> Connected
As a result, the fix for this issue is to prevent the device's DNS lookup from returning an IPv6 address. Since I don't have any real need for IPv6 on my DuckDNS pointer, I simply unset the address in DuckDNS's console (I'm not sure why unsetting it previously had no effect). If IPv6 is really needed on the DuckDNS pointer, users could instead add a firewall rule that rejects packets sent to the IPv6 address on UDP ports 500 and 4500, which would immediately fail the IPv6 attempt instead of waiting for the request to time out.
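For the firewall-rule idea, here is one possible sketch using pf on the macOS client (untested; the exact rule wording is an assumption, and edits to /etc/pf.conf should be made with care). `block return` answers outbound IKE/NAT-T packets over IPv6 with an immediate unreachable response, so the client fails over to IPv4 instead of waiting for a timeout:

```shell
# Append a rule rejecting IKE (UDP 500) and NAT-T (UDP 4500) over IPv6:
echo 'block return out quick inet6 proto udp from any to any port { 500, 4500 }' |
  sudo tee -a /etc/pf.conf

# Reload the ruleset and make sure pf is enabled:
sudo pfctl -f /etc/pf.conf
sudo pfctl -e
```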
With all that being said, I think that this issue can finally be considered solved. Thanks so much for bearing with me!
@squishycat92 Thank you for the update!
Hello, today when I was using my VPN container, I noticed that the connection time (the time to establish a tunnel between the client and the server) was quite long. I remember it being <5 seconds in the past, but now it frequently takes around 30 seconds and sometimes times out. This happens regardless of network or client, suggesting that it is an issue with the server or container itself. I am currently connecting using IKEv2, but I'd like to know how I can start debugging this issue. Running docker logs vpn doesn't return anything interesting, only a few startup messages. Thanks in advance!
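For anyone starting from the same point, a couple of standard `docker logs` flags make the output easier to correlate with connection attempts (`vpn` is the container name used in this thread):

```shell
# Follow the container's log output live, with timestamps,
# while reproducing a slow connection:
docker logs -f --timestamps vpn

# Or show only recent output, e.g. the last 5 minutes:
docker logs --since 5m vpn
```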