
Happy eyeballs implementation times out prematurely #54359

Closed notr1ch closed 1 month ago

notr1ch commented 1 month ago

Version

v20.12.2

Platform

Linux 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux

Subsystem

No response

What steps will reproduce the bug?

In an IPv4-only environment:

NODE_DEBUG=net node -e 'fetch ("http://ds-test.r-1.ch:8080/");'

autoSelectFamily: If set to true, it enables a family autodetection algorithm that loosely implements section 5 of RFC 8305. The all option passed to lookup is set to true and the socket attempts to connect to all obtained IPv6 and IPv4 addresses, in sequence, until a connection is established. The first returned AAAA address is tried first, then the first returned A address, then the second returned AAAA address and so on. Each connection attempt (but the last one) is given the amount of time specified by the autoSelectFamilyAttemptTimeout option before timing out and trying the next address. Ignored if the family option is not 0 or if localAddress is set. Connection errors are not emitted if at least one connection succeeds. If all connection attempts fail, a single AggregateError with all failed attempts is emitted. Default: net.getDefaultAutoSelectFamily()

autoSelectFamily defaults to on in Node.js v20.

When trying to establish a connection, this fails to account for a host that has no IPv6 connectivity attempting to connect to a dual-stack host with both A and AAAA records. The DNS server returns responses in the order A, AAAA. Node tries to connect to the A address with only a 250 ms timeout, which is insufficient for many real-world cases (cellular/satellite links, poorly connected ISPs, far-away servers, packet loss, etc.). This times out, so Node proceeds to the last candidate, which is supposed to get a longer timeout. However, the last candidate is an AAAA address, and since the host has no IPv6 connectivity it fails immediately, causing the overall connection to fail.
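
For reference, the same behavior is reproducible below the fetch layer. This is a minimal sketch using the documented net.connect options (the host is the dual-stack test server from the reproduction step; the 250 ms value mirrors the current default):

const net = require('node:net');

// Reproduce the per-attempt timeout at the net level. On an IPv4-only
// machine with >250 ms RTT to the A record, the IPv4 attempt times out,
// the IPv6 attempt fails immediately, and an AggregateError is emitted.
const socket = net.connect({
  host: 'ds-test.r-1.ch',
  port: 8080,
  autoSelectFamily: true,              // the Node.js v20 default
  autoSelectFamilyAttemptTimeout: 250, // the per-attempt timeout at issue
});

socket.on('connect', () => {
  console.log('connected via', socket.remoteAddress);
  socket.destroy();
});

socket.on('error', (err) => console.error(err));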

How often does it reproduce? Is there a required condition?

Reproduced 100% under the correct test conditions.

What is the expected behavior? Why is that the expected behavior?

Connection should be established successfully.

What do you see instead?

Connection times out with ETIMEDOUT AggregateError.

Additional information

# NODE_DEBUG=net node -e 'fetch ("https://ds.sydney.test-ipv6.com/ip/?callback=?&testdomain=test-ipv6.com&testname=test_ds");'
NET 639: pipe false undefined
NET 639: connect: find host ds.sydney.test-ipv6.com
NET 639: connect: dns options { family: undefined, hints: 32 }
NET 639: connect: autodetecting
NET 639: _read - n 16384 isConnecting? true hasHandle? true
NET 639: _read wait for connection
NET 639: connect/multiple: will try the following addresses [
  { address: '45.79.238.217', family: 4 },
  { address: '2400:8907::f03c:94ff:fed0:3fc3', family: 6 }
]
NET 639: connect/multiple: attempting to connect to 45.79.238.217:443 (addressType: 4)
NET 639: connect/multiple: setting the attempt timeout to 250 ms
NET 639: connect/multiple: connection to 45.79.238.217:443 timed out
NET 639: connect/multiple: attempting to connect to 2400:8907::f03c:94ff:fed0:3fc3:443 (addressType: 6)
NET 639: destroy
NET 639: close
NET 639: close handle
node:internal/deps/undici/undici:12618
    Error.captureStackTrace(err, this);
          ^

TypeError: fetch failed
    at node:internal/deps/undici/undici:12618:11
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
  cause: AggregateError [ETIMEDOUT]:
      at internalConnectMultiple (node:net:1116:18)
      at internalConnectMultiple (node:net:1184:5)
      at Timeout.internalConnectMultipleTimeout (node:net:1710:5)
      at listOnTimeout (node:internal/timers:575:11)
      at process.processTimers (node:internal/timers:514:7) {
    code: 'ETIMEDOUT',
    [errors]: [
      Error: connect ETIMEDOUT 45.79.238.217:443
          at createConnectionError (node:net:1646:14)
          at Timeout.internalConnectMultipleTimeout (node:net:1705:38)
          at listOnTimeout (node:internal/timers:575:11)
          at process.processTimers (node:internal/timers:514:7) {
        errno: -110,
        code: 'ETIMEDOUT',
        syscall: 'connect',
        address: '45.79.238.217',
        port: 443
      },
      Error: connect ENETUNREACH 2400:8907::f03c:94ff:fed0:3fc3:443 - Local (:::0)
          at internalConnectMultiple (node:net:1180:16)
          at Timeout.internalConnectMultipleTimeout (node:net:1710:5)
          at listOnTimeout (node:internal/timers:575:11)
          at process.processTimers (node:internal/timers:514:7) {
        errno: -101,
        code: 'ENETUNREACH',
        syscall: 'connect',
        address: '2400:8907::f03c:94ff:fed0:3fc3',
        port: 443
      }
    ]
  }
}
RedYetiDev commented 1 month ago

I'm unable to reproduce:

$  NODE_DEBUG=net node -e 'fetch ("https://ds.sydney.test-ipv6.com/ip/?callback=?&testdomain=test-ipv6.com&testname=test_ds");'
NET 86651: pipe false undefined
NET 86651: connect: find host ds.sydney.test-ipv6.com
NET 86651: connect: dns options { family: undefined, hints: 32 }
NET 86651: connect: autodetecting
NET 86651: _read - n 16384 isConnecting? true hasHandle? true
NET 86651: _read wait for connection
NET 86651: connect/multiple: will try the following addresses [
  { address: '45.79.238.217', family: 4 },
  { address: '2400:8907::f03c:94ff:fed0:3fc3', family: 6 }
]
NET 86651: connect/multiple: attempting to connect to 45.79.238.217:443 (addressType: 4)
NET 86651: connect/multiple: setting the attempt timeout to 250 ms
NET 86651: connect/multiple: connection attempt to 45.79.238.217:443 completed with status 0
NET 86651: afterConnect
NET 86651: _read - n 16384 isConnecting? false hasHandle? true
NET 86651: Socket._handle.readStart
NET 86651: _read - n 16384 isConnecting? false hasHandle? true

Note: I only have IPv4.

notr1ch commented 1 month ago

Try a server further away so you can't complete the IPv4 connection in 250ms. You can also try saturating your connection with a speed test to artificially increase your latency.

Some possible alternative servers to test with: https://test-ipv6.com/mirrors.html.en_US

notr1ch commented 1 month ago

I created a test server that should reproduce this reliably by artificially delaying IPv4 connections.

Try the following:

NODE_DEBUG=net node -e 'fetch ("http://ds-test.r-1.ch:8080/");'
RedYetiDev commented 1 month ago
$ NODE_DEBUG=net node -e 'fetch ("http://ds-test.r-1.ch:8080/");'
NET 261750: createConnection [
  {
    highWaterMark: 65536,
    path: undefined,
    localAddress: null,
    port: '8080',
    host: 'ds-test.r-1.ch'
  },
  null,
  [Symbol(normalizedArgs)]: true
]
NET 261750: pipe false undefined
NET 261750: connect: find host ds-test.r-1.ch
NET 261750: connect: dns options { family: undefined, hints: 32 }
NET 261750: connect: autodetecting
NET 261750: connect/multiple: will try the following addresses [
  { address: '142.4.216.103', family: 4 },
  { address: '2607:5300:60:1567:1183:d3a4:263c:fb36', family: 6 }
]
NET 261750: connect/multiple: attempting to connect to 142.4.216.103:8080 (addressType: 4)
NET 261750: connect/multiple: setting the attempt timeout to 250 ms
NET 261750: connect/multiple: connection to 142.4.216.103:8080 timed out
NET 261750: connect/multiple: attempting to connect to 2607:5300:60:1567:1183:d3a4:263c:fb36:8080 (addressType: 6)
NET 261750: destroy
NET 261750: close
NET 261750: close handle
node:internal/deps/undici/undici:13178
      Error.captureStackTrace(err);
            ^

TypeError: fetch failed
    at node:internal/deps/undici/undici:13178:13
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
  [cause]: AggregateError [ETIMEDOUT]: 
      at internalConnectMultiple (node:net:1116:18)
      at internalConnectMultiple (node:net:1184:5)
      at Timeout.internalConnectMultipleTimeout (node:net:1710:5)
      at listOnTimeout (node:internal/timers:583:11)
      at process.processTimers (node:internal/timers:519:7) {
    code: 'ETIMEDOUT',
    [errors]: [
      Error: connect ETIMEDOUT 142.4.216.103:8080
          at createConnectionError (node:net:1646:14)
          at Timeout.internalConnectMultipleTimeout (node:net:1705:38)
          at listOnTimeout (node:internal/timers:583:11)
          at process.processTimers (node:internal/timers:519:7) {
        errno: -110,
        code: 'ETIMEDOUT',
        syscall: 'connect',
        address: '142.4.216.103',
        port: 8080
      },
      Error: connect ENETUNREACH 2607:5300:60:1567:1183:d3a4:263c:fb36:8080 - Local (:::0)
          at internalConnectMultiple (node:net:1180:16)
          at Timeout.internalConnectMultipleTimeout (node:net:1710:5)
          at listOnTimeout (node:internal/timers:583:11)
          at process.processTimers (node:internal/timers:519:7) {
        errno: -101,
        code: 'ENETUNREACH',
        syscall: 'connect',
        address: '2607:5300:60:1567:1183:d3a4:263c:fb36',
        port: 8080
      }
    ]
  }
}

Node.js v22.6.0
marco-ippolito commented 1 month ago

cc @ShogunPanda

ShogunPanda commented 1 month ago

@notr1ch You can use --network-family-autoselection-attempt-timeout or net.setDefaultAutoSelectFamilyAttemptTimeout to adjust the default value. Does this solve the issue for you?
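
For anyone else landing here, the two knobs look like this in practice (the 1000 ms value is purely illustrative):

// At process boot, raise the default per-attempt timeout:
const net = require('node:net');
net.setDefaultAutoSelectFamilyAttemptTimeout(1000); // milliseconds

// Or, without changing the source, via the CLI flag:
//   node --network-family-autoselection-attempt-timeout=1000 app.js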

notr1ch commented 1 month ago

Yes, increasing the timeout or using --no-network-family-autoselection works around the issue; however, the default settings should establish a connection successfully. An ideal Happy Eyeballs implementation should not discard the first connection attempt after a timeout, but start a new one and race them both to completion.
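
For illustration, a minimal sketch of that racing approach (this is not Node's implementation; RFC 8305 details like address sorting and starting the next attempt early when one fails are omitted). Each attempt is staggered by a delay, but earlier attempts keep running until one wins:

const net = require('node:net');

function raceConnect(addresses, port, attemptDelay = 250) {
  return new Promise((resolve, reject) => {
    const sockets = [];
    let settled = false;
    let failures = 0;

    function start(i) {
      if (settled || i >= addresses.length) return;
      const socket = net.connect({ host: addresses[i], port });
      sockets.push(socket);

      socket.once('connect', () => {
        if (settled) return socket.destroy();
        settled = true;
        // Cancel the attempts that lost the race, as RFC 8305 requires.
        for (const s of sockets) if (s !== socket) s.destroy();
        resolve(socket);
      });

      socket.once('error', () => {
        // Only fail once every address has failed.
        if (!settled && ++failures === addresses.length) {
          settled = true;
          reject(new Error('all connection attempts failed'));
        }
      });

      // Stagger the next attempt without abandoning this one.
      setTimeout(() => start(i + 1), attemptDelay);
    }

    start(0);
  });
}

With the addresses from the log above, the slow IPv4 attempt would keep running while the unreachable IPv6 attempt fails instantly, and the connection would still succeed.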

ShogunPanda commented 1 month ago

Unfortunately we can't do that due to some limitations in our internal architecture. Closing this.

notr1ch commented 1 month ago

No modern language runtime should abandon a connection after 250 ms and fail in this fashion. A quick search of past closed issues and Google shows many cases of users running into this bug in real-world situations. If you can't implement proper Happy Eyeballs, then the default timeout needs adjusting to match real-world network conditions; I would suggest a minimum of 4-5 seconds to allow a TCP SYN to be retried, accounting for packet loss.

ShogunPanda commented 1 month ago

I think that is not going to work. Five seconds before failing a connection would stall your process. This would not be acceptable for people with problematic routing.

In Node we aim to provide good defaults for the vast majority of people. Excluding some early-stage problems in my first implementation of this feature, I think we are now in good shape. Moreover, since all you need is a simple API call when booting your process, I think we are fine the way we are now.

notr1ch commented 4 weeks ago

Defaults matter. I opened this issue because it's happening in our production environment on a 70ms RTT API with datacenter to datacenter connectivity after we upgraded to Node v20. The internet isn't perfect, packet loss happens - a single lost packet should not cause Node to fail the connection due to the connect timeout being shorter than TCP's retry mechanisms. Having to ask people using Node 20+ to add command line parameters to make connections function normally suggests the defaults are not good.

The default also causes issues for many poorly connected countries. China in particular has very high latency through the GFW, and countries like Australia have high RTT to endpoints hosted in North America and Europe. Entire types of connections will run into the issue with this default - satellite, in-flight Wi-Fi, old generation cellular networks, etc. will all regularly have over 250 ms of overhead and hit this bug. Not everyone in the world is on gigabit fiber connecting to us-east-1. And not every user of Node is a developer who will know to look through GitHub issues to find a command line parameter to fix it.

I think that is not going to work. Five seconds before failing a connection would stall your process. This would not be acceptable for people with problematic routing.

I also don't understand how the process would stall with a 5-second timeout - as Node is event-based, there should be no blocking socket operations that stall the process. As Node already uses a long timeout on the final candidate, process stalling would already be an issue with the current implementation.

Regardless of the above arguments, I still consider the behavior described a bug. Node intentionally uses a longer timeout on the final candidate in an attempt to ensure the connection succeeds even if all other candidates have failed. It just so happens that the final candidate is an address family that isn't reachable at all in an IPv4-only environment, so the intended longer timeout never even applies. If a better default timeout can't be implemented, I would suggest using the long timeout on the final candidate from each address family instead of just the final candidate overall. This would ensure both IPv4 and IPv6 endpoints can be reached in the event the 250 ms timeout is too short for an initial candidate to be viable.
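
As a sketch of what that per-family rule could look like (a hypothetical helper, not Node code), the long timeout goes to the last candidate of each family rather than the last candidate overall:

// Hypothetical: assign each attempt a timeout, giving the last candidate
// of each address family the long timeout.
function attemptTimeouts(addresses, shortTimeout, longTimeout) {
  const lastOfFamily = new Map();
  addresses.forEach((addr, i) => lastOfFamily.set(addr.family, i));
  return addresses.map((addr, i) =>
    lastOfFamily.get(addr.family) === i ? longTimeout : shortTimeout);
}

// With the [A, AAAA] list from the log above, both attempts are the last
// of their family, so the IPv4 attempt now gets the long timeout too:
attemptTimeouts(
  [{ address: '142.4.216.103', family: 4 },
   { address: '2607:5300:60:1567:1183:d3a4:263c:fb36', family: 6 }],
  250, 10_000,
); // => [10000, 10000]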

ShogunPanda commented 4 weeks ago

Defaults matter. I opened this issue because it's happening in our production environment on a 70ms RTT API with datacenter to datacenter connectivity after we upgraded to Node v20. The internet isn't perfect, packet loss happens - a single lost packet should not cause Node to fail the connection due to the connect timeout being shorter than TCP's retry mechanisms. Having to ask people using Node 20+ to add command line parameters to make connections function normally suggests the defaults are not good.

I agree that defaults matter. For people with a limited understanding of how networks and dual stack work, a 250 ms timeout should ensure a good developer experience. For other people who might have some specific needs, we provide a simple-to-use, single-call API that they can use to bypass the problem, or a CLI flag for when you don't control the source code. May I ask you why you can't use the API to increase the timeout at the boot of the process?

The default also causes issues for many poorly connected countries. China in particular has very high latency through the GFW, and countries like Australia have high RTT to endpoints hosted in North America and Europe. Entire types of connections will run into the issue with this default - satellite, in-flight Wi-Fi, old generation cellular networks, etc.

I totally agree on that. That's why we use a default timeout which is 2.5x higher than the minimum default that the reference standard suggests.

will all regularly have over 250 ms of overhead and hit this bug. Not everyone in the world is on gigabit fiber connecting to us-east-1.

Neither am I. I live in the middle of nowhere in Italy, definitely far away from any meaningful PoP.

And not every user of Node is a developer who will know to look through GitHub issues to find a command line parameter to fix it.

That's probably a good point. Shall we include a better error when all attempts fail? Are you willing to send a PR?

I also don't understand how the process would stall with a 5-second timeout - as Node is event-based, there should be no blocking socket operations that stall the process. As Node already uses a long timeout on the final candidate, process stalling would already be an issue with the current implementation.

System resources are finite. If each connection takes up to 5 seconds, you might run out of available sockets. The choice on the final attempt was made to minimize breakage with the previous status quo, but nobody said it's ideal.

Regardless of the above arguments, I still consider the behavior described a bug. Node intentionally uses a longer timeout on the final candidate in an attempt to ensure the connection succeeds even if all other candidates have failed. It just so happens that the final candidate is an address family that isn't reachable at all in an IPv4-only environment, so the intended longer timeout never even applies. If a better default timeout can't be implemented, I would suggest using the long timeout on the final candidate from each address family instead of just the final candidate overall. This would ensure both IPv4 and IPv6 endpoints can be reached in the event the 250 ms timeout is too short for an initial candidate to be viable.

You can provide a PR that increases the default attempt timeout; I wouldn't go higher than a second. That way we can reach a broader audience and see what people think about it.

ShogunPanda commented 4 weeks ago

I forgot: consider that increasing this timeout would be a semver-major change, so it will not land in Node before 23.0.0.

notr1ch commented 4 weeks ago

May I ask you why you can't use the API to increase the timeout at the boot of the process?

While it's not a problem for me personally to do this, it's more that I shouldn't need to in the first place. Opening a TCP connection should pretty much "just work" in any programming language, whether it's to a far away server or over a network that has some packet loss. I feel this is a kind of "gotcha" that developers really shouldn't have to worry or even know about.

I totally agree on that. That's why we use a default timeout which is 2.5x higher than the minimum default that the reference standard suggests.

While 100 ms is the minimum recommended delay between attempts, it is not intended to be a timeout. The current standard recommends 250 ms between connection attempts, which Node appears to follow; however, Node treats this value as a timeout on the attempt as a whole when it is intended to be a delay between starting attempts. Specifically, Node does not adhere to this part of the standard:

Starting a new connection attempt does not affect previous attempts, as multiple connection attempts may occur in parallel. Once one of the connection attempts succeeds (generally when the TCP handshake completes), all other connection attempts that have not yet succeeded SHOULD be canceled.

By closing the first attempt early, Node reduces the chances of a successful connection compared to a non-Happy-Eyeballs path, which goes against the entire idea of the standard.

System resources are finite. If each connection takes up to 5 seconds, you might run out of available sockets.

Node uses a 10-second connection timeout for hosts without mixed address families, so 5 seconds per attempt per address family seems reasonable to me. The full 5 seconds would also be a worst-case situation where the IPv4 host is actually down or otherwise unreachable; most hosts would be expected to respond within 1000 ms, depending on network conditions.

I'll try and get some more eyes on this issue for additional thoughts and feedback.

ShogunPanda commented 4 weeks ago

While it's not a problem for me personally to do this, it's more that I shouldn't need to in the first place. Opening a TCP connection should pretty much "just work" in any programming language, whether it's to a far away server or over a network that has some packet loss. I feel this is a kind of "gotcha" that developers really shouldn't have to worry or even know about.

Well, while I agree on the principle, there have not been many reports like this, so I guess the current setup is fine for most people.

While 100 ms is the minimum recommended delay between attempts, it is not intended to be a timeout. The current standard recommends 250 ms between connection attempts, which Node appears to follow; however, Node treats this value as a timeout on the attempt as a whole when it is intended to be a delay between starting attempts. Specifically, Node does not adhere to this part of the standard:

Yes, unfortunately this is a known technical limitation of Node and its implementation. The reason is that there is no way for us to cancel the other in-flight attempts once a connection succeeds, so we chose to only have one attempt running at any time.

By closing the first attempt early, Node reduces the chances of a successful connection compared to a non-Happy-Eyeballs path, which goes against the entire idea of the standard.

System resources are finite. If each connection takes up to 5 seconds, you might run out of available sockets.

Node uses a 10-second connection timeout for hosts without mixed address families, so 5 seconds per attempt per address family seems reasonable to me. The full 5 seconds would also be a worst-case situation where the IPv4 host is actually down or otherwise unreachable; most hosts would be expected to respond within 1000 ms, depending on network conditions.

The problem is that for high-volume systems, a 5-second timeout would vastly limit the number of sockets you can open per second.

I'll try and get some more eyes on this issue for additional thoughts and feedback.

Absolutely. I'm more than happy to change my opinion if more people agree.

nwalters512 commented 1 week ago

This has been discussed elsewhere before, e.g. https://github.com/nodejs/node/issues/52216 and https://github.com/orgs/nodejs/discussions/48028.

I very much agree that Node should provide sensible defaults; I can't see how the current defaults are sensible. As a possibly-helpful datapoint, curl and Python are both able to fetch from https://nodejs.org/download/release/v20.17.0/node-v20.17.0-headers.tar.gz on a machine with high latency, and both (AFAICT) have Happy Eyeballs enabled. That same request fails under Node with the error described on these two threads.

I don't know enough about curl/Python/other runtimes to know what they're doing differently, but it's clear that better default behavior is possible, and I hope that it can work its way into Node, even if it has to wait for a future major version.