Closed notr1ch closed 1 month ago
I'm unable to reproduce:
$ NODE_DEBUG=net node -e 'fetch ("https://ds.sydney.test-ipv6.com/ip/?callback=?&testdomain=test-ipv6.com&testname=test_ds");'
NET 86651: pipe false undefined
NET 86651: connect: find host ds.sydney.test-ipv6.com
NET 86651: connect: dns options { family: undefined, hints: 32 }
NET 86651: connect: autodetecting
NET 86651: _read - n 16384 isConnecting? true hasHandle? true
NET 86651: _read wait for connection
NET 86651: connect/multiple: will try the following addresses [
{ address: '45.79.238.217', family: 4 },
{ address: '2400:8907::f03c:94ff:fed0:3fc3', family: 6 }
]
NET 86651: connect/multiple: attempting to connect to 45.79.238.217:443 (addressType: 4)
NET 86651: connect/multiple: setting the attempt timeout to 250 ms
NET 86651: connect/multiple: connection attempt to 45.79.238.217:443 completed with status 0
NET 86651: afterConnect
NET 86651: _read - n 16384 isConnecting? false hasHandle? true
NET 86651: Socket._handle.readStart
NET 86651: _read - n 16384 isConnecting? false hasHandle? true
Note: I only have IPv4.
Try a server further away so you can't complete the IPv4 connection in 250ms. You can also try saturating your connection with a speed test to artificially increase your latency.
Some possible alternative servers to test with: https://test-ipv6.com/mirrors.html.en_US
I created a test server that should reproduce this reliably by artificially delaying IPv4 connections.
Try the following:
NODE_DEBUG=net node -e 'fetch ("http://ds-test.r-1.ch:8080/");'
$ NODE_DEBUG=net node -e 'fetch ("http://ds-test.r-1.ch:8080/");'
NET 261750: createConnection [
{
highWaterMark: 65536,
path: undefined,
localAddress: null,
port: '8080',
host: 'ds-test.r-1.ch'
},
null,
[Symbol(normalizedArgs)]: true
]
NET 261750: pipe false undefined
NET 261750: connect: find host ds-test.r-1.ch
NET 261750: connect: dns options { family: undefined, hints: 32 }
NET 261750: connect: autodetecting
NET 261750: connect/multiple: will try the following addresses [
{ address: '142.4.216.103', family: 4 },
{ address: '2607:5300:60:1567:1183:d3a4:263c:fb36', family: 6 }
]
NET 261750: connect/multiple: attempting to connect to 142.4.216.103:8080 (addressType: 4)
NET 261750: connect/multiple: setting the attempt timeout to 250 ms
NET 261750: connect/multiple: connection to 142.4.216.103:8080 timed out
NET 261750: connect/multiple: attempting to connect to 2607:5300:60:1567:1183:d3a4:263c:fb36:8080 (addressType: 6)
NET 261750: destroy
NET 261750: close
NET 261750: close handle
node:internal/deps/undici/undici:13178
Error.captureStackTrace(err);
^
TypeError: fetch failed
at node:internal/deps/undici/undici:13178:13
at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
[cause]: AggregateError [ETIMEDOUT]:
at internalConnectMultiple (node:net:1116:18)
at internalConnectMultiple (node:net:1184:5)
at Timeout.internalConnectMultipleTimeout (node:net:1710:5)
at listOnTimeout (node:internal/timers:583:11)
at process.processTimers (node:internal/timers:519:7) {
code: 'ETIMEDOUT',
[errors]: [
Error: connect ETIMEDOUT 142.4.216.103:8080
at createConnectionError (node:net:1646:14)
at Timeout.internalConnectMultipleTimeout (node:net:1705:38)
at listOnTimeout (node:internal/timers:583:11)
at process.processTimers (node:internal/timers:519:7) {
errno: -110,
code: 'ETIMEDOUT',
syscall: 'connect',
address: '142.4.216.103',
port: 8080
},
Error: connect ENETUNREACH 2607:5300:60:1567:1183:d3a4:263c:fb36:8080 - Local (:::0)
at internalConnectMultiple (node:net:1180:16)
at Timeout.internalConnectMultipleTimeout (node:net:1710:5)
at listOnTimeout (node:internal/timers:583:11)
at process.processTimers (node:internal/timers:519:7) {
errno: -101,
code: 'ENETUNREACH',
syscall: 'connect',
address: '2607:5300:60:1567:1183:d3a4:263c:fb36',
port: 8080
}
]
}
}
Node.js v22.6.0
cc @ShogunPanda
@notr1ch You can use --network-family-autoselection-attempt-timeout or net.setDefaultAutoSelectFamilyAttemptTimeout to adjust the default value. Does this solve the issue for you?
Yes, increasing the timeout or using --no-network-family-autoselection works around the issue; however, the default settings should establish a connection successfully. An ideal happy eyeballs implementation would not discard the first connection attempt after a timeout, but would start a new one and race both to completion.
Unfortunately we can't do that due to limitations in our internal architecture. Closing this.
No modern language runtime should abandon a connection after 250ms and fail in this fashion. A quick search of past closed issues and Google shows many cases of users running into this bug in real-world situations. If you can't implement proper happy eyeballs, then the default timeout needs adjusting to match real-world network conditions - I would suggest a minimum of 4-5 seconds to allow for a TCP SYN to be retried, to account for packet loss.
I think that is not going to work. Waiting 5 seconds before failing a connection would stall your process. That would not be acceptable for people with problematic routing.
In Node we aim to provide good defaults for the vast majority of people. Excluding some early-stage problems in my first implementation of this feature, I think we are now in good shape. Moreover, since all you need is a simple API call when booting your process, I think we are fine the way we are now.
Defaults matter. I opened this issue because it's happening in our production environment on a 70ms RTT API with datacenter to datacenter connectivity after we upgraded to Node v20. The internet isn't perfect, packet loss happens - a single lost packet should not cause Node to fail the connection due to the connect timeout being shorter than TCP's retry mechanisms. Having to ask people using Node 20+ to add command line parameters to make connections function normally suggests the defaults are not good.
The default also causes issues for many poorly connected countries. China in particular has very high latency through the GFW, and countries like Australia have high RTT to endpoints hosted in North America and Europe. Entire types of connections will run into the issue with this default - satellite, in-flight Wi-Fi, old-generation cellular networks, etc. will all regularly have over 250ms of overhead and hit this bug. Not everyone in the world is on gigabit fiber connecting to us-east-1. And not every user of Node is a developer who will know to look through GitHub issues to find a command line parameter to fix it.
I think that is not going to work. Waiting 5 seconds before failing a connection would stall your process. That would not be acceptable for people with problematic routing.
I also don't understand how the process would stall with a 5 second timeout - as Node is event-based, there should be no blocking socket operations that stall the process. As Node already uses a long timeout on the final candidate, process stalling should have already been an issue with the current implementation.
Regardless of the above arguments, I still consider the behavior described a bug. Node intentionally uses a longer timeout on the final candidate in an attempt to ensure the connection succeeds even if all other candidates have failed. It just so happens that the final candidate is an address family that isn't reachable at all in an IPv4-only environment so the intended longer timeout never even gets attempted. If a better default timeout can't be implemented, I would suggest using the long timeout on the final candidate from each address family instead of just the final candidate overall. This would ensure both IPv4 and IPv6 endpoints can be reached in the event the 250ms timeout is too short for an initial candidate to be viable.
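The per-family suggestion could be sketched as a pure helper (hypothetical; `attemptTimeouts` is not a real Node API, and the 250/10000 ms values mirror the defaults discussed in this thread):

```javascript
// Assign the long "final" timeout to the last candidate of EACH address
// family, instead of only the last candidate overall. With the candidate
// list from the report ([IPv4, IPv6]), the IPv4 attempt would then get
// the long timeout too, instead of failing after 250 ms.
function attemptTimeouts(addresses, shortMs = 250, longMs = 10_000) {
  const lastOfFamily = new Map(); // family -> index of its final candidate
  addresses.forEach((a, i) => lastOfFamily.set(a.family, i));
  const finals = new Set(lastOfFamily.values());
  return addresses.map((_, i) => (finals.has(i) ? longMs : shortMs));
}
```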
Defaults matter. I opened this issue because it's happening in our production environment on a 70ms RTT API with datacenter to datacenter connectivity after we upgraded to Node v20. The internet isn't perfect, packet loss happens - a single lost packet should not cause Node to fail the connection due to the connect timeout being shorter than TCP's retry mechanisms. Having to ask people using Node 20+ to add command line parameters to make connections function normally suggests the defaults are not good.
I agree that defaults matter. For people with a limited understanding of how networks and dual stack work, the 250ms timeout should ensure a good developer experience. For other people who have specific needs, we provide a simple single-call API they can use to bypass the problem, or a CLI flag when you don't control the source code. May I ask why you can't use the API to increase the timeout at the boot of the process?
The default also causes issues for many poorly connected countries, China in particular has very high latency through the GFW and countries like Australia have high RTT to endpoints hosted in North America and Europe. Entire types of connections will run into the issue with this default - satellite, in-flight Wi-Fi, old generation cellular networks, etc.
I totally agree on that. That's why we use a default timeout which is 2.5× higher than the minimum default that the reference standard suggests.
will all regularly have over 250ms of overhead and hit this bug. Not everyone in the world is on gigabit fiber connecting to us-east-1.
Neither am I. I live in the middle of nowhere in Italy, definitely far from any meaningful PoP.
And not every user of Node is a developer who will know to look through GitHub issues to find a command line parameter to fix it.
That's probably a good point. Shall we include a better error when all attempts fail? Are you willing to send a PR?
I also don't understand how the process would stall with a 5 second timeout - as Node is event based, there should be no blocking socket operations that stall the process. As Node already uses a long timeout on the final candidate, process stalling should have already been an issue with the current implementation.
System resources are finite. If each connection takes up to 5 seconds, you might run out of available sockets. The choice about the final attempt was made to minimize breakage with the previous status quo, but nobody said it's ideal.
Regardless of the above arguments, I still consider the behavior described a bug. Node intentionally uses a longer timeout on the final candidate in an attempt to ensure the connection succeeds even if all other candidates have failed. It just so happens that the final candidate is an address family that isn't reachable at all in an IPv4-only environment so the intended longer timeout never even gets attempted. If a better default timeout can't be implemented, I would suggest using the long timeout on the final candidate from each address family instead of just the final candidate overall. This would ensure both IPv4 and IPv6 endpoints can be reached in the event the 250ms timeout is too short for an initial candidate to be viable.
You can provide a PR which will increase the default attempt timeout. I wouldn't go higher than a second. So we can have a broader audience and see what people think about it.
I forgot: consider that increasing this timeout will be a semver-major change so it will not get into Node before 23.0.0.
May I ask you why you can't use the API to increase the timeout at the boot of the process?
While it's not a problem for me personally to do this, it's more that I shouldn't need to in the first place. Opening a TCP connection should pretty much "just work" in any programming language, whether it's to a far away server or over a network that has some packet loss. I feel this is a kind of "gotcha" that developers really shouldn't have to worry or even know about.
I totally agree on that. That's why we use a default timeout which is 2.5× higher than the minimum default that the reference standard suggests.
While 100ms is the minimum recommended delay between attempts, it is not intended to be a timeout. The current standard recommends 250ms between connection attempts, which Node appears to follow; however, Node treats this value as a timeout on the attempt as a whole, when it is intended to be a delay between starting attempts. Specifically, Node does not adhere to this part of the standard:
Starting a new connection attempt does not affect previous attempts, as multiple connection attempts may occur in parallel. Once one of the connection attempts succeeds (generally when the TCP handshake completes), all other connection attempts that have not yet succeeded SHOULD be canceled.
By closing the first attempt early, it reduces the chances of a successful connection compared to a non-happy-eyeballs path which goes against the entire idea of the standard.
System resources are finite. If each connection takes up to 5 seconds, you might run out of available sockets.
Node uses a 10 second connection timeout for hosts without mixed address families, so 5 seconds per attempt per address family seems reasonable to me. The full 5 seconds would also be a worst case situation where the IPv4 host is actually down or otherwise unreachable, it would be expected that most hosts would respond within 1000ms depending on network conditions.
I'll try and get some more eyes on this issue for additional thoughts and feedback.
While it's not a problem for me personally to do this, it's more that I shouldn't need to in the first place. Opening a TCP connection should pretty much "just work" in any programming language, whether it's to a far away server or over a network that has some packet loss. I feel this is a kind of "gotcha" that developers really shouldn't have to worry or even know about.
Well, while I agree in principle, there have not been many reports like this, so I guess the current setup is fine for most people.
While 100ms is the minimum recommended delay between attempts, it is not intended to be a timeout. The current standard recommends 250ms between connection attempts, which Node appears to follow; however, Node treats this value as a timeout on the attempt as a whole, when it is intended to be a delay between starting attempts. Specifically, Node does not adhere to this part of the standard:
Yes, unfortunately this is a known technical limitation of Node and its implementation. The reason is that there is no way for us to cancel a pending connection attempt when another succeeds, so we chose to only have one attempt running at any time.
By closing the first attempt early, it reduces the chances of a successful connection compared to a non-happy-eyeballs path which goes against the entire idea of the standard.
System resources are finite. If each connection takes up to 5 seconds, you might run out of available sockets.
Node uses a 10 second connection timeout for hosts without mixed address families, so 5 seconds per attempt per address family seems reasonable to me. The full 5 seconds would also be a worst case situation where the IPv4 host is actually down or otherwise unreachable, it would be expected that most hosts would respond within 1000ms depending on network conditions.
The problem is that for high-volume systems, a 5-second timeout would vastly limit the number of sockets you can open per second.
I'll try and get some more eyes on this issue for additional thoughts and feedback.
Absolutely. I'm more than happy to change my opinion if more people agree.
This has been discussed elsewhere before, e.g. https://github.com/nodejs/node/issues/52216 and https://github.com/orgs/nodejs/discussions/48028.
I very much agree that Node should provide sensible defaults; I can't see how the current defaults are sensible. As a possibly-helpful datapoint, curl and Python are both able to fetch from https://nodejs.org/download/release/v20.17.0/node-v20.17.0-headers.tar.gz
on a machine with high latency, and both (AFAICT) have Happy Eyeballs enabled. That same request fails under Node with the error described on these two threads.
I don't know enough about curl/Python/other runtimes to know what they're doing differently, but it's clear that better default behavior is possible, and I hope that it can work its way into Node, even if it has to wait for a future major version.
Version
v20.12.2
Platform
Subsystem
No response
What steps will reproduce the bug?
In an IPv4-only environment:
autoSelectFamily defaults to on in Node.js v20.
When trying to establish a connection, this fails to account for a host that has no IPv6 connectivity attempting to connect to a dual-stack host with both A and AAAA records. The DNS server returns responses in the order A, AAAA. Node tries to connect to the A address with only a 250ms timeout, insufficient for many real-world cases (cellular/satellite links, poorly connected ISPs, far-away servers, packet loss, etc.). This times out, so Node proceeds to the last candidate, which is supposed to have a longer timeout; however, the last candidate is an AAAA address and the host has no IPv6 connectivity, so it immediately fails, causing the overall connection to fail.
How often does it reproduce? Is there a required condition?
Reproduced 100% under the correct test conditions.
What is the expected behavior? Why is that the expected behavior?
Connection should be established successfully.
What do you see instead?
Connection times out with an ETIMEDOUT AggregateError.
Additional information