Closed - DevasiaThomas closed this 2 months ago
I've read your description of the problem but I don't think it's actionable without a way to reproduce it that doesn't depend on third-party modules like axios. Can you add a test case?
@bnoordhuis I'll see if I can come up with something. Since this happened across multiple frameworks, this should be reproducible using just the core stuff. Gimme this weekend :) I was hoping you'd check the TCP FIN handling part of the code and see how the socket is treated on the client side on receiving a FIN. If it doesn't remove the connection from the pool before doing anything else - that would be the problem.
Node handles it when ECONNRESET happens when the socket is inside the socket pool but there's a fundamental (and inevitable) race condition when that happens just when the socket is being pulled from the pool.
Node could theoretically use another socket if the current socket fails on first write but without a circuit breaker or other fail-safes that creates a DoS-like scenario when the server immediately closes incoming connections. Node would basically busy-loop trying to send the request.
I understand the inevitable race that can happen, and I agree you have to retry in such a scenario, and it's best that the application takes care of that. But based on my TCP dump, the server sends a FIN, which causes the client to send an ACK (the socket on the client transitions to CLOSE_WAIT). Whatever happens after the ACK is sent causes the socket to get pushed to the top of the connection pool (which is why the very next request goes on it). So, there is time for the LIFO algorithm or underlying code to push the connection to the top of the pool - the same code can remove it from the pool instead of pushing it to the top when the ACK is for a FIN. This is what I wanted to highlight. I am working on a reproducible example. I hope I am successful with it.
@bnoordhuis So, I don't have guaranteed replicability - but it happens over time (in 5 of the runs I got it within the first 5 minutes; in 2 other runs I had to leave it overnight). I'm running them on different physical boxes. I couldn't get `resetAndDestroy` to actually send a TCP RST, but the prints happen and socket hangups get reported on the client end. If you can figure out how to send the RST, you'll get the ECONNRESET as reported above. The code tries to mimic what happened in my firm's environment.
@nodejs/http can you take a look?
To surface the problem more frequently, I tried to randomly close ~1% of the connections after the response was ended, so this

```js
res.socket.requestInFlight = false;
if (res.socket.closeAfterResponse) {
```

changed to this

```js
res.socket.requestInFlight = false;
let random = Math.random();
if (random > 0.98 || res.socket.closeAfterResponse) {
```

You can change the number to increase the percentage of connections closed by the server. This is what the TCP dump will look like (if my Node.js server would actually send the reset).
The cool thing I saw in the dump was that these connections were idle till the FIN showed up. Which is why I'm emphasizing that there is some logic causing an outgoing ACK (not FIN,ACK) to an incoming FIN (on my Node.js client) to make the connection appear active, which further causes the LIFO scheduling (this happens in FIFO as well) to push that connection to the top of the pool instead of marking it as closed from the pool's perspective.
Haven't looked into this in depth but I think this is expected. There is a race condition in the http/1.1 spec for keep alive requests. This can be mostly avoided by applying the keep alive hints that the server provides.
Undici does that and I would recommend you switch to that.
> Haven't looked into this in depth but I think this is expected. There is a race condition in the http/1.1 spec for keep alive requests. This can be mostly avoided by applying the keep alive hints that the server provides.
I'm sorry, what do you mean by hints? Is the hint part of the spec, or implemented by Node.js servers? I don't know about this, so if you could point me to the info regarding it, I would greatly appreciate it.
> Undici does that and I would recommend you switch to that.
Can't - it's not that simple. Our REST API interactions are all generated code based on things like OpenAPI and Protobuf schemas. A quick Google search does not find me a generator that uses Undici. If we had to switch, we'd probably choose another language at that point. My colleagues have spent quite a bit of time on this and we don't believe it's the race mentioned in the spec regarding keep-alive. (We haven't faced this problem in Rust and Python implementations of core HTTP libs.)
Based on the tcpdump timelines and the ordering of which connection gets the next request, it's pretty clear that Node.js has the time to apply the LIFO algorithm to place the socket that just sent the ACK to the incoming FIN at the top of the pool. Which also means you have the same amount of time to take that socket out of the connection pool. I believe this is fixable. If you can show me where the code that handles the connection pool management is, I can see if I can submit a PR.
MDN calls it parameters. See the timeout parameter. https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Keep-Alive
The client should not make any request on a connection that has been idle more than timeout duration (minus some threshold).
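For illustration, "applying the hint" could look roughly like the sketch below. This is not undici's actual implementation; `idleDeadlineMs` and `mayReuse` are made-up helper names. The idea is to parse the server's `Keep-Alive: timeout=...` response header and refuse to reuse any pooled socket that has been idle longer than that timeout minus a safety threshold.

```js
// Sketch: honor the server's Keep-Alive timeout hint on the client.
// Given a response header like "Keep-Alive: timeout=5, max=1000",
// compute how long the socket may sit idle before it must not be
// reused (minus a safety threshold).
function idleDeadlineMs(keepAliveHeader, thresholdMs = 1000) {
  const match = /timeout=(\d+)/.exec(keepAliveHeader || '');
  if (!match) return null; // no hint: fall back to agent defaults
  return Number(match[1]) * 1000 - thresholdMs;
}

// A pool could then evict sockets whose idle time exceeds the deadline:
function mayReuse(socketIdleMs, keepAliveHeader) {
  const deadline = idleDeadlineMs(keepAliveHeader);
  return deadline === null || socketIdleMs < deadline;
}
```

This narrows the race window but cannot eliminate it: the server may still close early, which is why the thread also discusses eviction and retry.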
> The client should not make any request on a connection that has been idle more than timeout duration (minus some threshold).
@ronag , timeout is not the problem here. Neither is the concept of keep alive. Please have a deep look into my issue. Thanks :)
@mcollina I think this is something we've resolved in undici. This kind of falls under http core client maintenance...
@DevasiaThomas I believe this is just the known issue as mentioned by @ronag and @bnoordhuis. For your curiosity, as to why
> an outgoing ACK (not FIN,ACK) to an incoming FIN (on my nodejs client) to appear as an active connection which further causes the lifo scheduling (this happens in fifo as well) to push that connection to the top of the pool
it's because your server code sends an HTTP response indicating keep-alive, followed by a TCP FIN:
```js
async function handleRequest(req, res) {
  // ...
  res.socket.requestInFlight = false;
  if (res.socket.closeAfterResponse) {
    // Send FIN
    // console.log('Close after response' + res.socket.remotePort);
    res.socket.end();
  }
  // ...
}
```
The Node.js client does not take the TCP FIN as a signal against reusing the connection, but it does take the HTTP response's keep-alive indication into account.
FYI, the TCP FIN, in Node.js terms, is the 'end' event of a socket. I don't see any code in the http agent that makes use of this event to decide whether a socket can be reused.
> FYI, the TCP FIN, in Node.js terms, is the 'end' event of a socket. I don't see any code in the http agent that makes use of this event to decide whether a socket can be reused.
@ywave620 There are 2 behaviors on an incoming FIN,ACK from the server, based on my TCP dump.
Node.js Client sends FIN, ACK --> This has no problem, works fine
This is the default behavior of a Node.js client/server, i.e. Node.js automatically closes any half-open connection. This behavior can be overridden by setting `allowHalfOpen` to `true` in the options passed to the `http` or `net` module when creating outbound requests/connections or creating a server.
Node.js Client sends ACK --> Idk what code path makes this happen, but this also causes the socket to go to the top of the connection pool, because the very next request goes on this socket.
This is worth further digging. Did you spot the Node.js client sending a FIN later in this weird scenario?
Also note that `allowHalfOpen` is not officially documented in the `http` module. Is it intended, or did we miss it? @ronag
> Node.js Client sends ACK --> Idk what code path makes this happen, but this also causes the socket to go to the top of the connection pool, because the very next request goes on this socket.
> This is worth further digging. Did you spot the Node.js client sending a FIN later in this weird scenario?
Nope. Since the client sends the next request right away, the server sends a reset; the client has no opportunity to send a FIN.
I haven't passed the `allowHalfOpen` option anywhere, so I can't comment on that.
@ywave620 I just checked the `http.createServer` call path real quick, and it seems that the Server creates a `net.Server` with `allowHalfOpen` always set to `true`. Please confirm. But my issue isn't with the server but the client. I'll check that as well.
EDIT: idk if that is only for the listening socket or applies to all incoming sockets for the server.
`allowHalfOpen` applies to either client or server. It makes sense at the `net` abstraction level; however, it is copied into the options passed to `net` by `http`. E.g. `net.createConnection` inherits the option from `http.request`, which is why `http.request({allowHalfOpen: true})` works. Although I don't know whether this is the desired usage - in other words, should it be considered a backdoor?
I think your best bet is to either disable keep alive or switch to undici.
> I think your best bet is to either disable keep alive or switch to undici.
You've mentioned this already. Neither of which I can do. However, I'm fairly certain there's a bug here - please let it get fixed.
Would you like to send a pull request to resolve this issue? Investigating this problem would likely require multiple days of work, and I'm not convinced either @ronag, myself or other members of @nodejs/http have the time to investigate.
@mcollina Sure I can try :)
> there is some logic causing an outgoing ACK (not FIN,ACK) to an incoming FIN (on my nodejs client) to appear as an active connection which further causes the lifo scheduling (this happens in fifo as well) to push that connection to the top of the pool instead of marking it as closed from the pool's perspective
Just an observation from reading lib/_http_agent.js but I don't think that's possible. Sockets are removed from the FIFO/LIFO queue but sockets in the queue are never reordered.
(It's the `.freeSockets` property, keyed by host.)
What I think can happen is this: destroyed sockets aren't immediately removed from the queue; they're simply discarded (in [FL]IFO order) when the agent looks for a live socket, by checking `socket.destroyed === true`.
Assume for simplicity that all sockets expire at the same time, e.g., because the network went down. Node gets notified by the operating system on a per-socket basis. Crucially, node may not get those notifications all at the same time.
For the next request, the agent discards the ones it knows are dead. It picks the first one it believes is alive - except it isn't, we just haven't received the notification yet. It's basically the race condition I mentioned earlier.
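In pseudocode, the lazy eviction described above looks something like this. This is a simplified sketch, not the actual lib/_http_agent.js code, and `pickFreeSocket` is a made-up name.

```js
// Sketch of lazy eviction: destroyed sockets stay in the free list and
// are only discarded when the agent goes looking for a live one.
function pickFreeSocket(freeSockets, scheduling = 'lifo') {
  while (freeSockets.length > 0) {
    const socket = scheduling === 'lifo'
      ? freeSockets.pop()     // take from the tail
      : freeSockets.shift();  // take from the head
    if (!socket.destroyed) {
      return socket; // believed alive -- but the FIN/RST notification
                     // for it may simply not have arrived yet
    }
    // destroyed: drop it and keep looking
  }
  return null; // caller must create a fresh connection
}
```

The race lives in the comment: a socket with `destroyed === false` can still have a FIN already sitting in the kernel buffer.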
@bnoordhuis Thank you for looking into it :) I just reached that part of the code.
Looking for your thoughts on the below:
1. In `addRequest`, the code that clears stale free sockets does not follow the direction of the scheduling algorithm. Any particular reason for this? If it was `lifo`, you would get a socket from the other end, which was not checked for staleness.
2. In places like `addRequest`, on `'free'`, and in `removeSocket`, instead of just `destroyed` or `!writable`, could we check for half-closed scenarios in case `allowHalfOpen` is not set -> my case, where the server sent the FIN, so it's the readable end that was closed but not (yet) the writable one.
> In `addRequest`, the code that clears stale free sockets does not follow the direction of the scheduling algorithm. Any particular reason for this? If it was `lifo`, you would get a socket from the other end, which was not checked for staleness.
No specific need. That block executes synchronously, so there is no possible race condition there.
Well, I guess that is a bug, if perhaps not the bug. With `scheduling === 'lifo'`, an already destroyed socket can get popped from the tail, because the cleanup code only shifts them from the head of the list.
(As a mechanical-sympathy aside: shifting the head isn't very efficient; it's better to pop the tail.)
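The mismatch can be shown in miniature with plain objects standing in for sockets (`cleanupFromHead` is a made-up name mirroring the head-shifting cleanup described above):

```js
// The cleanup pass shifts destroyed sockets off the *head* of the free
// list, while 'lifo' scheduling pops from the *tail* -- so a destroyed
// socket at the tail survives cleanup and is handed to the next request.
function cleanupFromHead(freeSockets) {
  while (freeSockets.length > 0 && freeSockets[0].destroyed) {
    freeSockets.shift(); // only ever inspects the head
  }
}

const freeSockets = [
  { id: 1, destroyed: false },
  { id: 2, destroyed: true }, // dead socket at the tail
];
cleanupFromHead(freeSockets); // head is live, so nothing is removed
const next = freeSockets.pop(); // lifo: tail first -> the dead socket
```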
> No specific need. That block executes synchronously, so there is no possible race condition there.
@mcollina were you referring to point 2?
@DevasiaThomas I was wrong, check the comment from @bnoordhuis.
@mcollina no worries, conflicting replies kinda confused me and I thought you were talking about point 2.
@bnoordhuis and @mcollina any thoughts on point 2? `destroyed === true` requires both ends to be closed, whereas `!writable` checks don't care if the connection was closed from the other side (which is the readable end - my case).
Side note: I did not pass `allowHalfOpen: true` in options, but @ywave620 surfaced that it is possible to do so.
If y'all agree, I'll get to working on both of them and submit a PR as soon as I can.
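For point 2, the stricter liveness check might look something like this sketch (`isReusable` is a made-up name; `socket.readableEnded` is the public flag that becomes `true` once the 'end' event has fired, i.e. the peer's FIN has been consumed):

```js
// Sketch of the stricter check proposed in point 2: treat a socket as
// unusable for the pool as soon as its readable side has ended (server
// sent FIN), not only once both sides are down and `destroyed` is true.
function isReusable(socket) {
  return !socket.destroyed &&
         socket.writable &&
         !socket.readableEnded; // half-closed by the peer: evict it
}
```

As noted in the later timeline logs, the catch is that none of these flags may be set yet at the moment the socket is freed, which limits how much this check alone can help.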
Let's try fixing the while loop in https://github.com/nodejs/node/blob/main/lib/_http_agent.js#L267.
Thanks @DevasiaThomas for raising this and agreeing to work on it.
If it helps to add some more information / background from my work on this:
I think this relates to a bug I just raised in `node-fetch` which surfaces on Node.js 19+ (see https://github.com/node-fetch/node-fetch/issues/1735) - due to the default http agent changing to `keepalive: true` as of Node.js 19.
There I have a 100% reproduction of this issue - caused by a client sending a `Connection: close` header over the default agent with `keepalive: true`. I get ECONNRESET 100% of the time when the second request with the same default agent goes to the same host in the same event loop iteration as the one in which the first response was received. But if you do a `setTimeout(0)` between the requests, everything is fine.
The fact that this is 100% reproducible suggests to me that it's not simply a race condition but more a bug with the handling logic where Node should really be handling received FINs before it sends new requests.
Actually it suggests to me that there are perhaps two issues:

1. Node is receiving the FIN but not actioning it until the next event loop - i.e. still sending requests on this socket. I think this relates to your discussions above about half-open connections and the queues.
2. Node isn't noticing that the client sent a `Connection: close` header and closing the connection itself / not reusing the socket for future requests. I'm not 100% sure if this is desired behaviour, but it feels to me like the http agent should probably handle this better?

I think the discussion above deals with my point 1 - and might provide a decent test case.
Do you think it's worth me raising a separate issue for point 2? Or maybe this is handled by the suggestion to avoid half-open connections for new requests?
I was using node-fetch myself 😅. But I'm on Node.js 18. Idk about point 2, but point 1 is what I suspected initially. There are 2 possible issues I observed after digging through the code. I don't think we need to raise another issue. As soon as I'm done here I'll ping you and you can test if the fix works for you. How does that sound, David?
Sure, sounds good. Thanks DevasiaThomas!
@dhedey @DevasiaThomas - Was there a resolution to this issue?
@dimaman2001 I've not been working on it, I think it's just Devasia at the moment. But maybe someone else could take a look too if it hasn't progressed, or maybe @DevasiaThomas could share their work in progress.
Separately, I think @matthewkeil has done a little hunting and has some ideas in https://github.com/ChainSafe/lodestar/issues/5765#issuecomment-1647375023 which might possibly address issue 1 in my comment here: https://github.com/nodejs/node/issues/47130#issuecomment-1517171547
But I haven't dug into the code myself, and unfortunately won't have time for the next few months.
We are also hit by this (I'd call it a) bug and are looking forward to an update.
Disabling `keep-alive` or introducing retry logic seems to be a workaround for us.
Would you happen to have any idea how to implement the retry logic so that requests that are not idempotent do not create issues? Is it easy to check whether fetch has already finished sending the data (to the server) and prevent the retry?
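One heuristic, which the Node.js http docs themselves suggest for keep-alive agents, is to retry only when the error is an ECONNRESET on a *reused* socket (`req.reusedSocket`): a fresh socket that resets is a real failure, while a reused one most likely hit this stale-pool race. Gating additionally on idempotent methods, as in the sketch below, is my own conservative addition rather than something the docs mandate:

```js
// Retry heuristic sketch: an ECONNRESET on a *reused* keep-alive socket
// most likely means the pool handed out a connection the server had
// already closed, so a single retry is reasonable. For non-idempotent
// methods (e.g. POST) we refuse to retry blindly.
const IDEMPOTENT = new Set(['GET', 'HEAD', 'PUT', 'DELETE', 'OPTIONS']);

function shouldRetry(req, err, method) {
  return err.code === 'ECONNRESET' &&
         req.reusedSocket === true && // stale pooled socket, not a fresh dial
         IDEMPOTENT.has(method);      // don't repeat e.g. POST blindly
}
```

An application wrapper would call this in the request's 'error' handler and re-issue the request on a fresh connection at most once.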
Any updates on this one? It seems this issue is still relevant, which prevents us from upgrading to Node 20.
Any updates on this one? Still an issue on Node 20.10.x.
At this point I don't think much can be done in Node.js because our theory is that it is a race condition. It would be great to have a reproduction for this (cc @dhedey) with a client and server to prove or disprove this theory.
To mitigate, you can:
@mcollina I have a (relatively) simple reproduction in this PR https://github.com/dhedey/node/pull/1 where a server doesn't send back a `Connection: close` header but closes the connection anyway (I believe similar to `axum` on receiving a `Connection: close` header). Whilst I agree it's a race condition, I believe it's a race condition which should be fixable inside Node.js (although it might not be easy).
Basically whenever the server closes a TCP connection, the http agent with keep-alive: true is likely to cause a request error for one or more queued requests or requests submitted in the same event loop iteration as the previous response.
The repro seems to back up my understanding that the http agent is too slow to remove server-closed sockets from its pool, thereby re-using them and immediately triggering a "socket hang up" (ECONNRESET) inside the http client the next time the socket is used, via the `socketCloseListener` (which turns a socket "close" event into a request "error").
It's even easier to repro if you use `keepAlive: true` and `maxSockets: 1` on your http agent, send two requests, and include a `Connection: close` header in your http request. This will result in the server closing the connection immediately on you with a FIN packet. When the first response is returned, the http agent will immediately start on the queued second request, re-using the old socket which has been closed by the server, and trigger this bug.
Even without that `maxSockets` setting, this bug will still manifest if there is a little time between a response and the next request, when the http agent can re-use the socket anyway.
Now that we have a repro in the node code base itself, I'll try to find some time soon to have a play with the implementation itself, and propose / implement some fixes to it.
I've spent most of the rest of the day on this, and I'm afraid with my newbie energy I was a little too optimistic, and I think for the greater good I'll have to admit defeat and stop working on this.
The earlier messages in this issue cover a lot of good ground, and are likely better places to get up to speed than this post.
This issue crops up where a server closes the TCP connection (with a "FIN"); while in the mean-time the http agent believes the connection is suitable to be kept alive and routes another request to it.
To hit this issue requires:

1. `keep-alive: true` to be set on the http agent. The default http agent now has this, which is why this issue got more attention in the last year or so.
2. A server which doesn't send a `Connection: close` HTTP header on a response before closing the socket with a FIN at the TCP layer (if the http agent sees such a response header, it does correctly ensure the socket is not 'kept alive' after the response completes).

At its core, this is a race condition - however, the multiple ticks between the socket's "free" event firing as a result of parsing the http response and the socket "end" or "close" event firing mean that there's actually quite a large window for the condition to be hit, and the socket re-assigned to another request. Typically the socket has already received the FIN, and maybe even ACK'd it, whilst Node is still reusing the connection.
I tried picking up part 2 as DevasiaThomas mentioned here: https://github.com/nodejs/node/issues/47130#issuecomment-1476079996 - essentially improving the handling of freeing to reduce the window/ticks where we've received the FIN but can re-assign the socket. But as detailed in the logs below, no flags are yet set to indicate a FIN has been received by the time the socket is freed and re-assigned.
Setting max sockets to 1 (which seems to give the more straight-forward timeline to reason about), we get a timeline (amended from the test in https://github.com/dhedey/node/pull/1) as follows:
```
CLIENT_SOCKET_READY: Request 1
SERVER_SOCKET_OPEN: 1
SERVER_SOCKET_END: 1 (i.e. sends FIN)
CLIENT_RESPONSE_END: Request 1 (i.e. HTTP response parser completes)
CLIENT_SOCKET_FREE: Request 1 (which triggers CLIENT_AGENT_SOCKET_FREE)
CLIENT_SOCKET_REUSED: Request 1 for Request 2 (i.e. the "socket" event on request 2 uses the same socket)
  > With socket._readableState.ended: false
  > With socket._readableState.endEmitted: false
  > With socket._writableState.ending: false
  > With socket._writableState.ended: false
  > With socket._writableState.finished: false
CLIENT_SOCKET_END: Request 1
<CLIENT_SOCKET_CLOSED then fires which causes Request 2 to fail>
```
Sadly, at the time `free` is called, none of the flags are set as part of end-processing yet; so a fix here, I imagine, would need to look at the code which handles the packet containing TCP data + FIN, and ideally share the knowledge of the FIN before sharing the data (or as soon as possible afterwards).
I looked through:

- `readable.js`, `writable.js`, `duplex.js`
- `net.js`, `tcp_wrap.cc`
- `_http_client.js`, `_http_agent.js`

But with my newbie knowledge I couldn't find where the FIN handling and the triggering of the "end" event occurred (unless it came from some read call / EOF in `Readable.prototype.read`, which gets triggered via `afterConnect` in `net.js` if the socket is no longer readable?).
I guess if the `end`/`close` event was only triggered by attempting to read from the socket, that might explain why this bug is so noticeable - but I'm probably wrong, and I think I'm far past my limits here in terms of code-base expertise, so will need to take a step back.
Someone with more knowledge of how this pipeline works may be able to point at how/whether it may be possible to capture/mark the received FIN earlier, before the socket close handlers are processed a few ticks later.
Some possible future work - note, I'm afraid this isn't something I can do, I'm just posting these in case they may be useful to others:

1. (As in `node-fetch` before my bug fix), we could avoid this issue if we simply set `req.shouldKeepAlive` to `false` when the user of the http client adds their own `Connection: close` header (thereby overriding the `Connection: keep-alive` header which the agent uses). Note - as mentioned in part 2 of the summarisation, if the server replies with a `Connection: close` header, the issue also doesn't occur - but many servers such as `axum` don't send their own `Connection: close` header on the response when the client added it themselves.
2. Applying the `Keep-Alive` hint headers that the server provides.
3. A more general retry-on-failure approach - but this would be a lot of work, and it sounds like the new replacement `undici` http client has already fixed this issue, so it's probably not on the cards?

With lots of more knowledgeable people advising using `undici` (the new HTTP 1.1 client) or disabling keep alive, there's not much more I can offer here at the moment.
Sorry I couldn't be more use!
I would strongly advise doing the following:

- Set `keepAlive` and `keepAliveMsecs` appropriately.
- If you have downstream services, ensure that you don't have any restarts within the service -> check graceful termination.

I believe I may have found another test case that highlights this same issue. It can clearly show how the node http agent will assign a dead socket to a new request, presumably because the close event handler is not given a chance to process. It executes too far down the event loop.
The issue is as follows:
```
Error: read ECONNRESET
    at TCP.onStreamRead (node:internal/stream_base_commons:217:20) {
  errno: -4077,
  code: 'ECONNRESET',
  syscall: 'read'
}

Node.js v20.11.0
```
Note that if you defer the second request using `setTimeout(0)`, it works fine.
Here's the test code:
```js
////
//// SERVER
////
const server = require('node:http').createServer((req, res) => {
  console.log('SERVER: Received:', req.headers);
  res.end('OK');
}).listen(3000);

////
//// CLIENT
////
const request = async (url, options) => {
  return new Promise((resolve, reject) => {
    const req = require('node:http').request(url, options, (res) => {
      res.on('data', (data) => console.log('CLIENT: Received: ' + data.toString(), res.headers));
      res.on('end', resolve);
    });
    req.on('error', reject);
    req.end();
  });
};

(async () => {
  ////
  //// 1. Send a request and wait for the response
  ////
  console.log('CLIENT: Send (1) ...');
  await request('http://localhost:3000');

  ////
  //// 2. Stay busy long enough for the server to close the connection.
  ////
  console.log('CLIENT: Busy....');
  const end = Date.now() + 10000; // nodejs default http server keep-alive is 5 seconds
  while (end > Date.now());

  ////
  //// 3. Send a second request and await the response
  ////
  // PROBLEM: this request will go over the same socket that was created for the
  // first request, which by now has timed out and was closed on the server side
  //
  // WORKAROUND: defer the request to another event loop iteration (by uncommenting below):
  // await require('node:util').promisify(setTimeout)(0);
  console.log('CLIENT: Send (2) ...');
  await request('http://localhost:3000');

  server.close();
})();
```
There's probably not much that can be done with regards to handling the socket close in a more timely way in this situation, but is there anything that can be done in the agent to prevent assigning a dead socket?
The repro is quite interesting, thanks for that.
The only way to solve this would be to detect the `ECONNRESET` internally and eventually retry with a new socket. But the fix would happen at the agent level, I guess.
That's a nice clear (specific) repro; and this is a good thought experiment:
> There's probably not much that can be done with regards to handling the socket close in a more timely way in this situation, but is there anything that can be done in the agent to prevent assigning a dead socket?
Indeed, it makes sense that there's not been a chance for node to action the received TCP close because of the while loop in this case.
As per @ShogunPanda's thinking:

> The only way to solve this would be to detect the `ECONNRESET` internally and eventually retry with a new socket. But the fix would happen at the agent level, I guess.
In this case, yeah adding a retry when sending down a closed socket sounds like the most fruitful approach (option 3 in the Future Work I laid out).
I'm slightly worried that retrying on `ECONNRESET` might not be quite specific enough, as I believe it covers a few different error cases, and we may wish the retry to be specific to this "attempted use after closed" case. More specifically, I imagine we may wish to avoid retrying if the socket closes midway through sending data (rather than being already closed), in case this breaks user expectations around e.g. not sending duplicate POST requests (even if such expectations aren't actually guaranteed in general).
But I really don't know enough about this to comment on whether this is actually a problem. Maybe catching all such `ECONNRESET`s is just the correct approach! The fact a TCP reset ends up coming out of an http socket feels like a bit of an abstraction break. Although if we've already started sending data from the stream, we may be unable to just restart.
Perhaps we can check whether the socket is still accessible before using it (via changes in Node's C layer - some call into the OS TCP layer, or saving a flag earlier when the socket state changed: `tcp_wrap.cc` / `uv_common.cc` / `tcp.c` / `stream.c`), and throw a specific error at the socket/stream level, which could be used to close the http connection and retry on a new one at the http agent level - although this post about Java suggests it's standard practice to attempt a read to detect that, which I believe is what Node is doing already when we get the `ECONNRESET`.
In any case, I am far out of my depth, so will leave all this thinking to people who know a lot more than I! I'm glad it's in such hands now.
Version
v18.15.0
Platform
Running this in a Kubernetes cluster inside a nodejs:lts container. Details for the K8s node:

```json
"nodeInfo": {
  "architecture": "amd64",
  "bootID": "a5847523-d870-4bc0-9f1a-7a4ed9885ca3",
  "containerRuntimeVersion": "docker://20.10.12",
  "kernelVersion": "3.10.0-1160.53.1.el7.x86_64",
  "kubeProxyVersion": "v1.21.12",
  "kubeletVersion": "v1.21.12",
  "machineID": "",
  "operatingSystem": "linux",
  "osImage": "Red Hat Enterprise Linux",
  "systemUUID": "290B2642-439A-2099-6F2B-7ED84D7C284B"
},
```
Subsystem
http or net
What steps will reproduce the bug?
Run non-stop HTTP/1.1 requests over a keepAlive connection pool using `node-fetch` or `axios` against any server that closes the connection from its side after 'N' seconds. In my case the server did a 20-second close. There might be multiple requests to the server over the connection, but if 20 seconds have elapsed it closes the connection after the last http response is sent. I have a default Node.js client configuration (I haven't assigned it more threads or anything). When configuring the custom http agent on the client side, I supplied `{keepAlive: true, maxSockets: 50}`.
.How often does it reproduce? Is there a required condition?
When there's a lot of requests being sent constantly, things are fine, but if there is a slowdown (not many things to do, hence fewer requests go out), the next request usually ends up getting an ECONNRESET.
Based on my TCP dump, when there's a good load of requests over the connection pool and the server sends a [FIN, ACK], the client sends a [FIN, ACK] back, the server sends an ACK, and the connection closes successfully.
But when there is a "lull" later and there are not enough requests over the pool, the server sends a [FIN, ACK] for an unused connection in the pool, the Node.js client responds with an [ACK], and the next request in the queue goes on this socket, causing the server to respond with a RESET (rightly so - the server wanted to close the connection).
Now I believe the reason for the next request to go on the socket that just got the FIN probably has to do with the connection-choosing strategy. I think the default in both these frameworks is LIFO, and the ACK (without the FIN) that gets sent makes the connection go to the top of the pool for the next request.
What is the expected behavior? Why is that the expected behavior?
A socket closed from the server side (FIN, ACK sent by the server) must be removed from the connection pool instead of being kept in there - regardless of the fact that a FIN wasn't sent back. And no future requests should go on it.
What do you see instead?
The connection stays in the pool if the FIN wasn't sent back. The next request goes on the connection. The server forcibly closes the connection with a RESET.
Additional information
I tried a few other frameworks apart from node-fetch and axios (same issue, making me think it's a node core problem) - but I can't use them in my code, so I'm not mentioning them.
When I reduced `maxSockets` from 50 to 20, the issue happened less frequently, which is why I think it is related to activity on those sockets. I switched to the `keepaliveagent` package, which has a SocketTTL feature - it helps, but doesn't solve the problem. Resets still happen (same issue). It seems this issue was reported there and they tried to handle it there (it's still a problem though). I'm assuming this issue has the same problem I'm facing - they were using `keepaliveagent` as well.