Closed - DevasiaThomas closed this 2 months ago
I've read your description of the problem but I don't think it's actionable without a way to reproduce it that doesn't depend on third-party modules like axios. Can you add a test case?
@bnoordhuis I'll see if I can come up with something. Since this happened across multiple frameworks, this should be reproducible using just the core stuff. Gimme this weekend :) I was hoping you'd check the TCP FIN handling part of the code and see how the socket is treated on the client side on receiving a FIN. If it doesn't remove the connection from the pool before doing anything else - that would be the problem.
Node handles it when ECONNRESET happens when the socket is inside the socket pool but there's a fundamental (and inevitable) race condition when that happens just when the socket is being pulled from the pool.
Node could theoretically use another socket if the current socket fails on first write but without a circuit breaker or other fail-safes that creates a DoS-like scenario when the server immediately closes incoming connections. Node would basically busy-loop trying to send the request.
I understand the inevitable race that can happen, and I agree you have to retry in such a scenario, and it's best that the application takes care of that. But based on my TCP dump, the server sends a FIN, which causes the client to send an ACK (the socket on the client transitions to CLOSE_WAIT). Whatever happens after the ACK is sent causes the socket to get pushed to the top of the connection pool (which is why the very next request goes on it). So, there is time for the LIFO algorithm or underlying code to push the connection to the top of the pool - the same code can remove it from the pool instead of pushing it to the top when the ACK is for a FIN. This is what I wanted to highlight. I am working on a reproducible example. I hope I am successful with it.
@bnoordhuis So, I don't have guaranteed replicability - but it happens over time (in 5 of the runs I got it within the first 5 minutes; in 2 other runs I had to leave it overnight). I'm running them on different physical boxes. I couldn't get `resetAndDestroy` to actually send a TCP RST, but the prints happen and socket hangups get reported on the client end. If you can figure out how to send the RST, you'll get the ECONNRESET as reported above. The code tries to mimic what happened in my firm's environment.
@nodejs/http can you take a look?
To surface the problem more frequently, I tried to randomly close ~1% of the connections after the response was ended, so this

```js
res.socket.requestInFlight = false;
if (res.socket.closeAfterResponse) {
```

changed to this

```js
res.socket.requestInFlight = false;
let random = Math.random();
if (random > 0.98 || res.socket.closeAfterResponse) {
```

You can change the number to increase the percentage of connections closed by the server. This is what the TCP dump will look like (if my Node.js server would actually send the reset).
The cool thing I saw in the dump was that these connections were idle till the FIN showed up. Which is why I'm emphasizing that there is some logic causing an outgoing ACK (not FIN,ACK) to an incoming FIN (on my Node.js client) to make the connection appear active, which further causes the LIFO scheduling (this happens in FIFO as well) to push that connection to the top of the pool instead of marking it as closed from the pool's perspective.
Haven't looked into this in depth but I think this is expected. There is a race condition in the http/1.1 spec for keep alive requests. This can be mostly avoided by applying the keep alive hints that the server provides.
Undici does that and I would recommend you switch to that.
> Haven't looked into this in depth but I think this is expected. There is a race condition in the http/1.1 spec for keep alive requests. This can be mostly avoided by applying the keep alive hints that the server provides.
I'm sorry, what do you mean by hints? Is the hint part of the spec, or implemented by Node.js servers? I don't know about this, so if you could point me to the info regarding it, I would greatly appreciate it.
> Undici does that and I would recommend you switch to that.
Can't - it's not that simple. Our REST API interactions are all generated code based on things like OpenAPI and Protobuf schemas. A quick Google search does not find me a generator that uses Undici. If we had to switch, we'd probably choose another language at that point. My colleagues have spent quite a bit of time on this and we don't believe it's the race mentioned in the spec regarding keep-alive. (We haven't faced this problem in Rust and Python implementations of core HTTP libs.)
Based on the tcpdump timelines and the ordering of which connection gets the next request, it's pretty clear that Node.js has the time to apply the LIFO algorithm to place the socket that just sent the ACK to the incoming FIN at the top of the pool. Which also means you have the same amount of time to take that socket out of the connection pool. I believe this is fixable. If you can show me where the code that handles the connection pool management is, I can see if I can submit a PR.
MDN calls it parameters. See the timeout parameter. https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Keep-Alive
The client should not make any request on a connection that has been idle more than timeout duration (minus some threshold).
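For illustration, "applying the hint" could look roughly like the sketch below. This is not undici's actual implementation; `idleDeadlineMs` and `mayReuse` are made-up helper names. The idea is to parse the server's `Keep-Alive: timeout=...` response header and refuse to reuse any pooled socket that has been idle longer than that timeout minus a safety threshold.

```js
// Sketch: honor the server's Keep-Alive timeout hint on the client.
// Given a response header like "Keep-Alive: timeout=5, max=1000",
// compute how long the socket may sit idle before it must not be
// reused (minus a safety threshold).
function idleDeadlineMs(keepAliveHeader, thresholdMs = 1000) {
  const match = /timeout=(\d+)/.exec(keepAliveHeader || '');
  if (!match) return null; // no hint: fall back to agent defaults
  return Number(match[1]) * 1000 - thresholdMs;
}

// A pool could then evict sockets whose idle time exceeds the deadline:
function mayReuse(socketIdleMs, keepAliveHeader) {
  const deadline = idleDeadlineMs(keepAliveHeader);
  return deadline === null || socketIdleMs < deadline;
}
```

This narrows the race window but cannot eliminate it: the server may still close early, which is why the thread also discusses eviction and retry.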
> The client should not make any request on a connection that has been idle more than timeout duration (minus some threshold).
@ronag , timeout is not the problem here. Neither is the concept of keep alive. Please have a deep look into my issue. Thanks :)
@mcollina I think this is something we've resolved in undici. This kind of falls under http core client maintenance...
@DevasiaThomas I believe this is just the known issue as mentioned by @ronag and @bnoordhuis. For your curiosity, as to why
> an outgoing ACK (not FIN,ACK) to an incoming FIN (on my nodejs client) to appear as an active connection which further causes the lifo scheduling (this happens in fifo as well) to push that connection to the top of the pool
it's because your server code sends an HTTP response indicating keep-alive, followed by a TCP FIN:
```js
async function handleRequest(req, res) {
  // ...
  res.socket.requestInFlight = false;
  if (res.socket.closeAfterResponse) {
    // Send FIN
    // console.log('Close after response' + res.socket.remotePort);
    res.socket.end();
  }
  // ...
}
```
The Node.js client does not take the TCP FIN as a signal against reusing the connection, but it does take the HTTP response's keep-alive indication into account.
FYI, the TCP FIN, in Node.js terms, is the 'end' event of a socket. I don't see any code in the http agent that makes use of this event to decide whether a socket can be reused.
> FYI, the TCP FIN, in Node.js terms, is the 'end' event of a socket. I don't see any code in the http agent that makes use of this event to decide whether a socket can be reused.
@ywave620 There are 2 behaviors on an incoming FIN,ACK from the server, based on my TCP dump.
Node.js Client sends FIN, ACK --> This has no problem, works fine
This is the default behavior of a Node.js client/server, i.e. Node.js automatically closes any half-open connection. This behavior can be overridden by setting `allowHalfOpen` to `true` in the options passed to the `http` or `net` module when creating outbound requests/connections or creating a server.
Node.js Client sends ACK --> Idk what code path makes this happen, but this also causes the socket to go to the top of the connection pool, because the very next request goes on this socket.
This is worth further digging. Did you spot the Node.js client sending a FIN later in this weird scenario?
Also note that `allowHalfOpen` is not officially documented in the `http` module. Is it intended, or did we miss it? @ronag
> Node.js Client sends ACK --> Idk what code path makes this happen, but this also causes the socket to go to the top of the connection pool, because the very next request goes on this socket.
> This is worth further digging. Did you spot the Node.js client sending a FIN later in this weird scenario?
Nope. Since the client sends the next request right away, the server sends a reset; the client has no opportunity to send a FIN.
I haven't passed the `allowHalfOpen` option anywhere, so I can't comment on that.
@ywave620 I just checked the `http.createServer` call path real quick, and it seems that the Server creates a `net.Server` with `allowHalfOpen` always set to `true`. Please confirm. But my issue isn't with the server but the client. I'll check that as well.
EDIT: idk if that is only for the listening socket or applies to all incoming sockets for the server.
`allowHalfOpen` applies to either client or server. It makes sense at the `net` abstraction level; however, it is copied into the options passed to `net` by `http`. E.g. `net.createConnection` inherits the option from `http.request`, which is why `http.request({allowHalfOpen: true})` works. Although I don't know whether this is the desired usage - in other words, should it be considered a backdoor?
I think your best bet is to either disable keep alive or switch to undici.
> I think your best bet is to either disable keep alive or switch to undici.
You've mentioned this already. Neither of which I can do. However, I'm fairly certain there's a bug here - please let it get fixed.
Would you like to send a pull request to resolve this issue? Investigating this problem would likely require multiple days of work, and I'm not convinced either @ronag, myself or other members of @nodejs/http have the time to investigate.
@mcollina Sure I can try :)
> there is some logic causing an outgoing ACK (not FIN,ACK) to an incoming FIN (on my nodejs client) to appear as an active connection which further causes the lifo scheduling (this happens in fifo as well) to push that connection to the top of the pool instead of marking it as closed from the pool's perspective
Just an observation from reading lib/_http_agent.js but I don't think that's possible. Sockets are removed from the FIFO/LIFO queue but sockets in the queue are never reordered.
(It's the `.freeSockets` property, keyed by host.)
What I think can happen is this: destroyed sockets aren't immediately removed from the queue; they're simply discarded (in [FL]IFO order) when the agent looks for a live socket, by checking `socket.destroyed === true`.
Assume for simplicity that all sockets expire at the same time, e.g., because the network went down. Node gets notified by the operating system on a per-socket basis. Crucially, node may not get those notifications all at the same time.
For the next request, the agent discards the ones it knows are dead. It picks the first one it believes is alive - except it isn't, we just haven't received the notification yet. It's basically the race condition I mentioned earlier.
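In pseudocode, the lazy eviction described above looks something like this. This is a simplified sketch, not the actual lib/_http_agent.js code, and `pickFreeSocket` is a made-up name.

```js
// Sketch of lazy eviction: destroyed sockets stay in the free list and
// are only discarded when the agent goes looking for a live one.
function pickFreeSocket(freeSockets, scheduling = 'lifo') {
  while (freeSockets.length > 0) {
    const socket = scheduling === 'lifo'
      ? freeSockets.pop()     // take from the tail
      : freeSockets.shift();  // take from the head
    if (!socket.destroyed) {
      return socket; // believed alive -- but the FIN/RST notification
                     // for it may simply not have arrived yet
    }
    // destroyed: drop it and keep looking
  }
  return null; // caller must create a fresh connection
}
```

The race lives in the comment: a socket with `destroyed === false` can still have a FIN already sitting in the kernel buffer.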
@bnoordhuis Thank you for looking into it :) I just reached that part of the code.
Looking for your thoughts on the below:
1. In `addRequest`, the code that clears stale free sockets does not follow the direction of the scheduling algorithm. Any particular reason for this? If it was `lifo`, you would get a socket from the other end, which was not checked for staleness.
2. In places like `addRequest`, on `'free'`, and in `removeSocket`, instead of just `destroyed` or `!writable`, could we check for half-closed scenarios in case `allowHalfOpen` is not set -> my case, where the server sent the FIN, so it's the readable end that was closed but not (yet) the writable one.
> In `addRequest`, the code that clears stale free sockets does not follow the direction of the scheduling algorithm. Any particular reason for this? If it was `lifo`, you would get a socket from the other end, which was not checked for staleness.
No specific need. That block executes synchronously, so there is no possible race condition there.
Well, I guess that is a bug, if perhaps not the bug. With `scheduling === 'lifo'`, an already destroyed socket can get popped from the tail, because the cleanup code only shifts them from the head of the list.
(As a mechanical-sympathy aside: shifting the head isn't very efficient; it's better to pop the tail.)
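The mismatch can be shown in miniature with plain objects standing in for sockets (`cleanupFromHead` is a made-up name mirroring the head-shifting cleanup described above):

```js
// The cleanup pass shifts destroyed sockets off the *head* of the free
// list, while 'lifo' scheduling pops from the *tail* -- so a destroyed
// socket at the tail survives cleanup and is handed to the next request.
function cleanupFromHead(freeSockets) {
  while (freeSockets.length > 0 && freeSockets[0].destroyed) {
    freeSockets.shift(); // only ever inspects the head
  }
}

const freeSockets = [
  { id: 1, destroyed: false },
  { id: 2, destroyed: true }, // dead socket at the tail
];
cleanupFromHead(freeSockets); // head is live, so nothing is removed
const next = freeSockets.pop(); // lifo: tail first -> the dead socket
```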
> No specific need. That block executes synchronously, so there is no possible race condition there.
@mcollina were you referring to point 2?
@DevasiaThomas I was wrong, check the comment from @bnoordhuis.
@mcollina no worries, conflicting replies kinda confused me and I thought you were talking about point 2.
@bnoordhuis and @mcollina any thoughts on point 2? `destroyed === true` requires both ends to be closed, whereas `!writable` checks don't care if the connection was closed from the other side (which is the readable end - my case).
Side note: I did not pass `allowHalfOpen: true` in options, but @ywave620 surfaced that it is possible to do so.
If y'all agree, I'll get to working on both of them and submit a PR as soon as I can.
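For point 2, the stricter liveness check might look something like this sketch (`isReusable` is a made-up name; `socket.readableEnded` is the public flag that becomes `true` once the 'end' event has fired, i.e. the peer's FIN has been consumed):

```js
// Sketch of the stricter check proposed in point 2: treat a socket as
// unusable for the pool as soon as its readable side has ended (server
// sent FIN), not only once both sides are down and `destroyed` is true.
function isReusable(socket) {
  return !socket.destroyed &&
         socket.writable &&
         !socket.readableEnded; // half-closed by the peer: evict it
}
```

As noted in the later timeline logs, the catch is that none of these flags may be set yet at the moment the socket is freed, which limits how much this check alone can help.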
Let's try fixing the while loop in https://github.com/nodejs/node/blob/main/lib/_http_agent.js#L267.
Thanks @DevasiaThomas for raising this and agreeing to work on it.
If it helps to add some more information / background from my work on this:
I think this relates to a bug I just raised in `node-fetch` which surfaces on Node.js 19+ (see https://github.com/node-fetch/node-fetch/issues/1735) - due to the default http agent changing to `keepalive: true` as of Node.js 19.
There I have a 100% reproduction of this issue - caused by a client sending a `Connection: close` header over the default agent with `keepalive: true`. I get ECONNRESET 100% of the time when the second request with the same default agent goes to the same host in the same event loop iteration as the one in which the first response was received. But if you do a `setTimeout(0)` between the requests, everything is fine.
The fact that this is 100% reproducible suggests to me that it's not simply a race condition but more a bug with the handling logic where Node should really be handling received FINs before it sends new requests.
Actually it suggests to me that there are perhaps two issues:

1. Node is receiving the FIN but not actioning it until the next event loop - i.e. still sending requests on this socket. I think this relates to your discussions above about half-open connections and the queues.
2. Node isn't noticing that the client sent a `Connection: close` header and closing the connection itself / not reusing the socket for future requests. I'm not 100% sure if this is desired behaviour, but it feels to me like the http agent should probably handle this better?

I think the discussion above deals with my point 1 - and might provide a decent test case.
Do you think it's worth me raising a separate issue for point 2? Or maybe this is handled by the suggestion to avoid half-open connections for new requests?
I was using node-fetch myself 😅. But I'm on Node.js 18. Idk about point 2, but point 1 is what I suspected initially. There are 2 possible issues I observed after digging through the code. I don't think we need to raise another issue. As soon as I'm done here I'll ping you and you can test if the fix works for you. How does that sound, David?
Sure, sounds good. Thanks DevasiaThomas!
@dhedey @DevasiaThomas - Was there a resolution to this issue?
@dimaman2001 I've not been working on it, I think it's just Devasia at the moment. But maybe someone else could take a look too if it hasn't progressed, or maybe @DevasiaThomas could share their work in progress.
Separately, I think @matthewkeil has done a little hunting and has some ideas in https://github.com/ChainSafe/lodestar/issues/5765#issuecomment-1647375023 which might possibly address issue 1 in my comment here: https://github.com/nodejs/node/issues/47130#issuecomment-1517171547
But I haven't dug into the code myself, and unfortunately won't have time for the next few months.
We are also hit by this (I'd call it a) bug and are looking forward to an update.
Disabling `keep-alive` or introducing retry logic seems to be a workaround for us.
Would you happen to have any idea how to implement the retry logic so that requests that are not idempotent do not create issues? Is it easy to check whether fetch has already finished sending the data (to the server) and prevent the retry?
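One heuristic, which the Node.js http docs themselves suggest for keep-alive agents, is to retry only when the error is an ECONNRESET on a *reused* socket (`req.reusedSocket`): a fresh socket that resets is a real failure, while a reused one most likely hit this stale-pool race. Gating additionally on idempotent methods, as in the sketch below, is my own conservative addition rather than something the docs mandate:

```js
// Retry heuristic sketch: an ECONNRESET on a *reused* keep-alive socket
// most likely means the pool handed out a connection the server had
// already closed, so a single retry is reasonable. For non-idempotent
// methods (e.g. POST) we refuse to retry blindly.
const IDEMPOTENT = new Set(['GET', 'HEAD', 'PUT', 'DELETE', 'OPTIONS']);

function shouldRetry(req, err, method) {
  return err.code === 'ECONNRESET' &&
         req.reusedSocket === true && // stale pooled socket, not a fresh dial
         IDEMPOTENT.has(method);      // don't repeat e.g. POST blindly
}
```

An application wrapper would call this in the request's 'error' handler and re-issue the request on a fresh connection at most once.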
Any updates on this one? It seems this issue is still relevant, which prevents us from upgrading to Node 20.
Any updates on this one? Still an issue on Node 20.10.x.
At this point I don't think much can be done in Node.js because our theory is that it is a race condition. It would be great to have a reproduction for this (cc @dhedey) with a client and server to prove or disprove this theory.
To mitigate, you can:
@mcollina I have a (relatively) simple reproduction in this PR https://github.com/dhedey/node/pull/1 where a server doesn't send back a `Connection: close` header but closes the connection anyway (I believe similar to `axum` on receiving a `Connection: close` header). Whilst I agree it's a race condition, I believe it's a race condition which should be fixable inside Node.js (although it might not be easy).
Basically whenever the server closes a TCP connection, the http agent with keep-alive: true is likely to cause a request error for one or more queued requests or requests submitted in the same event loop iteration as the previous response.
The repro seems to back up my understanding that the http agent is too slow to remove server-closed sockets from its pool, thereby re-using them and immediately triggering a "socket hang up" (ECONNRESET) inside the http client the next time the socket is used, via the `socketCloseListener` (which turns a socket "close" event into a request "error").
It's even easier to repro if you use `keepAlive: true` and `maxSockets: 1` on your http agent, send two requests, and include a `Connection: close` header in your http request. This will result in the server closing the connection immediately on you with a FIN packet. When the first response is returned, the http agent will immediately start on the queued second request, re-using the old socket which has been closed by the server, and trigger this bug.
Even without that `maxSockets` setting, this bug will still manifest if there is a little time between a response and the next request, when the http agent can re-use the socket anyway.
Now that we have a repro in the node code base itself, I'll try to find some time soon to have a play with the implementation itself, and propose / implement some fixes to it.
I've spent most of the rest of the day on this, and I'm afraid with my newbie energy I was a little too optimistic, and I think for the greater good I'll have to admit defeat and stop working on this.
The earlier messages in this issue cover a lot of good ground, and are likely better places to get up to speed than this post.
This issue crops up where a server closes the TCP connection (with a "FIN"); while in the mean-time the http agent believes the connection is suitable to be kept alive and routes another request to it.
To hit this issue requires:

1. `keep-alive: true` to be set on the http agent. The default http agent now has this, which is why this issue got more attention in the last year or so.
2. A server which doesn't send a `Connection: close` HTTP header on a response before closing the socket with a FIN at the TCP layer (if the http agent sees such a response header, it does correctly ensure the socket is not 'kept alive' after the response completes).

At its core, this is a race condition - however, the multiple ticks between the socket's "free" event firing as a result of parsing the http response and the socket "end" or "close" event firing mean that there's actually quite a large window for the condition to be hit, and the socket re-assigned to another request. Typically the socket has already received the FIN, and maybe even ACK'd it, whilst Node is still reusing the connection.
I tried picking up part 2 as DevasiaThomas mentioned here: https://github.com/nodejs/node/issues/47130#issuecomment-1476079996 - essentially improving the handling of freeing to reduce the window/ticks where we've received the FIN but can re-assign the socket. But as detailed in the logs below, no flags are yet set to indicate a FIN has been received by the time the socket is freed and re-assigned.
Setting max sockets to 1 (which seems to give the more straight-forward timeline to reason about), we get a timeline (amended from the test in https://github.com/dhedey/node/pull/1) as follows:
```
CLIENT_SOCKET_READY: Request 1
SERVER_SOCKET_OPEN: 1
SERVER_SOCKET_END: 1 (i.e. sends FIN)
CLIENT_RESPONSE_END: Request 1 (i.e. HTTP response parser completes)
CLIENT_SOCKET_FREE: Request 1 (which triggers CLIENT_AGENT_SOCKET_FREE)
CLIENT_SOCKET_REUSED: Request 1 for Request 2 (i.e. the "socket" event on request 2 uses the same socket)
  > With socket._readableState.ended: false
  > With socket._readableState.endEmitted: false
  > With socket._writableState.ending: false
  > With socket._writableState.ended: false
  > With socket._writableState.finished: false
CLIENT_SOCKET_END: Request 1
<CLIENT_SOCKET_CLOSED then fires which causes Request 2 to fail>
```
Sadly, at the time `free` is called, none of the flags are set as part of end-processing yet; so a fix here, I imagine, would need to look at the code which handles the packet containing TCP data + FIN, and ideally share the knowledge of the FIN before sharing the data (or as soon as possible afterwards).
I looked through:

- `readable.js`, `writable.js`, `duplex.js`
- `net.js`, `tcp_wrap.cc`
- `_http_client.js`, `_http_agent.js`

But with my newbie knowledge I couldn't find where the FIN handling and the triggering of the "end" event occurred (unless it came from some read call / EOF in `Readable.prototype.read`, which gets triggered via `afterConnect` in `net.js` if the socket is no longer readable?).
I guess if the `end`/`close` event was only triggered by attempting to read from the socket, that might explain why this bug is so noticeable - but I'm probably wrong, and I think I'm far past my limits here in terms of code-base expertise, so will need to take a step back.
Someone with more knowledge of how this pipeline works may be able to point at how/whether it may be possible to capture/mark the received FIN earlier, before the socket close handlers are processed a few ticks later.
Some possible future work - note, I'm afraid this isn't something I can do, I'm just posting these in case they may be useful to others:

1. (As in `node-fetch` before my bug fix), we could avoid this issue if we simply set `req.shouldKeepAlive` to `false` when the user of the http client adds their own `Connection: close` header (thereby overriding the `Connection: keep-alive` header which the agent uses). Note - as mentioned in part 2 of the summarisation, if the server replies with a `Connection: close` header, the issue also doesn't occur - but many servers such as `axum` don't send their own `Connection: close` header on the response when the client added it themselves.
2. Applying the `Keep-Alive` hint headers that the server provides.
3. A more general retry-on-failure approach - but this would be a lot of work, and it sounds like the new replacement `undici` http client has already fixed this issue, so it's probably not on the cards?

With lots of more knowledgeable people advising using `undici` (the new HTTP 1.1 client) or disabling keep alive, there's not much more I can offer here at the moment.
Sorry I couldn't be more use!
I would strongly advise doing the following:

- Set `keepAlive` and `keepAliveMsecs` appropriately.
- If you have downstream services, ensure that you don't have any restarts within the service -> check graceful termination.

I believe I may have found another test case that highlights this same issue. It can clearly show how the node http agent will assign a dead socket to a new request, presumably because the close event handler is not given a chance to process. It executes too far down the event loop.
The issue is as follows:
```
Error: read ECONNRESET
    at TCP.onStreamRead (node:internal/stream_base_commons:217:20) {
  errno: -4077,
  code: 'ECONNRESET',
  syscall: 'read'
}

Node.js v20.11.0
```
Note that if you defer the second request using `setTimeout(0)`, it works fine.
Here's the test code:
```js
////
//// SERVER
////
const server = require('node:http').createServer((req, res) => {
  console.log('SERVER: Received:', req.headers);
  res.end('OK');
}).listen(3000);

////
//// CLIENT
////
const request = async (url, options) => {
  return new Promise((resolve, reject) => {
    const req = require('node:http').request(url, options, (res) => {
      res.on('data', (data) => console.log('CLIENT: Received: ' + data.toString(), res.headers));
      res.on('end', resolve);
    });
    req.on('error', reject);
    req.end();
  });
};

(async () => {
  ////
  //// 1. Send a request and wait for the response
  ////
  console.log('CLIENT: Send (1) ...');
  await request('http://localhost:3000');

  ////
  //// 2. Stay busy long enough for the server to close the connection.
  ////
  console.log('CLIENT: Busy....');
  const end = Date.now() + 10000; // nodejs default http server keep-alive is 5 seconds
  while (end > Date.now());

  ////
  //// 3. Send a second request and await the response
  ////
  // PROBLEM: this request will go over the same socket that was created for the
  // first request, which by now has timed out and was closed on the server side
  //
  // WORKAROUND: defer the request to another event loop iteration (by uncommenting below):
  // await require('node:util').promisify(setTimeout)(0);
  console.log('CLIENT: Send (2) ...');
  await request('http://localhost:3000');

  server.close();
})();
```
There's probably not much that can be done with regards to handling the socket close in a more timely way in this situation, but is there anything that can be done in the agent to prevent assigning a dead socket?
The repro is quite interesting, thanks for that.
The only way to solve this would be to detect the `ECONNRESET` internally and eventually retry with a new socket. But the fix would happen at the agent level, I guess.
That's a nice clear (specific) repro; and this is a good thought experiment:
> There's probably not much that can be done with regards to handling the socket close in a more timely way in this situation, but is there anything that can be done in the agent to prevent assigning a dead socket?
Indeed, it makes sense that there's not been a chance for node to action the received TCP close because of the while loop in this case.
As per @ShogunPanda's thinking:

> The only way to solve this would be to detect the `ECONNRESET` internally and eventually retry with a new socket. But the fix would happen at the agent level, I guess.
In this case, yeah adding a retry when sending down a closed socket sounds like the most fruitful approach (option 3 in the Future Work I laid out).
I'm slightly worried that retrying on `ECONNRESET` might not be quite specific enough, as I believe it covers a few different error cases, and we may wish the retry to be specific to this "attempted use after closed" case. More specifically, I imagine we may wish to avoid retrying if the socket closes midway through sending data (rather than being already closed), in case this breaks user expectations around e.g. not sending duplicate POST requests (even if such expectations aren't actually guaranteed in general).
But I really don't know enough about this to comment on whether this is actually a problem. Maybe catching all such `ECONNRESET`s is just the correct approach! The fact a TCP reset ends up coming out of an http socket feels like a bit of an abstraction break. Although if we've already started sending data from the stream, we may be unable to just restart.
Perhaps we can check whether the socket is still accessible before using it (via changes in Node's C layer - some call into the OS TCP layer, or saving a flag earlier when the socket state changed: `tcp_wrap.cc` / `uv_common.cc` / `tcp.c` / `stream.c`), and throw a specific error at the socket/stream level, which could be used to close the http connection and retry on a new one at the http agent level - although this post about Java suggests it's standard practice to attempt a read to detect that, which I believe is what Node is doing already when we get the `ECONNRESET`.
In any case, I am far out of my depth, so will leave all this thinking to people who know a lot more than I! I'm glad it's in such hands now.
Version
v18.15.0
Platform
Running this in a Kubernetes cluster inside a nodejs:lts container. Details for the K8s node:

```json
"nodeInfo": {
  "architecture": "amd64",
  "bootID": "a5847523-d870-4bc0-9f1a-7a4ed9885ca3",
  "containerRuntimeVersion": "docker://20.10.12",
  "kernelVersion": "3.10.0-1160.53.1.el7.x86_64",
  "kubeProxyVersion": "v1.21.12",
  "kubeletVersion": "v1.21.12",
  "machineID": "",
  "operatingSystem": "linux",
  "osImage": "Red Hat Enterprise Linux",
  "systemUUID": "290B2642-439A-2099-6F2B-7ED84D7C284B"
},
```
Subsystem
http or net
What steps will reproduce the bug?
Run non-stop HTTP/1.1 requests over a keepAlive connection pool using `node-fetch` or `axios` against any server that closes the connection from its side after 'N' seconds. In my case the server did a 20-second close. There might be multiple requests to the server over the connection, but if 20 seconds have elapsed it closes the connection after the last http response is sent. I have a default Node.js client configuration (I haven't assigned it more threads or anything). When configuring the custom http agent on the client side, I supplied `{keepAlive: true, maxSockets: 50}`.
.How often does it reproduce? Is there a required condition?
When there's a lot of requests being sent constantly, things are fine, but if there is a slowdown (not many things to do, hence fewer requests go out), the next request usually ends up getting an ECONNRESET.
Based on my TCP dump, when there's a good load of requests over the connection pool and the server sends a [FIN, ACK], the client sends a [FIN, ACK] back, the server sends an ACK, and the connection closes successfully.
But when there is a "lull" later and there are not enough requests over the pool, the server sends a [FIN, ACK] for an unused connection in the pool, the Node.js client responds with an [ACK], and the next request in the queue goes on this socket, causing the server to respond with a RESET (rightly so - the server wanted to close the connection).
Now I believe the reason for the next request to go on the socket that just got the FIN probably has to do with the connection-choosing strategy. I think the default in both these frameworks is LIFO, and the ACK (without the FIN) that gets sent makes the connection go to the top of the pool for the next request.
What is the expected behavior? Why is that the expected behavior?
A socket closed from the server side (FIN, ACK sent by the server) must be removed from the connection pool instead of being kept in there - regardless of the fact that a FIN wasn't sent back. And no future requests should go on it.
What do you see instead?
The connection stays in the pool if the FIN wasn't sent back. The next request goes on the connection. The server forcibly closes the connection with a RESET.
Additional information
I tried a few other frameworks apart from node-fetch and axios (same issue, making me think it's a node core problem) - but I can't use them in my code, so I'm not mentioning them.
When I reduced `maxSockets` from 50 to 20, the issue happened less frequently, which is why I think it is related to activity on those sockets. I switched to the `keepaliveagent` package, which has a SocketTTL feature - it helps, but doesn't solve the problem. Resets still happen (same issue). It seems this issue was reported there and they tried to handle it there (it's still a problem though). I'm assuming this issue has the same problem I'm facing - they were using `keepaliveagent` as well.