
ECONNRESET while doing http 1.1 keep alive requests and server closes the connections #47130

Closed DevasiaThomas closed 2 months ago

DevasiaThomas commented 1 year ago

Version

v18.15.0

Platform

Running this in a Kubernetes cluster inside a nodejs:lts container. Details for the K8s node:

    "nodeInfo": {
        "architecture": "amd64",
        "bootID": "a5847523-d870-4bc0-9f1a-7a4ed9885ca3",
        "containerRuntimeVersion": "docker://20.10.12",
        "kernelVersion": "3.10.0-1160.53.1.el7.x86_64",
        "kubeProxyVersion": "v1.21.12",
        "kubeletVersion": "v1.21.12",
        "machineID": "",
        "operatingSystem": "linux",
        "osImage": "Red Hat Enterprise Linux",
        "systemUUID": "290B2642-439A-2099-6F2B-7ED84D7C284B"
    },

Subsystem

http or net

What steps will reproduce the bug?

Run non-stop HTTP/1.1 requests over a keep-alive connection pool using node-fetch or axios against any server that closes the connection from its side after N seconds. In my case the server did a 20-second close: there might be multiple requests to the server over the connection, but once 20 seconds have elapsed it closes the connection after the last HTTP response is sent. I have a default Node.js client configuration (I haven't assigned it more threads or anything).

When configuring the custom HTTP agent on the client side, I supplied {keepAlive: true, maxSockets: 50}.
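A minimal sketch of that setup (the endpoint and the request loop are illustrative assumptions, not the real workload):

```ts
import http from "node:http";
import axios from "axios";

// Keep-alive pool capped at 50 sockets, as described above.
const httpAgent = new http.Agent({ keepAlive: true, maxSockets: 50 });

// Hypothetical endpoint; any server that closes idle connections after ~20s will do.
const url = "http://upstream.internal/api";

async function run(): Promise<void> {
  for (;;) {
    // During a lull, the next request can land on a socket the server
    // has already FIN'd, surfacing as ECONNRESET.
    await axios.get(url, { httpAgent });
  }
}

run().catch(console.error);
```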

How often does it reproduce? Is there a required condition?

When there are a lot of requests being sent constantly, things are fine; but if there is a slowdown (not much going on, hence fewer requests going out), the next request usually ends up getting an ECONNRESET.

Based on the TCP dump I have: when there's a good load of requests over the connection pool and the server sends a [FIN, ACK], the client sends a [FIN, ACK] back, the server sends an ACK, and the connection closes successfully.

But when there is a "lull" later and there aren't enough requests over the pool, the server sends a [FIN, ACK] for an unused connection in the pool, the Node.js client responds with just an [ACK], and the next request in the queue goes out on this socket, causing the server to respond with a RESET. (Rightly so, because the server wanted to close the connection.)

Now I believe the reason the next request goes out on the socket that just got the FIN probably has to do with the connection-choosing strategy. I think the default in both these frameworks is LIFO, and the ACK (without the FIN) that gets sent moves the connection to the top of the pool for the next request.

What is the expected behavior? Why is that the expected behavior?

A socket closed from the server side (FIN, ACK sent by the server) must be removed from the connection pool instead of being kept there, regardless of the fact that a FIN wasn't sent back, and no future requests should go out on it.
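A speculative sketch of that expectation, using the documented Agent hook keepSocketAlive() (this only refuses to pool a socket whose FIN has already been seen; a FIN arriving after the socket is pooled is not caught):

```ts
import http from "node:http";
import type { Socket } from "node:net";

// keepSocketAlive() returning false makes the Agent destroy the socket
// instead of keeping it for the next request.
class FinAwareAgent extends http.Agent {
  keepSocketAlive(socket: Socket): boolean {
    // If the peer already sent FIN (readableEnded), don't pool the socket.
    if (socket.readableEnded || socket.destroyed || !socket.writable) {
      return false;
    }
    return super.keepSocketAlive(socket);
  }
}

const agent = new FinAwareAgent({ keepAlive: true, maxSockets: 50 });
```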

What do you see instead?

The connection stays in the pool if the FIN wasn't sent back. The next request goes out on the connection. The server forcibly closes the connection with a RESET.

Additional information

I tried a few other frameworks apart from node-fetch and axios (same issue, which makes me think it's a Node core problem), but I can't use them in my code, so I'm not mentioning them.

When I reduced maxSockets from 50 to 20, the issue happened less frequently, which is why I think it is related to activity on those sockets. I switched to the keepaliveagent package, which has a socket TTL feature; it helps, but doesn't solve the problem. Resets still happen (same issue). It seems this issue was reported there and they tried to handle it there (still a problem though). I'm assuming this issue has the same problem I'm facing; they were using keepaliveagent as well.
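For reference, a sketch of that TTL mitigation, assuming the package meant is agentkeepalive (option names are from its README and worth double-checking):

```ts
import Agent from "agentkeepalive";

// Recycle sockets aggressively so they are (hopefully) gone before the
// server's ~20s close fires. This shrinks the window but cannot close it.
const keepaliveAgent = new Agent({
  maxSockets: 50,
  freeSocketTimeout: 15_000, // destroy idle sockets well before the server's 20s close
  socketActiveTTL: 60_000,   // hard cap on any socket's lifetime
});
```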

ziimakc commented 6 months ago

Faced the same issue while just making 2 simple fetch requests one after another. While setting keepalive: false didn't help, await sleep(0) // setTimeout(0) solved the issue.
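The workaround amounts to yielding the event loop between requests so the socket's close can be processed first; a sketch with a hypothetical sleep helper and made-up URLs:

```ts
// Hypothetical helper: yield the event loop for ms milliseconds.
const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

await fetch("http://example.test/one");
await sleep(0); // gives Node a tick to process the server's FIN before reusing the socket
await fetch("http://example.test/two");
```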

Not sure if this is just a coincidence, but if I make the request to the same API over HTTPS I don't get an error.

Lomilar commented 6 months ago

Having this problem specifically when the node process is doing a lot of processing and making HTTP requests to itself -- by analogy -- without a chance to "rest" and take care of all the new packets that came in.

We mitigated it by putting an await new Promise(resolve => setTimeout(resolve, 1)) on the beginning of each HTTP request -- setTimeout(0) wasn't enough for us.

For us, this was manifesting as an EPIPE error. We were using a containerized environment on a Windows box.

For us, this was happening with undici 6.10.1, node 20, and express.

joelin-vyond commented 5 months ago

Hi there, since Node 22 has just been released, it would be great to know whether the issue has been fixed in Node 22, or whether we still have to use Node v16 or the alternative HTTP client undici?

Thanks, Joe Lin

ShogunPanda commented 4 months ago

Hello. Nothing has changed on our side, as this is a race condition inherent in the HTTP spec.

StanislavKharchenko commented 2 months ago

Hello! Any updates? Are there plans to fix this for NodeJS >20?

10bo commented 2 months ago

Until a fix is ready, we have worked around the EPIPE error by adding a retry when that specific error is thrown.
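Roughly, the workaround looks like this (a sketch; where the error code surfaces differs between clients, so the err.cause.code plumbing is an assumption, and blanket retries are only safe for idempotent requests):

```ts
// Retry once on the connection-reuse failures discussed in this thread.
async function fetchWithRetry(
  url: string,
  init?: RequestInit,
  retries = 1,
): Promise<Response> {
  try {
    return await fetch(url, init);
  } catch (err) {
    const code = (err as any)?.cause?.code ?? (err as any)?.code;
    if (retries > 0 && (code === "EPIPE" || code === "ECONNRESET")) {
      return fetchWithRetry(url, init, retries - 1);
    }
    throw err;
  }
}
```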

StanislavKharchenko commented 2 months ago

Why was it closed without any comment? @ronag?

ronag commented 2 months ago

As has been explained. This is a race condition in the http 1.1 spec and there is nothing further we can do here.

ziimakc commented 2 months ago

@ronag then how are we supposed to use nodejs/undici when it randomly closes connections from time to time?

RedYetiDev commented 2 months ago

@ronag given the conversation, I've added the known limitation label, as there's nothing Node.js can do to fix it, but it's known, correct?

Feel free to adjust if needed.

ronag commented 2 months ago

This issue is about the node core http client. Not undici or fetch. Though they will encounter the same issue.

The problem is that the server (unexpectedly for the client, but correctly) closes the keep-alive connection. Please read through the whole issue if you want a better understanding.

rubengmurray commented 2 months ago

What changed? Node 18.13.0 did not have this issue at all for me. Upgrading to Node 20.12.2 has created this issue. So what's different?

Can I get my Node 20.12.2 service to run with the same parameters as Node 18.13.0 so this doesn't happen?

I've read through this issue, and unless I've missed it, that topic has not been touched on?

No fancy config here: http.Server(express)

h0od commented 2 months ago

> What changed? Node 18.13.0 did not have this issue at all for me. Upgrading to Node 20.12.2 has created this issue. So what's different?

The difference is that HTTP keep-alive is turned on by default for the HTTP(S) agents in Node 20. So to go back to the old behavior you can simply turn keep-alive off explicitly.
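For example (a sketch; replacing the global agents affects every http/https request in the process):

```ts
import http from "node:http";
import https from "node:https";

// Restore the pre-Node-20 default: no keep-alive, a fresh connection per request.
http.globalAgent = new http.Agent({ keepAlive: false });
https.globalAgent = new https.Agent({ keepAlive: false });
```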

rubengmurray commented 2 months ago

Thanks @h0od I'll give this a go.

dhedey commented 2 months ago

@ronag I'm not looking to criticise the closing of this ticket (I understand it's a gnarly one and time could be spent better elsewhere - e.g. undici), but I just wanted to enquire a little more about the background to this:

> As has been explained. This is a race condition in the http 1.1 spec and there is nothing further we can do here.

Is this effectively saying it's too hard to improve the behaviour, given the current architecture of Node?

I rarely see an "ECONNRESET" or equivalent error in other HTTP clients, but I see it a lot in Node. I don't think its behaviour fully aligns with the SHOULDs of the HTTP/1.1 spec (copied below), e.g. around handling transport closures and retries of idempotent requests.

Moreover, there are additional issues inside Node itself which cause this issue to be hit much more readily than it seems to be elsewhere, and the change to default to keep-alive in the http client has meant lots of people are hitting this issue who didn't previously:

  • If on first attempted re-use of a socket, it's seen to be already closed at first read, we get an ECONNRESET propagated up to the client, rather than the http client just using a different socket. This feels like an abstraction leak in the connection assignment of the http client.

  • The race condition is worsened because of the single-threaded model in Node slowing down handling of close events, and multiple event loops of delay to update the reusable socket queue.

As a mitigation, you mention in an earlier comment:

> Haven't looked into this in depth but I think this is expected. There is a race condition in the http/1.1 spec for keep alive requests. This can be mostly avoided by applying the keep alive hints that the server provides.

> Undici does that and I would recommend you switch to that.
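As context for readers, those hints are the Keep-Alive response header. A sketch of a plain Node server advertising its timeout (that Node derives the header from keepAliveTimeout is my reading of the docs, worth verifying):

```ts
import http from "node:http";

const server = http.createServer((req, res) => {
  res.end("ok");
});

// Close idle keep-alive sockets after 20s, matching the scenario above.
// Responses should then carry `Keep-Alive: timeout=20`, which hint-aware
// clients (e.g. undici) use to stop reusing the socket in time.
server.keepAliveTimeout = 20_000;
server.listen(3000);
```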

Undici does seem pretty great, and I can see that is where the majority of work is being spent. Do we expect undici to move inside Node and replace the http client?

It looks like Undici has a retry handler which could partly solve this issue; but I imagine it probably still inherits some of the underlying awkwardness from the Node socket layer.
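Wiring it up looks roughly like this (a sketch based on my reading of the undici docs; option names worth verifying):

```ts
import { Agent, RetryAgent, setGlobalDispatcher } from "undici";

// Route all undici/fetch requests through a retrying dispatcher. By default
// it only retries idempotent methods and retriable error codes.
setGlobalDispatcher(new RetryAgent(new Agent(), { maxRetries: 2 }));
```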

I imagine it performs better too, but I haven't tested it against pathological test cases to see whether it assigns an alternative socket if the first attempted socket is detected to be already closed at/before use.

But if we expect undici to replace the node http client, then I can see why it's worth us focusing the attention there.

HTTP Spec Reflections

Section 8.1.4 of the HTTP/1.1 spec, titled "Practical Considerations", states:

> When a client or server wishes to time-out it SHOULD issue a graceful close on the transport connection. Clients and servers SHOULD both constantly watch for the other side of the transport close, and respond to it as appropriate. If a client or server does not detect the other side's close promptly it could cause unnecessary resource drain on the network.

I don't think Node does a great job of responding appropriately to the transport close / FIN from the socket. In many cases, the first awareness that we're reusing a closed socket is when we first attempt to read/write to it on the next request, which propagates all the way up to the user.

I know things are currently very asynchronous, but if we could somehow check the status of the socket before attempting to reuse it -- or detect on failure that we haven't actually sent any of a request down the socket yet -- then we could use a different socket for the request, which would avoid a lot of false-positives, and I think solve a lot of the reports in this thread.
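As a sketch of that idea, using the documented 'free' event on the agent (whether 'end' fires promptly on an idle pooled socket is exactly the open question, so this narrows rather than closes the race):

```ts
import http from "node:http";
import type { Socket } from "node:net";

const agent = new http.Agent({ keepAlive: true, maxSockets: 50 });

// When the agent pools a socket, watch for the server's FIN ('end') and
// destroy the socket so it can't be handed to a later request.
agent.on("free", (socket: Socket) => {
  socket.once("end", () => socket.destroy());
});
```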

> A client, server, or proxy MAY close the transport connection at any time. For example, a client might have started to send a new request at the same time that the server has decided to close the "idle" connection. From the server's point of view, the connection is being closed while it was idle, but from the client's point of view, a request is in progress.

> This means that clients, servers, and proxies MUST be able to recover from asynchronous close events. Client software SHOULD reopen the transport connection and retransmit the aborted sequence of requests without user interaction so long as the request sequence is idempotent (see section 9.1.2).

> Non-idempotent methods or sequences MUST NOT be automatically retried, although user agents MAY offer a human operator the choice of retrying the request(s). Confirmation by user-agent software with semantic understanding of the application MAY substitute for user confirmation. The automatic retry SHOULD NOT be repeated if the second sequence of requests fails.

Which I read as roughly saying a client should retry GET requests if the underlying socket has issues. This is one of three recommendations I previously put forward.

ronag commented 2 months ago

> @ronag I'm not looking to criticise the closing of this ticket (I understand it's a gnarly one and time could be spent better elsewhere - e.g. undici), but I just wanted to enquire a little more about the background to this:

> > As has been explained. This is a race condition in the http 1.1 spec and there is nothing further we can do here.

> Is this effectively saying it's too hard to improve the behaviour, given the current architecture of Node?

Has nothing to do with Node. It's a general problem.

> I rarely see an "ECONNRESET" or equivalent error in other HTTP clients, but I see it a lot in Node. I don't think its behaviour fully aligns with the SHOULDs of the HTTP/1.1 spec (copied below), e.g. around handling transport closures and retries of idempotent requests.

> Moreover, there are additional issues inside Node itself which cause this issue to be hit much more readily than it seems to be elsewhere, and the change to default to keep-alive in the http client has meant lots of people are hitting this issue who didn't previously:

> • If on first attempted re-use of a socket, it's seen to be already closed at first read, we get an ECONNRESET propagated up to the client, rather than the http client just using a different socket. This feels like an abstraction leak in the connection assignment of the http client.

Yea, undici does this for idempotent requests. Not something we will improve with the core client. PR welcome if someone wants to do it.

> • The race condition is worsened because of the single-threaded model in Node slowing down handling of close events, and multiple event loops of delay to update the reusable socket queue.

Yes, that's a possible improvement, but I'm sceptical of its practical significance.

> As a mitigation, you mention in an earlier comment:

> > Haven't looked into this in depth but I think this is expected. There is a race condition in the http/1.1 spec for keep alive requests. This can be mostly avoided by applying the keep alive hints that the server provides. Undici does that and I would recommend you switch to that.

> Undici does seem pretty great, and I can see that is where the majority of work is being spent. Do we expect undici to move inside Node and replace the http client?

That's at least my long term hope. It's difficult to remove the existing client without breaking the ecosystem.

> It looks like Undici has a retry handler which could partly solve this issue; but I imagine it probably still inherits some of the underlying awkwardness from the Node socket layer.

> I imagine it performs better too, but I haven't tested it against pathological test cases to see whether it assigns an alternative socket if the first attempted socket is detected to be already closed at/before use.

It will retry automatically for idempotent requests. Not for e.g. POST.

> But if we expect undici to replace the node http client, then I can see why it's worth us focusing the attention there.

> HTTP Spec Reflections

> Section 8.1.4 of the HTTP/1.1 spec, titled "Practical Considerations", states:

> > When a client or server wishes to time-out it SHOULD issue a graceful close on the transport connection. Clients and servers SHOULD both constantly watch for the other side of the transport close, and respond to it as appropriate. If a client or server does not detect the other side's close promptly it could cause unnecessary resource drain on the network.

> I don't think Node does a great job of responding appropriately to the transport close / FIN from the socket. In many cases, the first awareness that we're reusing a closed socket is when we first attempt to read/write to it on the next request, which propagates all the way up to the user.

Possibly something to improve, but again I'm sceptical of its practical significance.

> I know things are currently very asynchronous, but if we could somehow check the status of the socket before attempting to reuse it -- or detect on failure that we haven't actually sent any of a request down the socket yet -- then we could use a different socket for the request, which would avoid a lot of false-positives, and I think solve a lot of the reports in this thread.

Possibly something to improve, but again I'm sceptical of its practical significance.

> A client, server, or proxy MAY close the transport connection at any time. For example, a client might have started to send a new request at the same time that the server has decided to close the "idle" connection. From the server's point of view, the connection is being closed while it was idle, but from the client's point of view, a request is in progress. This means that clients, servers, and proxies MUST be able to recover from asynchronous close events. Client software SHOULD reopen the transport connection and retransmit the aborted sequence of requests without user interaction so long as the request sequence is idempotent (see section 9.1.2).

Undici does this.

> Non-idempotent methods or sequences MUST NOT be automatically retried, although user agents MAY offer a human operator the choice of retrying the request(s). Confirmation by user-agent software with semantic understanding of the application MAY substitute for user confirmation. The automatic retry SHOULD NOT be repeated if the second sequence of requests fails.

Undici does this.

> Which I read as roughly saying a client should retry GET requests if the underlying socket has issues. This is one of three recommendations I previously put forward.

Undici does this.

dhedey commented 2 months ago

@ronag many thanks for the detailed reply 👍.

It sounds like undici is better architected and, in general, we (as users in the Node ecosystem) should be transitioning to it from the built-in http client.

Just regarding the "sceptical of its practical significance" bits: I have a 100% repro of this issue using the Node http client to send two consecutive awaited POST requests with Connection: close against a localhost Axum server (which causes the server to send an HTTP response with no Connection: close header and then an immediate FIN). You could argue it's dodgy server behaviour, but other clients I've tested are fine with it, and the server did send a FIN, which should be "responded to as appropriate".
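A sketch of that repro (port, path, and payloads are made up):

```ts
import http from "node:http";

// Two sequential POSTs against a server that replies and then immediately FINs.
function post(body: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const req = http.request(
      {
        host: "localhost",
        port: 3000,
        path: "/",
        method: "POST",
        headers: { Connection: "close" },
      },
      (res) => {
        res.resume();
        res.on("end", resolve);
      },
    );
    req.on("error", reject); // in the scenario above, the second call rejects with ECONNRESET
    req.end(body);
  });
}

await post("one");
await post("two");
```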

I'm not suggesting we should fix it in the node client though. I guess the next step would be to try undici and see if I get the same issue; if so, we can raise a ticket there to see if it can seamlessly handle socket-was-closed-before-anything-was-sent, even for e.g. POST requests.

ronag commented 2 months ago

> Just regarding the "sceptical of its practical significance" bits

We can reduce the time span where the data race can occur, but we can't solve the root problem. IMHO it's better to properly retry so it works in the general case than to micro-optimize to reduce the likelihood by a little bit (though of course it doesn't hurt).

Regarding this issue what we maybe could consider is to disable keep-alive by default again and refer to undici for high performance use cases. @mcollina wdyt?

rubengmurray commented 2 months ago

> What changed? Node 18.13.0 did not have this issue at all for me. Upgrading to Node 20.12.2 has created this issue. So what's different?

> The difference is that HTTP keep-alive is turned on by default for the HTTP(S) agents in Node 20. So to go back to the old behavior you can simply turn keep-alive off explicitly.

This doesn't look like it worked for us. We created our own agents with keepAlive: false for http & https and passed them to axios, and we're still getting the same "socket hang up" error.

import http from 'http';
import https from 'https';
import axios from 'axios';

// Agents with keep-alive explicitly disabled.
export const httpAgent = new http.Agent({ keepAlive: false });
export const httpsAgent = new https.Agent({ keepAlive: false });

// ....

axios
    .request<T>({
      ...config,
      httpAgent,
      httpsAgent,
      method: 'post',
      url,
      data,
    })

It looks like a lot of the recent talk is regarding undici so I think we will try this for outbound requests instead.

rubengmurray commented 2 months ago

Well, unfortunately it seems using undici breaks our integration tests, as nock doesn't work with it. There is an issue for this here: https://github.com/nock/nock/issues/2183, which says support has been added in the beta, but we've found that not to work.

We'll now be downgrading.

adijesori commented 2 months ago

Is it possible to just use keep-alive = false by default, as it was originally before Node 20?

ziimakc commented 2 months ago

Completely solved the issue for my use case by installing undici and using it instead of native fetch:

import {
    fetch as undiciFetch,
    type RequestInfo as UndiciRequestInfo,
    type RequestInit as UndiciRequestInit,
    type Response as UndiciResponse,
} from "undici";

// works
async function fetchWrapperUndici(
    input: UndiciRequestInfo,
    init?: UndiciRequestInit,
): Promise<UndiciResponse> {
    return undiciFetch(input, init);
}

// ECONNRESET all the time
async function fetchWrapperNative(
    input: RequestInfo | URL,
    init?: RequestInit,
): Promise<Response> {
    return fetch(input, init);
}