nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org

Socket Hang Up / ECONNRESET / ECONNREFUSED on consecutive http requests (>= 100) and temporary FIX #55330

Closed IbrahimTanyalcin closed 1 week ago

IbrahimTanyalcin commented 1 week ago

Version

v20.18.0

Platform

WSL2/ Ubuntu 20.04 / amd64

Subsystem

node:http

What steps will reproduce the bug?

Using node:http, make consecutive

http.get(`http://${serverDetails.host}:${serverDetails.port}`, (res) => ...)

calls to a standard Express.js server.
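For illustration, a minimal sketch of that kind of request burst (host, port and client count here are placeholders standing in for my actual serverDetails and nClients):

const http = require('node:http');

// placeholders; the real suite reads these from serverDetails and uses 100 clients
const serverDetails = { host: 'localhost', port: 3000 };
const nClients = 100;

for (let i = 0; i < nClients; i++) {
    http.get(`http://${serverDetails.host}:${serverDetails.port}`, (res) => {
        res.resume(); // drain the body so the connection is released cleanly
    });
}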

How often does it reproduce? Is there a required condition?

Randomly (roughly 1 out of every 3 runs), the Jest tests that stress test the server will fail with ECONNRESET / ECONNREFUSED.

What is the expected behavior? Why is that the expected behavior?

Like in Node 18, Node 16 or before, node should be able to handle concurrent requests and not hand over to the client a socket that has been closed.

What do you see instead?

If you attach error handlers to the request object returned from http.get and log the error, you will end up with one of the errors named in the title: socket hang up, ECONNRESET, or ECONNREFUSED.
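For example, a handler attached like this (a sketch with a placeholder URL, not the exact test code) surfaces those codes:

const http = require('node:http');

// placeholder URL; the tests hit the local express server from serverDetails
const req = http.get('http://localhost:3000', (res) => res.resume());
req.on('error', (err) => {
    // one of the errors from the title: ECONNRESET, ECONNREFUSED, or "socket hang up"
    console.log(err.code, err.message);
});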

Additional information

I recently updated my application from Node:16 to 20, along with the dependencies, and all my integration tests were passing except the server stress test, which makes 100 requests at the same time and feeds data over Server-Sent Events. I am using nvm, so I can quickly test different versions.

I initially thought this was because one of these packages had been updated:

...
    "express": "^4.21.0",
    "express-session": "^1.18.0",
    "memcached": "^2.2.2",
    "nodemon": "^3.1.7",
    ....
  },
  "devDependencies": {
    "jest": "^29.7.0"
  }
...

but switching back to Node:16 with NEW or OLD versions of the packages consistently passed the tests, whereas Node:20 randomly failed the stress test with either NEW or OLD versions of the packages. So I searched the issues here and around the net and found that many people have similar issues where the server hands the client/agent a TCP socket that has already been closed. Here are the other threads:

https://github.com/ChainSafe/lodestar/issues/5765
https://github.com/axios/axios/issues/5929
https://github.com/nodejs/node/issues/47130
https://github.com/node-fetch/node-fetch/issues/1735
https://github.com/nodejs/node/issues/39810
https://github.com/nodejs/node/issues/43456

My original test, which consistently passes with Node:16, is as follows (the code before and after is redacted):

test(`are we able to handle ${nClients} clients?`, async () => {
        expect.assertions(2);
        return Promise.allSettled(
            [...Array(nClients)]
            .map(d => simulClient(serverDetails))
        ).then(async (results) => {
            expect(results.map(d => d.value?.statusCode))
            .toEqual([...Array(nClients)].map(d => 200));
            const [{value:lastCli}] = results.slice(-1);
            let rawData = '';
            lastCli.setEncoding('utf8');
            lastCli.on('data', chunk => rawData += chunk)
            //check if we really have n clients live
            return await until(function(){
                return this.test(rawData)
            },{
                thisArg: new RegExp(
                    `data:\\s*\\x22?size\\x22?\\s*:\\s*${nClients}`,
                    "mi"
                ), 
                interval: 100
            })
        }).then(val => expect(val).toBe(true))
    }, serverDetails.timeout);

Above, nClients is 100 and serverDetails is:

const serverDetails = {
        started: true,
        host: "localhost",
        port: 3000,
        oProcess: nodeProcess,
        timeout: 20000,
        ...
    }

The simulClient function mimics a user: it requests the base URL, gets the cookie, and starts an event stream (a live response that is never closed) over which the server sends data via Server-Sent Events, {size: //number of clients currently receiving the feed}:

 const http = require('node:http');

/**
@description simulates a client navigating to '/',
saving a session cookie and opening
a live TCP connection for server-sent events
@param {Object} serverDetails an object with the server's connection details
@param {string} serverDetails.host host
@param {string|number} serverDetails.port port
@param {number} serverDetails.timeout timeout in milliseconds
@returns {Promise<http.IncomingMessage>} an `http.IncomingMessage` type from
Node that has an extra `__cliCookie` array as property
@example
const cliResponse = await simulClient(serverDetails);
*/
const simulClient = (serverDetails) => new Promise((r, j) => {
    let timeout = setTimeout(
        ()=> j(new Error (
            "Client request simulation timedout"
        )),
        serverDetails?.timeout ?? 5000
    );
    try {
        http.get(`http://${serverDetails.host}:${serverDetails.port}`, (res) => {
            if (res.statusCode !== 200) {
                throw new Error("Base route non 200 status")
            }
            const 
                headers = new Map(Object.entries(res.headers)),
                cookie = headers.get("set-cookie")?.map(d => d.split(';')[0]);
            http.get(
                `http://${serverDetails.host}:${serverDetails.port}/estream/subscribe`, 
                {headers: {cookie}},
                (res) => {
                    if (res.statusCode !== 200) {
                        throw new Error("Event stream non 200 status")
                    } else if (
                        !res.headers
                        ?.["content-type"]
                        ?.toString()
                        ?.toLowerCase()
                        ?.includes("text/event-stream")
                    ) {
                        throw new Error("Event stream has wrong content-type")
                    }
                    res.__cliCookie = cookie;
                    clearTimeout(timeout);
                    r(res)
                }
            )
        })
    } catch (err) {
        j(err)
    }
});
module.exports = simulClient;

I know the throw statements inside the promises won't do anything; please ignore them. One thing that I want to address: in the other threads I pasted above, people seem to think keepAlive: false was to be set on the http.Agent, but I think this is not the case. I think the default Node 16 behavior might be:

> keepAlive [<boolean>](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Boolean_type) If set to true, it enables keep-alive functionality on the socket immediately after a new incoming connection is received, similarly on what is done in socket.setKeepAlive([enable][, initialDelay]). Default: false.

so this is set on http.createServer and not on the agent. Although the Node:20.18 docs say the default is still false, I think this might not be the case in practice.
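To make the distinction concrete, here is a sketch of the two separate knobs (the handler, port, and URL are placeholders); the option quoted above lives on the server, while the agent option controls client-side connection reuse:

const http = require('node:http');

// server side: the keepAlive option quoted above, i.e. TCP keep-alive
// on each newly accepted socket (default false according to the docs)
const server = http.createServer({ keepAlive: false }, (req, res) => res.end('ok'));
server.listen(3000);

// client side: http.Agent keepAlive decides whether sockets are pooled
// and reused across requests
const agent = new http.Agent({ keepAlive: false });
http.get('http://localhost:3000', { agent }, (res) => res.resume());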

'FIX'

My fix for this behavior was as follows:

For the stress test:

test(`are we able to handle ${nClients} clients?`, async () => {
        expect.assertions(2);
        return Promise.allSettled(
            [...Array(nClients)]
            .map((d, i) => simulClient({...serverDetails, delay: i * 10}))
        ).then(async (results) => {
            expect(results.map(d => d.value?.statusCode))
            .toEqual([...Array(nClients)].map(d => 200));
            const [{value:lastCli}] = results.slice(-1);
            let rawData = '';
            lastCli.setEncoding('utf8');
            lastCli.on('data', chunk => rawData += chunk)
            //check if we really have n clients live
            return await until(function(){
                return this.test(rawData)
            },{
                thisArg: new RegExp(
                    `data:\\s*\\x22?size\\x22?\\s*:\\s*${nClients}`,
                    "mi"
                ), 
                interval: 100
            })
        }).then(val => expect(val).toBe(true))
    }, serverDetails.timeout);

and simulClient became:

const simulClient = (serverDetails) => new Promise(async (r, j) => {
    try {
        let retry = serverDetails?.retry ?? 3;
        if (!retry){
            throw new Error("Maximum amount of retries were reached")
        }
        let delay = serverDetails?.delay ?? 0,
            iteration = serverDetails?.iteration ?? 0,
            retryPenalty = serverDetails?.retryPenalty ?? 100;
        await new Promise((r) => setTimeout(r, delay));
        await new Promise((r) => setTimeout(r, iteration * retryPenalty));
        let timeout = setTimeout(
                ()=> j(new Error (
                    "Client request simulation timedout"
                )),
                serverDetails?.timeout ?? 5000
            ),
            agent = serverDetails?.agent ?? new http.Agent({keepAlive: false, timeout: serverDetails?.timeout ?? 100});
        const handleRetry = (err) => {
            clearTimeout(timeout); // Clear timeout before retrying
            console.log(`Retrying due to error: ${err.message}. Attempts left: ${retry - 1}`);
            // Retry with decremented retry count and backoff
            r(simulClient({ ...serverDetails, agent, iteration: ++iteration, retry: --retry, delay: 0}));
        };
        http.get(`http://${serverDetails.host}:${serverDetails.port}`, {agent}, (res) => {
            if (res.statusCode !== 200) {
                return handleRetry(new Error("Base route non 200 status"));
            }
            const 
                headers = new Map(Object.entries(res.headers)),
                cookie = headers.get("set-cookie")?.map(d => d.split(';')[0]);
            http.get(
                `http://${serverDetails.host}:${serverDetails.port}/estream/subscribe`, 
                {headers: {cookie}, agent},
                (res) => {
                    if (res.statusCode !== 200) {
                        return handleRetry(new Error("Event stream non 200 status"));
                    } else if (
                        !res.headers
                        ?.["content-type"]
                        ?.toString()
                        ?.toLowerCase()
                        ?.includes("text/event-stream")
                    ) {
                        return handleRetry(new Error("Event stream has wrong content-type"));
                    }
                    res.__cliCookie = cookie;
                    clearTimeout(timeout);
                    r(res)
                }
            ).on("error", (err) => {
                handleRetry(err)
            })
        }).on("error", (err) => {
            handleRetry(err)
        })
    } catch (err) {
        j(err)
    }
});

so the above combines 3 things:

1. a staggered start for each simulated client (delay: i * 10 in the test),
2. an explicit http.Agent with keepAlive: false passed to every request, and
3. a retry with an increasing backoff penalty (iteration * retryPenalty) whenever a request errors out.

With all 3 of these, running the test on a cold start or many times one after another consistently passed. Removing any 1 of them randomly brought the socket hang up error back.

IMO this breaks user space. I understand people pointing out limitations and race conditions in HTTP/1.1, but Node did not suddenly remember the spec in Node:20. This behavior seems to break user space, and I think it is about more than just setting keepAlive to false on http.Server. Something in Node's event loop handling must have changed since Node 19 for this type of behavior to emerge.

If keepAlive on http.Server is the true culprit, shouldn't it be set back to what it was in Node:16? What can be done about it so that user space is prioritized over spec compliance?

RedYetiDev commented 1 week ago

This might be an issue with the server. Can you reproduce when fetching something else? (Try nodejs.org)
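Something along these lines (a rough sketch) against a public host:

const https = require('node:https');

https.get('https://nodejs.org', (res) => {
    res.resume();
    console.log('status', res.statusCode);
}).on('error', (err) => console.log(err.code, err.message));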

IbrahimTanyalcin commented 1 week ago

@RedYetiDev if it is an issue with the server, why are Node:16 and Node:18 consistently passing the tests? I have attached at least 5 threads related to this issue. Did they all have an issue with the server? @anonrig Did anything change in Node's event loop handling since Node:18?

RedYetiDev commented 1 week ago

> Like in Node 18, Node 16 or before, node should be able to handle concurrent requests and not hand over to the client a socket that has been closed.

Oh, I didn't see that statement, can you try and reproduce in v22?

IbrahimTanyalcin commented 1 week ago

> > Like in Node 18, Node 16 or before, node should be able to handle concurrent requests and not hand over to the client a socket that has been closed.
>
> Oh, I didn't see that statement, can you try and reproduce in v22?

I will try with 22, will let you know.

IbrahimTanyalcin commented 1 week ago

@RedYetiDev I tested with 22 and I get the same symptoms. HOWEVER, I am starting to think this is because of memcached connections being closed too frequently and Node 20/22 hitting a dead socket. I can see that Node 18 and before processes the socket connections in bursts of 50, whereas 20/22 is more transient and continuous, which could increase the chances of hitting a dead socket. So to test it out, I commented out the cache.end() call to memcached (which is not necessary, since memcached closes idle sockets after a timeout) and now 20/22 is passing the stress test.

I am not sure yet; I am now inspecting with NODE_DEBUG=NET and will report back once I have more conviction. The remedy seems to be not to call cache.end() and to let the memcached idle timeout clear the connections out on its own. Will report back.

IbrahimTanyalcin commented 1 week ago

Closing this issue as it is only indirectly related to what has changed since Node 16. I was able to debug what is going on with NODE_DEBUG=NET; if someone is interested I can attach the logs. Here is the summary of what was going on: Node 18 and earlier process the test's socket connections in bursts of 50, while Node 20/22 handle them in a more continuous, spread-out fashion. With cache.end() being called on the currently active pool of memcached connections, that timing difference increases the chance that a request gets handed a connection whose socket has already been closed.

So as a result, the errors come from hitting those dead memcached sockets rather than from the HTTP server itself.

I do not know if these differences are due to changes in the event loop, but they nonetheless increase your chances of hitting a dead socket. So the FIX is to NOT call cache.end() on the currently active pool of memcached connections and to let them expire on their own. I am leaving these notes here in case anyone bumps into the same problem!
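For anyone running into the same thing, the change was along these lines (a sketch; the cache instance, server address, and cleanup hook are placeholders for whatever your app actually uses):

const Memcached = require('memcached');

// placeholder connection string
const cache = new Memcached('localhost:11211');

// before: tearing the pool down explicitly during test cleanup / shutdown
// cache.end();

// after: do nothing here and let memcached's idle timeout close the
// connections on its own, so a request never lands on a socket that
// was just closed underneath it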