
requestTimeout and long running uploads #46574

Open julien-f opened 1 year ago

julien-f commented 1 year ago

Version

v18.14.0

Platform

Linux prometheus 6.1.9-200.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Feb 2 00:21:48 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Subsystem

http

What steps will reproduce the bug?

Create an HTTP server with a big requestTimeout (a day in this example):

require('node:http').createServer({
  requestTimeout: 24 * 60 * 60 * 1e3, // a day
}, (req, res) => {
  console.log(req.method, req.url);

  req.resume().on('end', () => {
    res.end('Ok');
  }).on('error', console.warn);
}).listen(8080);

Do a long running request to it:

$ curl -T /dev/urandom http://localhost:8080 -v
*   Trying 127.0.0.1:8080...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> PUT /urandom HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.85.0
> Accept: */*
> Transfer-Encoding: chunked
> Expect: 100-continue
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 100 Continue
* Mark bundle as not supporting multiuse
< HTTP/1.1 408 Request Timeout
< Connection: close
< 
* Send failure: Connection reset by peer
* Closing connection 0
curl: (55) Send failure: Connection reset by peer

How often does it reproduce? Is there a required condition?

It happens every time, and I haven't been able to pinpoint the exact value at which it becomes problematic.

What is the expected behavior?

It should interrupt the request after 24 hours.

What do you see instead?

It interrupts the request after ~30 seconds.

Additional information

I can work around this by disabling the timeout completely (requestTimeout: 0), but setting a value, even a high one, seemed preferable.
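
For reference, a minimal sketch of that workaround (the same handler as above, with the request timeout disabled entirely):

require('node:http').createServer({
  requestTimeout: 0, // 0 disables the request timeout
}, (req, res) => {
  console.log(req.method, req.url);

  req.resume().on('end', () => {
    res.end('Ok');
  }).on('error', console.warn);
}).listen(8080);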

Any other suggestions on how to handle long running uploads?

theanarkh commented 1 year ago

Try this.

require('node:http').createServer({
  requestTimeout: 24 * 60 * 60 * 1e3, // a day
  connectionsCheckingInterval: 24 * 60 * 60 * 1e3
}, (req, res) => {
  console.log(req.method, req.url);

  req.resume().on('end', () => {
    res.end('Ok');
  }).on('error', console.warn);
}).listen(8080);
julien-f commented 1 year ago

@theanarkh Why? What do you expect to see with this?

From what I understand, there is still an issue in how requestTimeout is handled. Even if connectionsCheckingInterval works, it will check all connections every 24 hours, and that does not guarantee that a specific connection can stay open for 24 hours.
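
An illustrative timing sketch (this assumes the expiry check only runs when the checking interval fires; it is not taken from the Node.js source):

// Both values in milliseconds, as in the server options above.
const requestTimeout = 24 * 60 * 60 * 1e3;               // 24 h
const connectionsCheckingInterval = 24 * 60 * 60 * 1e3;  // 24 h

// Earliest a request can be flagged: once it is older than requestTimeout.
// Latest: the first interval tick after that point.
const latestTermination = requestTimeout + connectionsCheckingInterval;
console.log(latestTermination / (60 * 60 * 1e3), 'hours'); // up to ~48 hours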

julien-f commented 1 year ago

@ShogunPanda Any opinions on this?

ShogunPanda commented 1 year ago

Inside the connection handling algorithm I always use uv_hrtime to track connections.

@nodejs/libuv Do you think this might explain it?

bnoordhuis commented 1 year ago

@ShogunPanda explain what exactly?

ShogunPanda commented 1 year ago

@bnoordhuis I re-read it and probably got it wrong the first time. I'll try to reproduce it and see what happens. I'll get back here shortly.

StephenKelly94 commented 1 year ago

I have been experiencing the same issue. I set requestTimeout to 2 hours and was seeing the same result (connection reset by peer after ~30 seconds).

I noticed, though, that after 2 hours of system uptime it started to work correctly. After that I set requestTimeout to 5 minutes and, sure enough, it started working once the system had been up for 5 minutes. So perhaps it's related to system uptime?

So I opted to set the requestTimeout to 0, and this fixes the issue for now.

ShogunPanda commented 1 year ago

Unfortunately I cannot reproduce this. I tried both on a Mac and on Linux. Given that you mentioned system uptime and time of day, can you provide more info so I can take a look?

julien-f commented 1 year ago

I don't see anything related to the uptime of my machine.

@ShogunPanda What behavior do you get with my test script above? On my system (and others like Ubuntu), curl gets an ECONNRESET before 30 seconds.

ShogunPanda commented 1 year ago

@julien-f In my case the server keeps receiving the request without issues.

StephenKelly94 commented 1 year ago

So I altered @julien-f's script to match more closely how we encountered it.

Node Version: 18.12.1

const http = require('node:http');
const srv = http.createServer((req, res) => {
    console.log(req.method, req.url);

    req.resume().on('end', () => {
        res.end('Ok');
    }).on('error', console.warn);
}).listen(10080);
srv.requestTimeout = 10 * 60 * 1000; // 10 minute timeout

I get the following output before 10 minutes uptime:

$ uptime
07:44:46 up 5 min,  0 users,  load average: 29.40, 19.92, 8.74
$ curl -T /dev/urandom http://localhost:10080 | cat
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  211M    0     0    0  211M      0  20.0M --:--:--  0:00:10 --:--:-- 20.6M
curl: (55) Send failure: Broken pipe
$ curl -T /dev/urandom http://localhost:10080 | cat
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  256M    0     0    0  256M      0  20.8M --:--:--  0:00:12 --:--:-- 21.3M
curl: (55) Send failure: Broken pipe

After 10 minutes of uptime it looks like this:

$ uptime
 07:56:26 up 16 min,  0 users,  load average: 2.90, 6.76, 8.16
$ curl -T /dev/urandom http://localhost:10080 | cat
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13.3G    0     0    0 13.3G      0  21.9M --:--:--  0:10:25 --:--:-- 22.3M
curl: (55) Send failure: Broken pipe
$ curl -T /dev/urandom http://localhost:10080 | cat
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12.8G    0     0    0 12.8G      0  21.9M --:--:--  0:10:01 --:--:-- 21.2M
curl: (55) Send failure: Broken pipe

I'm not sure if I'm missing anything or if setting the requestTimeout after the server is instantiated makes a difference.

nsgundy commented 1 year ago

Running into this as well, it seems. It's an ARMv7 architecture. So far I have only reproduced it manually by uploading a 230 MB file, which takes about 10 seconds to complete and is handled by express and multer (certainly not a minimal reproduction setup). The timeout can happen anywhere within the transfer. This only seems to happen when the system uptime is between 2 and 5 minutes. Running Node 18.15.0; requestTimeout is unaltered at 300 seconds.

sheepa commented 1 year ago

Getting this issue as well. There seems to be some upper limit to requestTimeout beyond which it stops working, and connections get timed out whenever connectionsCheckingInterval fires.

EDIT: Can't say for sure what's causing this, or under what circumstances, but requestTimeout is definitely bugged.

das7pad commented 1 year ago

Hello,

We are seeing a lot of unexpected 408 status responses as well, most notably in our CI environment running Node.js 18.17.1 in Google Cloud Build on n1-highcpu-32 VMs (x86_64). We see 408s for pretty much all requests, even very simple GET requests. The test HTTP servers use default options/timeouts. Our busiest test pipeline saw at least one unexpected 408 request timeout in about 50% of builds.

We traced this down to a bug in the logic behind connectionsCheckingInterval: the connection cleanup routine uses uv_hrtime() readings to track the last activity per connection. It uses uv_hrtime() - requestTimeout * 1e6 (and, analogously, uv_hrtime() - headersTimeout * 1e6) as the staleness cutoff, stored in an unsigned 64-bit integer. Per the docs, uv_hrtime() returns arbitrary but monotonic clock readings. https://github.com/nodejs/node/blob/c9c958e188af5c2a64ff945dbd2604f19b53004e/deps/uv/docs/src/misc.rst?plain=1#L563-L572 By definition, uv_hrtime() readings may be smaller than the configured request or header timeouts; in that case the subtractions overflow and flag all connections as stale. I'll add a few example readings below.

The default request timeout is `300000000000` (300 s in nanoseconds), so `242653195462 - 300000000000 = -57346804538`, or, as a uint64, `18446744016362747078`. Example uv_hrtime() readings:

- 239934240956
- 233476307396
- 233536410987
- 233606060330
- 238853987548
- 239820072736
- 240095063948
- 240250386896
- 240297219170
- 240032647966
- 238283618445
- 242653195462
- 221756831153
- 233199677409
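
A quick BigInt sketch of that wraparound (illustrative only; the real code does this with unsigned 64-bit integers):

const UINT64 = 2n ** 64n;

const now = 242653195462n;               // example uv_hrtime() reading above, in ns
const requestTimeoutNs = 300000000000n;  // default 300 s expressed in ns

// Signed arithmetic gives a negative cutoff...
console.log(now - requestTimeoutNs);                     // -57346804538n

// ...but stored as a uint64 it wraps around to a huge value, so every
// connection's last activity timestamp looks older than the cutoff.
console.log((now - requestTimeoutNs + UINT64) % UINT64); // 18446744016362747078n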

The overflow bug has been fixed on the main and 20.x branches already: https://github.com/nodejs/node/pull/48291. The patch applies cleanly to 18.17.1 and our CI pipeline is happy again! WDYT about backporting the fix to Node.js 18 via 18.x-staging? I'm happy to open a PR.

Also, WDYT about updating the comment in the above patch to explain that this affects ALL environments, not just IoT/embedded devices? Something like: s/On IoT or embedded devices the uv_hrtime() may return the timestamp/By definition, uv_hrtime() may return a timestamp/? Again, I'm happy to open a PR for main, include the comment tweak in the backport PR for 18.x-staging, and open another one for 20.x-staging (if needed).

sheepa commented 1 year ago

The #48291 fix will set the timeout to infinite if the hrtime is too low? If a shorter timeout is requested, is that not unexpected?

das7pad commented 1 year ago

You can think of uv_hrtime() as a counter that is incremented for every nanosecond of program runtime. The counter is initialized with an arbitrary number x, picked from 0 < x < Date.now()*1e6. For two consecutive calls of uv_hrtime() we get values y and z with x <= y <= z (the counter only increments, so we can only ever observe the same value or a greater one). Here y is our last_message_start_, assigned when a new message starts, and z is now, as read by the connection expiry check. If uv_hrtime() returns now < timeout, we can deduce that last_message_start_ < timeout as well, via x <= y <= z, i.e. x <= last_message_start_ <= now. In other words, while the counter has not yet been incremented timeout times, no earlier reading can have exceeded timeout, so subtracting timeout from now would yield zero or a negative number (i.e. a point at or before program start). That in turn means no connection can have been stale for more than timeout nanoseconds yet, so we do not need to check any connections for staleness with respect to this timeout. We flag a timeout as "do not check" by setting it to zero.
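
A rough sketch of that guard (illustrative only, not the actual patch from #48291; the helper name staleCutoff is made up):

// All values are in nanoseconds, like uv_hrtime() readings.
function staleCutoff(now, timeout) {
  if (now < timeout) {
    // The monotonic clock has not yet advanced past `timeout`, so no
    // connection can possibly be older than `timeout`: skip this check.
    return 0; // 0 means "do not flag anything as stale"
  }
  // Normal cutoff, no underflow: connections whose last_message_start_
  // is <= cutoff are considered stale.
  return now - timeout;
}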

das7pad commented 1 year ago

https://github.com/nodejs/node/pull/48291 has been backported to Node 18.18 via https://github.com/nodejs/node/commit/977e9a38b4709a2bf81dbf1f7e3feff66c979c40. We have now been able to revert our workaround.