julien-f opened this issue 1 year ago
Try this.
require('node:http').createServer({
  requestTimeout: 24 * 60 * 60 * 1e3, // a day
  connectionsCheckingInterval: 24 * 60 * 60 * 1e3
}, (req, res) => {
  console.log(req.method, req.url);
  req.resume().on('end', () => {
    res.end('Ok');
  }).on('error', console.warn);
}).listen(8080);
@theanarkh Why? What do you expect to see with this?
From what I understand, there is still an issue in how requestTimeout is handled. Even if connectionsCheckingInterval works, it will only check all connections every 24 hours, which does not guarantee that a specific connection can stay open for 24 hours.
@ShogunPanda Any opinions on this?
Inside the connection handling algorithm I always use uv_hrtime to track connections.
@nodejs/libuv Do you think this might explain it?
@ShogunPanda explain what exactly?
@bnoordhuis I re-read it and probably got it wrong. I'll try to reproduce and see what happens. I'll get back here shortly.
I have been experiencing the same issue. I set requestTimeout to 2 hours and was seeing the same result (connection reset by peer after ~30 seconds).
I noticed, though, that after 2 hours of system uptime it started to work correctly. After this I set requestTimeout to 5 minutes and, sure enough, it started working after 5 minutes of uptime. So perhaps it's related to system uptime?
So I opted to set requestTimeout to 0, and this fixes the issue for now.
Unfortunately I cannot reproduce this. I tried both on a Mac and on Linux. Given that you mention system uptime and time of day, can you provide more info so I can take a look?
I don't see anything related to the uptime of my machine.
@ShogunPanda What behavior do you get with my test script above? On my system (and others like Ubuntu), curl gets an ECONNRESET before 30 seconds.
@julien-f In my case the server keeps receiving the request without issues.
So I altered @julien-f's script to match more closely how we have encountered it.
Node Version: 18.12.1
const http = require('node:http');
const srv = http.createServer((req, res) => {
  console.log(req.method, req.url);
  req.resume().on('end', () => {
    res.end('Ok');
  }).on('error', console.warn);
}).listen(10080);
srv.requestTimeout = 10 * 60 * 1000; // 10 minute timeout
I get the following output before 10 minutes of uptime:
$ uptime
07:44:46 up 5 min, 0 users, load average: 29.40, 19.92, 8.74
$ curl -T /dev/urandom http://localhost:10080 | cat
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 211M 0 0 0 211M 0 20.0M --:--:-- 0:00:10 --:--:-- 20.6M
curl: (55) Send failure: Broken pipe
$ curl -T /dev/urandom http://localhost:10080 | cat
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 256M 0 0 0 256M 0 20.8M --:--:-- 0:00:12 --:--:-- 21.3M
curl: (55) Send failure: Broken pipe
After 10 minutes of uptime it looks like this:
$ uptime
07:56:26 up 16 min, 0 users, load average: 2.90, 6.76, 8.16
$ curl -T /dev/urandom http://localhost:10080 | cat
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 13.3G 0 0 0 13.3G 0 21.9M --:--:-- 0:10:25 --:--:-- 22.3M
curl: (55) Send failure: Broken pipe
$ curl -T /dev/urandom http://localhost:10080 | cat
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 12.8G 0 0 0 12.8G 0 21.9M --:--:-- 0:10:01 --:--:-- 21.2M
curl: (55) Send failure: Broken pipe
I'm not sure if I'm missing anything or if setting the requestTimeout after the server is instantiated makes a difference.
Running into this as well, it seems. It's an ARMv7 architecture. So far I have only manually reproduced it by uploading a 230 MB file, which takes about 10 seconds to complete, handled by Express and Multer (certainly not a minimal reproduction setup). The timeout can happen anywhere within the transfer. This only seems to happen when the uptime of the system is between 2 and 5 minutes. Running Node 18.15.0. requestTimeout is unaltered at 300 seconds.
Getting this issue as well. There seems to be some upper limit to requestTimeout beyond which it stops working, and any connection gets timed out whenever connectionsCheckingInterval fires.
EDIT: Can't say for sure what's causing this, or under what circumstances, but requestTimeout is definitely bugged.
Hello,
We are seeing a lot of unexpected status 408 responses as well. Most notably in our CI environment running Node.js 18.17.1 in Google Cloud Build on n1-highcpu-32 VMs (x86_64). We see 408s for pretty much all requests, even very simple GET requests. The test HTTP servers use default options/timeouts. Our busiest test pipeline saw at least one unexpected 408 request timeout in about 50% of the builds.
We traced this down to a bug in the logic behind connectionsCheckingInterval: the connection cleanup routine uses uv_hrtime() readings to track the last activity per connection. It uses uv_hrtime() - requestTimeout * 1e6 / uv_hrtime() - headersTimeout * 1e6 as the reference for staleness, stored in an unsigned 64-bit integer.
Per the docs of uv_hrtime(), it returns arbitrary but monotonic clock readings.
https://github.com/nodejs/node/blob/c9c958e188af5c2a64ff945dbd2604f19b53004e/deps/uv/docs/src/misc.rst?plain=1#L563-L572
By definition, uv_hrtime() readings may be less than the configured request or header timeouts. In turn, the subtractions may overflow and flag all connections as stale. I'll add a few example readings below.
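To make the failure mode concrete, here is a minimal JavaScript sketch (an illustration with a hypothetical hrtime reading, not Node's actual C++ cleanup code) of how the unsigned 64-bit subtraction wraps around shortly after process start:

const UINT64 = 2n ** 64n;

// Staleness reference as computed before the fix: uv_hrtime() - timeout,
// stored in an unsigned 64-bit integer (a negative result wraps around).
function stalenessReference(hrtimeNs, timeoutNs) {
  return ((hrtimeNs - timeoutNs) % UINT64 + UINT64) % UINT64;
}

const hrtimeNs = 5n * 1_000_000_000n;           // 5 s of runtime (hypothetical reading)
const requestTimeoutNs = 300n * 1_000_000_000n; // 300 s, the default requestTimeout

console.log(stalenessReference(hrtimeNs, requestTimeoutNs));
// => a value just below 2^64 (the subtraction wrapped around), so every
// connection's last activity timestamp looks "older" than the reference and
// all connections get flagged as stale.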
The overflow bug has been fixed on the main and 20.x branches already: https://github.com/nodejs/node/pull/48291. The patch applies cleanly to 18.17.1, and our CI pipeline is happy again!
WDYT about backporting the fix to Node.js 18 via 18.x-staging? I'm happy to open a PR.
Also, WDYT about updating the comment in the above patch to explain that this affects ALL environments, not just IoT/embedded devices? Something like this: s/On IoT or embedded devices the uv_hrtime() may return the timestamp/By definition, uv_hrtime() may return a timestamp/? Again, I'm happy to open a PR for main, include the comment tweak in the backport PR for 18.x-staging, and open another one for 20.x-staging (if needed).
The #48291 fix will set the timeout to infinite if the hrtime is too low? If a shorter timeout is requested, is that not unexpected?
You can see uv_hrtime() as a counter that gets incremented for every nanosecond of program runtime. The counter is initialized with a random number x, which is picked from 0 < x < Date.now()*1e6. For two consecutive calls of uv_hrtime() we get values y and z, with x <= y <= z (the counter only increments, so we can either observe the same value or a greater one). y will be our last_message_start_, as assigned when a new message starts, and z will be now, as read by the connection expiry check. In case uv_hrtime() returns now < timeout, we can deduce that last_message_start_ < timeout as well, via x <= y <= z / x <= last_message_start_ <= now.
In other words, when the counter has not yet been incremented timeout times, we can deduce that no prior counter reading was greater than timeout, so subtracting timeout from now would yield a negative number or zero (i.e. it would point at a time before, or at, program start). This in turn allows us to deduce that no connection can have been stale for more than timeout nanoseconds yet, and we do not need to check any connections for staleness regarding this timeout. We flag a timeout as "do not check" by setting it to zero.
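Expressed as a JavaScript sketch (an illustration of the logic described above, not the actual patch), the guarded reference computation looks roughly like this:

// Staleness reference with the guard: if the monotonic clock has not yet
// advanced past the timeout, return 0 ("do not check"), because no connection
// can possibly have been idle for longer than the timeout yet.
function stalenessReferenceFixed(hrtimeNs, timeoutNs) {
  return hrtimeNs > timeoutNs ? hrtimeNs - timeoutNs : 0n;
}

console.log(stalenessReferenceFixed(5n * 1_000_000_000n, 300n * 1_000_000_000n)); // => 0n, skip the check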
https://github.com/nodejs/node/pull/48291 has been backported to Node 18.18 via https://github.com/nodejs/node/commit/977e9a38b4709a2bf81dbf1f7e3feff66c979c40. We were able to revert our workaround now.
Version
v18.14.0
Platform
Linux prometheus 6.1.9-200.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Feb 2 00:21:48 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Subsystem
http
What steps will reproduce the bug?
Create an HTTP server with a big requestTimeout (a day in this example), then do a long-running request to it.
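A minimal sketch of such a reproduction (the port and handler are assumptions, mirroring the script at the top of this thread):

// Server with a one-day requestTimeout (assumed reproduction sketch)
require('node:http').createServer({
  requestTimeout: 24 * 60 * 60 * 1e3 // a day, in milliseconds
}, (req, res) => {
  req.resume().on('end', () => res.end('Ok')).on('error', console.warn);
}).listen(8080);

A long-running request can then be issued with, for example, curl -T /dev/urandom http://localhost:8080 | cat, as in the earlier comments.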
How often does it reproduce? Is there a required condition?
It happens every time, and I haven't been able to pinpoint the exact value at which it becomes problematic.
What is the expected behavior?
It should interrupt the request after 24 hours.
What do you see instead?
It interrupts the request after ~30 seconds.
Additional information
I can work around this by disabling the timeout completely (requestTimeout: 0), but setting a value, even a high one, seemed preferable. Any other suggestions on how to handle long-running uploads?