Closed zbjornson closed 3 years ago
I agree that https://github.com/nodejs/node/pull/17054 seems likely as a source of the issue, and I agree with your reasoning re: stream usage.
@zbjornson I think your benchmark is artificial in that you're measuring the latency of a 16mb readFile()
vs reading 16mb in one shot?
For the general case, readFile()
should definitely not be reading 16mb in one shot, since this would affect not just the latency of a single readFile()
call (which would improve), but also the throughput of the Node.js threadpool and system as a whole (which would degrade).
I think https://github.com/nodejs/node/pull/17054 was a brilliant contribution.
We built our web app to load-shed while maintaining quality of service for as many clients as possible, so interleaving is not what we want.
In fact, your moving to fs.read()
is the right way to go about this. You can express your buffering intention more clearly, since you are in a position to know the latency and throughput characteristics you are targeting much better than Node. Node can't know what kind of service you're running, or whether you want to read 16mb in one shot or not. Reading 16mb in one shot should not be the default.
Perhaps fs.readFile()
could accept a partition
option to set the size of the buffer used for reading, but I don't think https://github.com/nodejs/node/pull/17054 should be blamed here.
Did a quick dirty check, I reverted the change https://github.com/nodejs/node/pull/17054, so now the example benchmark looks like:
//node master
fs.readFile() 2358
oneshot 459
//after the revert - node master
fs.readFile() 475
oneshot 492
Here is the diff BTW:
diff --git a/lib/internal/fs/read_file_context.js b/lib/internal/fs/read_file_context.js
index d7543ffa5a..341d00c148 100644
--- a/lib/internal/fs/read_file_context.js
+++ b/lib/internal/fs/read_file_context.js
@@ -79,7 +79,7 @@ class ReadFileContext {
} else {
buffer = this.buffer;
offset = this.pos;
- length = Math.min(kReadFileBufferLength, this.size - this.pos);
+ length = this.size - this.pos;
}
const req = new FSReqCallback();
I think https://github.com/nodejs/node/blob/26f80dcddd441c064df82d5caddfdcafbd36350f/lib/internal/fs/read_file_context.js#L7 might need to be increased for this use case, as 8KB seems not big enough to me to balance the two interests. Providing an option to set it might be a good call, as moving it to somewhere between 64KB to 512KB might provide better compromises.
Does this affect fs.readFileSync
too?
@jdalton no, fs.readFileSync
benchmarks the same between v8.x and v10.x.
@jorangreef The benchmark reflects a real-world use case (asynchronously read a file as fast as possible in a scenario where this type of concurrency isn't an issue and/or it's an issue that can be mitigated by increasing the threadpool size). I'd love to know where fs.readFile()
is used in web applications currently, and if any of those cases are ones that shouldn't be using fs.createReadStream()
(which effectively partitions).
Node can't know what kind of service you're running, or whether you want to read 16mb in one shot or not.
This is equally valid to support not partitioning reads. Note in my OP that it's 16 LOC for (I think) a safe implementation of a non-partitioning read, while concatenating a stream is only 5 LOC (below). For that reason, it would be nicer for Node.js to provide a one-shot method and let users create a partitioning method.
function readPartitioned(filename, cb) {
const rs = fs.createReadStream(filename);
const buffs = [];
fs.on("data", d => buffs.push(d));
fs.on("error", e => cb(e));
fs.on("end", () => cb(null, Buffer.concat(buffs)));
}
[affects] the throughput of the Node.js threadpool and system as a whole (which would degrade).
I did sort of a poor job trying to explain this in my OP, but partitioning still causes degradation, and in some senses, causes worse degradation due to the increased overhead.
@zbjornson ,
Your argument is kind of like saying TCP should not have congestion control. A single TCP connection should be given maximum bandwidth, to download a file "as fast as possible".
The problem with this, is that it ignores TCP as a system of connections, and allows a single connection to starve all others. I don't think that leaving congestion control out of TCP is a good default, most users of TCP won't realize their need for it, and the Internet might collapse.
Sure, congestion control has an effect on single connection download latency, but it's what most users of TCP want (whether they know it or not). If you really want "fast as possible" downloads, you are always free to use something custom such as UDT, and then I think metrics such as LOC are not the biggest concern.
readFile()
should remain partitioned for the same reasons, and @mcollina has already provided the way forward.
(TCP congestion control is adaptive to network conditions and is designed to allow large segments when there is low congestion to maximize throughput. Always partitioning fs reads is like always running with a small segment size. You also generally can't quickly add more bandwidth to a network to mitigate congestion, whereas you can increase the threadpool size, number of cluster workers and/or number of servers running a web app.)
If increasing the partition size works, then that's a perfect solution. But my point is that the primary use cases for fs.readFile
I think are not in web servers (rather in build tools, test frameworks, compilers), so partitioning unnecessarily deoptimizes those major uses cases.
I disagree. Avoiding partitioning optimizes with the usecase of “only one read at a time”, and in fact you could potentially replace that with a single readFileSync call. Asking for a single thread to block to read a 1GB file is problematic.
The problem we are facing atm is due to the high cost of moving control back-and-forth between JS and C++. Increasing the default chunk size and making it customizable will help significantly.
Offered for discussion: here's an alternative solution that uses aio_read(3)
. https://github.com/zbjornson/node-aio
That has nearly the same performance as a one-shot read, and unblocks the event loop almost entirely, while (I think) working around the issues that prevented libuv from using that method in the first place.
If folks think that's a viable solution, I'd be happy to work on a PR for it.
Offered for discussion: here's an alternative solution that uses aio_read(3). https://github.com/zbjornson/node-aio
That has nearly the same performance as a one-shot read, and unblocks the event loop almost entirely, while (I think) working around the issues that prevented libuv from using that method in the first place.
If folks think that's a viable solution, I'd be happy to work on a PR for it.
@nodejs/libuv
aio_read() uses threads too (or rather, libaio does) so no silver bullet there. :)
Aye, but it offloads thread scheduling to the kernel/OS. Think it's a no-go?
The parameters might be a little different from libuv but it's still a threadpool under the hood. I don't think it's going to make a material difference.
The parameters might be a little different from libuv but it's still a threadpool under the hood. I don't think it's going to make a material difference.
I agree. However the aio implementation would not give back control to JS after reading a chunk, and crossing that barrier adds up pretty quickly with 8KB chunks. We can definitely add the same to libuv, or mitigate it on the Node side with larger chunks by default.
I'll run and post some benchmarks with varying chunk sizes and concurrency later today! That'll work off of a branch that allows specifying the chunk size, so if it becomes a parameter to readFile, that commit could become a PR.
Here are curves showing various chunk sizes for two file sizes for several numbers of concurrent reads of different files. This used this branch, which adds a chunkSize
option to fs.readFile
.
Changing the default from 8 KiB to 512 KiB gets to >50% of the max in all cases I tried. From this comment it sounds like the intended default was 64 KiB? — Raising the default plus having the option to set it to Infinity
means the workaround is easy. Nonetheless I'm still not sure it's wise to partition by default.
The parameters might be a little different from libuv but it's still a threadpool under the hood. I don't think it's going to make a material difference.
There are a few differences that might inspire a better solution in libuv/Node.js than always partitioning.
There are more threads by default (4 permanent in libuv vs. up to 20 dynamic in glibc's aio), and they are only used for I/O. Since blocking read
calls don't use [much] CPU time, it makes sense that the thread count isn't tied to the number of CPU cores, and that these threads are separate from CPU-bound activities like crypto/zlib.
When there are more requests than threads, aio queues new requests instead of partitioning ones that are already running. This isn't too different from how Node.js used to be. This avoids unbounded degradation of in-progress reads. In the timeline plots below (actual measurements), I'm showing the effect of a new batch of reads (orange) that are issued just after 20 small reads (gray) end and while one slightly larger read (blue) is running. With partitioning, the new orange requests cause the blue request to take an additional >200 ms vs. without partitioning. Meanwhile, nothing is improved in exchange. (The blue file is slightly larger than the others so that it's easier to see this effect, but it happens to some degree at any scale.)
Also note that by increasing the duration of reads, partitioning can increase the memory requirement vs. letting earlier reads complete and their resulting buffer possibly/likely being released sooner. (Above, the blue req's buffer is reserved for an extra 200 ms.)
the aio implementation would not give back control to JS after reading a chunk, and crossing that barrier adds up pretty quickly
Not just JS/CPP transitions, but also more syscalls+context switches and possibly issued I/O ops. Below are the strace summaries for the various methods. It's a shame to waste cycles especially when there's no contention for execution resources.
I'd love to hear counterarguments, but it seems like partitioning has replaced one theoretical DoS vector (via blocking the event loop edit a worker thread /edit through an API that generally shouldn't be used in Web servers) with several new issues that may provide other DoS vectors (via infinitely accepting new work instead of waiting for some existing work to finish, via increasing memory requirements, and via reducing overall performance).
an API that generally shouldn't be used in Web servers
Why not? Because you think streams cover all use-cases?
it seems like partitioning has replaced one theoretical DoS vector [...] with several new issues that may provide other DoS vectors
The DoS vector addressed by partitioning is not theoretical. Please don't muddy the water.
I would love to see working exploits of your non-theoretical DoS vectors (plural).
infinitely accepting new work instead of waiting for some existing work to finish
It was never the goal of partitioning to safeguard users from infinitely accepting new work. That's a strawman.
There are more threads by default (4 permanent in libuv vs. up to 20 dynamic in glibc's aio), and they are only used for I/O. Since blocking read calls don't use [much] CPU time, it makes sense that the thread count isn't tied to the number of CPU cores, and that these threads are separate from CPU-bound activities like crypto/zlib.
You might want to look into the multi-threadpool work being doing by @davisjam, the author of the partitioning patch you are so opposed to.
@zbjornson your analysis is 100% correct if you consider an application whose scope is to only read files. A typical Node.js application uses the thread pool for all sort of operations, including some critical V8 functions, such as the parallel garbage collector. The previous implementation could saturate the thread pool because the time spent reading was unbounded.
A typical Node.js application uses the thread pool for all sort of operations, including some critical V8 functions, such as the parallel garbage collector.
(V8 operations are currently on a separate threadpool.)
@addaleax is that documented/written somewhere? I'm asking because I thought that and somebody at the last collaborator summit said it was scheduled on the same pool.
The concept still holds true, the threadpool is shared between a lot of different operations, and big file reads could starve it.
@mcollina Only in the fact that it’s currently fully separate code. https://github.com/nodejs/node/pull/22631 would possibly change this, but that doesn’t mean that we wouldn’t add mechanisms that prevent starvation of V8 tasks (e.g. keeping threads free for them).
an API that generally shouldn't be used in Web servers
Why not? Because you think streams cover all use-cases?
Streams are specifically designed to partition work so that I/O can be interleaved between concurrent requests.
I don't understand why fs.readFile()
should also support that use-case. Its one-shot, for better or worse. If its worse for a specific use-case, why shouldn't streams be used?
@sam-github I’m not happy of blocking a thread for a time that is function of the input size. I would be happy to interleave at 1MB or more, even.
@mcollina I don't feel really strongly about this, but I do think some APIs are not appropriate for some uses. The *Sync()
APIs, dns.lookup()
, etc. We push streams pretty hard for this particular use-case! If a much bigger buffer resolves this sufficiently, lets do it, but I do worry that it muddies our messaging.
I agree. I think we always said “use readFile because it’s safe and it does not block the event loop” and without the interleaving this sentence is not correct. I prefer not to change our messaging regarding this and add some interleaving.
@jorangreef I'm sorry if anything I said came off as rude or an attack. My only goal here is to point out the current issues and try to propose better solutions.
an API that generally shouldn't be used in Web servers
Why not? Because you think streams cover all use-cases?
Not all, but most of the ones that come to mind based on the discussion in #17054. @davisjam even said fs.readFile
is not a good idea in the server context in https://github.com/nodejs/node/pull/17054#issuecomment-345029409.
The DoS vector addressed by partitioning is not theoretical. Please don't muddy the water.
"Theoretical" in the sense that @bnoordhuis described it as theoretical in https://github.com/nodejs/node/pull/17054#issuecomment-344916812 and https://github.com/nodejs/node/pull/17054#issuecomment-344944545 -- no users have opened issues for it and I've not seen it described elsewhere as a vulnerability. @davisjam's explanation of it as a DoS vector required using fs.readFile()
insead of fs.createReadStream()
and unsanitized user input.
A concrete example @davisjam provides is fs.readFile("/dev/zero|random")
. In Node.js v10.x, the only thing that has changed is that the OOM crash happens later than in v8.x, after Node.js has spent more time doing wasteful work than it did in v8.x. (edit Note that using streams here would be "safe" -- the attacker would just get 0s or random bytes.)
I would love to see working exploits of your non-theoretical DoS vectors (plural).
I didn't say any of the three new issues are DoS vectors; I said they are issues that may provide other DoS vectors. I hesitate to call anything that the user has control over a "DoS vector," otherwise you can say the Array constructor is a DoS vector also. Nonetheless, in my previous post I think I explained fairly thoroughly how these new issues can lead to resource exhaustion, which could be considered vectors.
It was never the goal of partitioning to safeguard users from infinitely accepting new work.
Partitioning made this an issue. Previously Node.js would queue up work instead of accepting work infinitely.
the multi-threadpool work
I'm a fan of it and hope it or a variant of it lands, as I think it's a better solution than partitioning. @bnoordhuis has investigated a separate I/O threadpool several times per https://github.com/nodejs/node/issues/11855 and I'd love to get his input on that vs. partitioning. The extent of my own testing was limited to the aio repo I posted earlier, which uses a separate threadpool just for I/O, and it seemed promising.
does not block the event loop
@mcollina I think that was a typo, but it doesn't block the event loop, it blocks one of the worker threads. The API docs right now just discourage use of the *Sync methods, so the messaging I think remains clear. At the risk of another lengthy post that might be misconstrued as argumentative, I can explain why I think this blocking is preferable to partitioning if you want, but it's rooted in the issues I described in the previous post.
uses the thread pool for all sort of operations, including some critical V8 functions
(The reasoning that's not the case right now is described in https://github.com/nodejs/node/pull/14001#issuecomment-321453386. I'd actually consider the threadpool usage "limited" -- just those listed in https://nodejs.org/api/cli.html#cli_uv_threadpool_size_size -- while acknowledging that many HTTP server apps use zlib and crypto.)
@zbjornson,
The thing is, even without partitioning, fs.readFile()
is never going to read a file "as fast as possible".
If "as fast as possible" is your goal, and it looks like it is, then you should be using:
open()
callsposix_memalign
or _aligned_malloc
to make it easy for the kernelO_DIRECT
or FILE_FLAG_NO_BUFFERING
to avoid copying to the kernel's cacheYou sacrifice all of these when you use fs.readFile()
.
For your use case, you should be using fs.read()
, as you're doing now.
I see I'm a bit late to this discussion.
@zbjornson, you're right that #17054 may have a non-trivial performance impact for certain workloads.
My general position is with @jorangreef: if you want to read files as fast as possible, you should write a C++ add-on or use the Node.js syscall wrappers. Indeed you seem to have discovered this :-).
A higher-level API like fs.readFile
can reasonably be expected to prioritize factors other than performance. My intent in #17054 was to favor small requests over (potentially pathologically) large ones. In the DoS context the effect of #17054 was to force an attacker to submit many more than ~4 requests in order to trigger appreciable slowdowns, with the goal of shifting the attack from an asymmetric "Event Handler Poisoning" exploit to a standard "Submit lots of annoying jobs" exploit that could be defeated using e.g. rate limiting as @jorangreef pointed out above.
I agree with @mcollina that
Providing an option to set it might be a good call, as moving it to somewhere between 64KB to 512KB might provide better compromises.
If you feel this is a critical problem, then PRs welcome!
I know this was not fully resolved but does this need to remain open at this point?
I'm seeing a 7.6-13.5x drop in read throughput between 8.x and 10.x in both the readfile benchmark and our real-world benchmarks that heavily exercise
fs.readFile
. Based on my troubleshooting, I think it's from #17054.The readfile benchmark (Ubuntu 16):
From what I can extract from the comments in #17054, either no degradation or a 3.6-4.8x degradation was expected for the len=16M cases.
As for why I think it's because of #17054, the benchmark below compares
fs.readFile
against an approximation of howfs.readFile
used to work (one-shot read), measuring time to read the same 16 MB file 50 times.fs.readFile()
(ms)We've switched to
fs.fstat()
and thenfs.read()
as a work-around (above; note how many lines of code it is), but I wouldn't be surprised if this has negatively impacted other apps/tools. As far as the original justification: I'm not convinced that was an appropriate fix or a problem that needed fixing. Web servers (where DoS attacks are relevant) should generally be usingfs.createReadStream.pipe(response)
for serving files (ignoring other use cases for now). Other sorts of apps like build tools, compilers and test frameworks (DoS attacks irrelevant) are the ones that more often need to read an entire file as fast as possible. In some senses, the fix made the DoS situation worse by increasing overhead (reducing number of useful CPU cycles) and by infinitely partitioning reads (each new request delays all existing requests -- the requests are waiting for their own ticks plus some number of ticks from all of the other requests). We built our web app to load-shed while maintaining quality of service for as many clients as possible, so interleaving is not what we want. -- Happy to discuss that more on- or off-line.Is anyone else able to verify that this degradation exists and/or was expected?
cc: @davisjam