fs.readFile is slower in v10

zbjornson commented 5 years ago

Version: v10.15.0
Platform: Ubuntu 16, Win 10. Haven't tested macOS
Subsystem: fs

I'm seeing a 7.6-13.5x drop in read throughput between 8.x and 10.x in both the readfile benchmark and our real-world benchmarks that heavily exercise fs.readFile. Based on my troubleshooting, I think it's from #17054.

The readfile benchmark (Ubuntu 16):

Test	v8.15.0	v10.15.0	8 ÷ 10
concurrent=1 len=1024	6,661	7,066	1.06x
concurrent=10 len=1024	23,100	21,079	0.91x
concurrent=1 len=16777216	156.6	11.6	13.5x
concurrent=10 len=16777216	584	76.6	7.6x

From what I can extract from the comments in #17054, either no degradation or a 3.6-4.8x degradation was expected for the len=16M cases.

As for why I think it's because of #17054, the benchmark below compares fs.readFile against an approximation of how fs.readFile used to work (one-shot read), measuring time to read the same 16 MB file 50 times.

// npm i async

const fs = require("fs");
const async = require("async");

function chunked(filename, cb) {
    fs.readFile(filename, cb);
}

function oneshot(filename, cb) {
    fs.open(filename, "r", 0o666, (err, fd) => {
        if (err) return cb(err);
        const onerr = err => fs.close(fd, () => cb(err));
        fs.fstat(fd, (err, stats) => {
            if (err) return onerr(err);
            const data = Buffer.allocUnsafe(stats.size);
            fs.read(fd, data, 0, stats.size, 0, (err, bytesRead) => {
                if (err) return onerr(err);
                if (bytesRead !== stats.size) {
                    const err = new Error("Read fewer bytes than requested");
                    return onerr(err);
                }
                fs.close(fd, err => cb(err, data));
            });
        });
    });
}

fs.writeFileSync("./test.dat", Buffer.alloc(16e6, 'x'));

function bm(method, name, cb) {
    const start = Date.now();
    async.timesSeries(50, (n, next) => {
        method("./test.dat", next);
    }, err => {
        if (err) return cb(err);
        const diff = Date.now() - start;
        console.log(name, diff);
        cb();
    });
}

async.series([
    cb => bm(chunked, "fs.readFile()", cb),
    cb => bm(oneshot, "oneshot", cb)
])

Node.js	OS	`fs.readFile()` (ms)	one-shot (ms)
v10.15.0	Ubuntu 16	7320	370
v8.15.0	Ubuntu 16	693	378
v10.15.0	Win 10	2972	493

We've switched to fs.fstat() and then fs.read() as a work-around (above; note how many lines of code it is), but I wouldn't be surprised if this has negatively impacted other apps/tools. As far as the original justification: I'm not convinced that was an appropriate fix or a problem that needed fixing. Web servers (where DoS attacks are relevant) should generally be using fs.createReadStream.pipe(response) for serving files (ignoring other use cases for now). Other sorts of apps like build tools, compilers and test frameworks (DoS attacks irrelevant) are the ones that more often need to read an entire file as fast as possible. In some senses, the fix made the DoS situation worse by increasing overhead (reducing number of useful CPU cycles) and by infinitely partitioning reads (each new request delays all existing requests -- the requests are waiting for their own ticks plus some number of ticks from all of the other requests). We built our web app to load-shed while maintaining quality of service for as many clients as possible, so interleaving is not what we want. -- Happy to discuss that more on- or off-line.

Is anyone else able to verify that this degradation exists and/or was expected?

cc: @davisjam

addaleax commented 5 years ago

I agree that https://github.com/nodejs/node/pull/17054 seems likely as a source of the issue, and I agree with your reasoning re: stream usage.

jorangreef commented 5 years ago

@zbjornson I think your benchmark is artificial in that you're measuring the latency of a 16mb readFile() vs reading 16mb in one shot?

For the general case, readFile() should definitely not be reading 16mb in one shot, since this would affect not just the latency of a single readFile() call (which would improve), but also the throughput of the Node.js threadpool and system as a whole (which would degrade).

I think https://github.com/nodejs/node/pull/17054 was a brilliant contribution.

We built our web app to load-shed while maintaining quality of service for as many clients as possible, so interleaving is not what we want.

In fact, your moving to fs.read() is the right way to go about this. You can express your buffering intention more clearly, since you are in a position to know the latency and throughput characteristics you are targeting much better than Node. Node can't know what kind of service you're running, or whether you want to read 16mb in one shot or not. Reading 16mb in one shot should not be the default.

Perhaps fs.readFile() could accept a partition option to set the size of the buffer used for reading, but I don't think https://github.com/nodejs/node/pull/17054 should be blamed here.

antsmartian commented 5 years ago

Did a quick dirty check, I reverted the change https://github.com/nodejs/node/pull/17054, so now the example benchmark looks like:

//node master
fs.readFile() 2358
oneshot 459

//after the revert - node master
fs.readFile() 475
oneshot 492

Here is the diff BTW:

diff --git a/lib/internal/fs/read_file_context.js b/lib/internal/fs/read_file_context.js
index d7543ffa5a..341d00c148 100644
--- a/lib/internal/fs/read_file_context.js
+++ b/lib/internal/fs/read_file_context.js
@@ -79,7 +79,7 @@ class ReadFileContext {
     } else {
       buffer = this.buffer;
       offset = this.pos;
-      length = Math.min(kReadFileBufferLength, this.size - this.pos);
+      length = this.size - this.pos;
     }

     const req = new FSReqCallback();

mcollina commented 5 years ago

I think https://github.com/nodejs/node/blob/26f80dcddd441c064df82d5caddfdcafbd36350f/lib/internal/fs/read_file_context.js#L7 might need to be increased for this use case, as 8KB seems not big enough to me to balance the two interests. Providing an option to set it might be a good call, as moving it to somewhere between 64KB to 512KB might provide better compromises.

jdalton commented 5 years ago

Does this affect fs.readFileSync too?

zbjornson commented 5 years ago

@jdalton no, fs.readFileSync benchmarks the same between v8.x and v10.x.

@jorangreef The benchmark reflects a real-world use case (asynchronously read a file as fast as possible in a scenario where this type of concurrency isn't an issue and/or it's an issue that can be mitigated by increasing the threadpool size). I'd love to know where fs.readFile() is used in web applications currently, and if any of those cases are ones that shouldn't be using fs.createReadStream() (which effectively partitions).

Node can't know what kind of service you're running, or whether you want to read 16mb in one shot or not.

This is equally valid to support not partitioning reads. Note in my OP that it's 16 LOC for (I think) a safe implementation of a non-partitioning read, while concatenating a stream is only 5 LOC (below). For that reason, it would be nicer for Node.js to provide a one-shot method and let users create a partitioning method.

function readPartitioned(filename, cb) {
  const rs = fs.createReadStream(filename);
  const buffs = [];
  fs.on("data", d => buffs.push(d));
  fs.on("error", e => cb(e));
  fs.on("end", () => cb(null, Buffer.concat(buffs)));
}

[affects] the throughput of the Node.js threadpool and system as a whole (which would degrade).

I did sort of a poor job trying to explain this in my OP, but partitioning still causes degradation, and in some senses, causes worse degradation due to the increased overhead.

jorangreef commented 5 years ago

@zbjornson ,

Your argument is kind of like saying TCP should not have congestion control. A single TCP connection should be given maximum bandwidth, to download a file "as fast as possible".

The problem with this, is that it ignores TCP as a system of connections, and allows a single connection to starve all others. I don't think that leaving congestion control out of TCP is a good default, most users of TCP won't realize their need for it, and the Internet might collapse.

Sure, congestion control has an effect on single connection download latency, but it's what most users of TCP want (whether they know it or not). If you really want "fast as possible" downloads, you are always free to use something custom such as UDT, and then I think metrics such as LOC are not the biggest concern.

readFile() should remain partitioned for the same reasons, and @mcollina has already provided the way forward.

zbjornson commented 5 years ago

(TCP congestion control is adaptive to network conditions and is designed to allow large segments when there is low congestion to maximize throughput. Always partitioning fs reads is like always running with a small segment size. You also generally can't quickly add more bandwidth to a network to mitigate congestion, whereas you can increase the threadpool size, number of cluster workers and/or number of servers running a web app.)

If increasing the partition size works, then that's a perfect solution. But my point is that the primary use cases for fs.readFile I think are not in web servers (rather in build tools, test frameworks, compilers), so partitioning unnecessarily deoptimizes those major uses cases.

mcollina commented 5 years ago

I disagree. Avoiding partitioning optimizes with the usecase of “only one read at a time”, and in fact you could potentially replace that with a single readFileSync call. Asking for a single thread to block to read a 1GB file is problematic.

The problem we are facing atm is due to the high cost of moving control back-and-forth between JS and C++. Increasing the default chunk size and making it customizable will help significantly.

zbjornson commented 5 years ago

Offered for discussion: here's an alternative solution that uses aio_read(3). https://github.com/zbjornson/node-aio

That has nearly the same performance as a one-shot read, and unblocks the event loop almost entirely, while (I think) working around the issues that prevented libuv from using that method in the first place.

If folks think that's a viable solution, I'd be happy to work on a PR for it.

Trott commented 5 years ago

Offered for discussion: here's an alternative solution that uses aio_read(3). https://github.com/zbjornson/node-aio

That has nearly the same performance as a one-shot read, and unblocks the event loop almost entirely, while (I think) working around the issues that prevented libuv from using that method in the first place.

If folks think that's a viable solution, I'd be happy to work on a PR for it.

@nodejs/libuv

bnoordhuis commented 5 years ago

aio_read() uses threads too (or rather, libaio does) so no silver bullet there. :)

zbjornson commented 5 years ago

Aye, but it offloads thread scheduling to the kernel/OS. Think it's a no-go?

bnoordhuis commented 5 years ago

The parameters might be a little different from libuv but it's still a threadpool under the hood. I don't think it's going to make a material difference.

mcollina commented 5 years ago

The parameters might be a little different from libuv but it's still a threadpool under the hood. I don't think it's going to make a material difference.

I agree. However the aio implementation would not give back control to JS after reading a chunk, and crossing that barrier adds up pretty quickly with 8KB chunks. We can definitely add the same to libuv, or mitigate it on the Node side with larger chunks by default.

zbjornson commented 5 years ago

I'll run and post some benchmarks with varying chunk sizes and concurrency later today! That'll work off of a branch that allows specifying the chunk size, so if it becomes a parameter to readFile, that commit could become a PR.

zbjornson commented 5 years ago

Here are curves showing various chunk sizes for two file sizes for several numbers of concurrent reads of different files. This used this branch, which adds a chunkSize option to fs.readFile.

Changing the default from 8 KiB to 512 KiB gets to >50% of the max in all cases I tried. From this comment it sounds like the intended default was 64 KiB? — Raising the default plus having the option to set it to Infinity means the workaround is easy. Nonetheless I'm still not sure it's wise to partition by default.

The parameters might be a little different from libuv but it's still a threadpool under the hood. I don't think it's going to make a material difference.

There are a few differences that might inspire a better solution in libuv/Node.js than always partitioning.

There are more threads by default (4 permanent in libuv vs. up to 20 dynamic in glibc's aio), and they are only used for I/O. Since blocking read calls don't use [much] CPU time, it makes sense that the thread count isn't tied to the number of CPU cores, and that these threads are separate from CPU-bound activities like crypto/zlib.
When there are more requests than threads, aio queues new requests instead of partitioning ones that are already running. This isn't too different from how Node.js used to be. This avoids unbounded degradation of in-progress reads. In the timeline plots below (actual measurements), I'm showing the effect of a new batch of reads (orange) that are issued just after 20 small reads (gray) end and while one slightly larger read (blue) is running. With partitioning, the new orange requests cause the blue request to take an additional >200 ms vs. without partitioning. Meanwhile, nothing is improved in exchange. (The blue file is slightly larger than the others so that it's easier to see this effect, but it happens to some degree at any scale.)

Also note that by increasing the duration of reads, partitioning can increase the memory requirement vs. letting earlier reads complete and their resulting buffer possibly/likely being released sooner. (Above, the blue req's buffer is reserved for an extra 200 ms.)

the aio implementation would not give back control to JS after reading a chunk, and crossing that barrier adds up pretty quickly

Not just JS/CPP transitions, but also more syscalls+context switches and possibly issued I/O ops. Below are the strace summaries for the various methods. It's a shame to waste cycles especially when there's no contention for execution resources.

strace summaries

``` for fs.readFile() 8k % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 45.41 3.183924 21 150349 27 futex 37.34 2.617896 27 97852 epoll_pwait 16.29 1.142409 12 97861 read for fs.readFile() 512k % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 71.34 0.676356 392 1725 epoll_pwait 11.83 0.112131 35 3244 26 futex 8.66 0.082141 47 1734 read for oneshot % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 85.70 0.559448 2049 273 epoll_pwait 3.16 0.020625 24 856 22 futex 0.80 0.005234 19 282 read for aio % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 49.36 0.126387 25 4988 4 futex 24.04 0.061566 23 2691 epoll_pwait 17.13 0.043858 16 2709 read ```

I'd love to hear counterarguments, but it seems like partitioning has replaced one theoretical DoS vector (via blocking ~~the event loop~~ edit a worker thread /edit through an API that generally shouldn't be used in Web servers) with several new issues that may provide other DoS vectors (via infinitely accepting new work instead of waiting for some existing work to finish, via increasing memory requirements, and via reducing overall performance).

jorangreef commented 5 years ago

an API that generally shouldn't be used in Web servers

Why not? Because you think streams cover all use-cases?

it seems like partitioning has replaced one theoretical DoS vector [...] with several new issues that may provide other DoS vectors

The DoS vector addressed by partitioning is not theoretical. Please don't muddy the water.

I would love to see working exploits of your non-theoretical DoS vectors (plural).

infinitely accepting new work instead of waiting for some existing work to finish

It was never the goal of partitioning to safeguard users from infinitely accepting new work. That's a strawman.

There are more threads by default (4 permanent in libuv vs. up to 20 dynamic in glibc's aio), and they are only used for I/O. Since blocking read calls don't use [much] CPU time, it makes sense that the thread count isn't tied to the number of CPU cores, and that these threads are separate from CPU-bound activities like crypto/zlib.

You might want to look into the multi-threadpool work being doing by @davisjam, the author of the partitioning patch you are so opposed to.

mcollina commented 5 years ago

@zbjornson your analysis is 100% correct if you consider an application whose scope is to only read files. A typical Node.js application uses the thread pool for all sort of operations, including some critical V8 functions, such as the parallel garbage collector. The previous implementation could saturate the thread pool because the time spent reading was unbounded.

addaleax commented 5 years ago

A typical Node.js application uses the thread pool for all sort of operations, including some critical V8 functions, such as the parallel garbage collector.

(V8 operations are currently on a separate threadpool.)

mcollina commented 5 years ago

@addaleax is that documented/written somewhere? I'm asking because I thought that and somebody at the last collaborator summit said it was scheduled on the same pool.

The concept still holds true, the threadpool is shared between a lot of different operations, and big file reads could starve it.

addaleax commented 5 years ago

@mcollina Only in the fact that it’s currently fully separate code. https://github.com/nodejs/node/pull/22631 would possibly change this, but that doesn’t mean that we wouldn’t add mechanisms that prevent starvation of V8 tasks (e.g. keeping threads free for them).

sam-github commented 5 years ago

an API that generally shouldn't be used in Web servers

Why not? Because you think streams cover all use-cases?

Streams are specifically designed to partition work so that I/O can be interleaved between concurrent requests.

I don't understand why fs.readFile() should also support that use-case. Its one-shot, for better or worse. If its worse for a specific use-case, why shouldn't streams be used?

mcollina commented 5 years ago

@sam-github I’m not happy of blocking a thread for a time that is function of the input size. I would be happy to interleave at 1MB or more, even.

sam-github commented 5 years ago

@mcollina I don't feel really strongly about this, but I do think some APIs are not appropriate for some uses. The *Sync() APIs, dns.lookup(), etc. We push streams pretty hard for this particular use-case! If a much bigger buffer resolves this sufficiently, lets do it, but I do worry that it muddies our messaging.

mcollina commented 5 years ago

I agree. I think we always said “use readFile because it’s safe and it does not block the event loop” and without the interleaving this sentence is not correct. I prefer not to change our messaging regarding this and add some interleaving.

zbjornson commented 5 years ago

@jorangreef I'm sorry if anything I said came off as rude or an attack. My only goal here is to point out the current issues and try to propose better solutions.

an API that generally shouldn't be used in Web servers

Why not? Because you think streams cover all use-cases?

Not all, but most of the ones that come to mind based on the discussion in #17054. @davisjam even said fs.readFile is not a good idea in the server context in https://github.com/nodejs/node/pull/17054#issuecomment-345029409.

The DoS vector addressed by partitioning is not theoretical. Please don't muddy the water.

"Theoretical" in the sense that @bnoordhuis described it as theoretical in https://github.com/nodejs/node/pull/17054#issuecomment-344916812 and https://github.com/nodejs/node/pull/17054#issuecomment-344944545 -- no users have opened issues for it and I've not seen it described elsewhere as a vulnerability. @davisjam's explanation of it as a DoS vector required using fs.readFile() insead of fs.createReadStream() and unsanitized user input.

A concrete example @davisjam provides is fs.readFile("/dev/zero|random"). In Node.js v10.x, the only thing that has changed is that the OOM crash happens later than in v8.x, after Node.js has spent more time doing wasteful work than it did in v8.x. (edit Note that using streams here would be "safe" -- the attacker would just get 0s or random bytes.)

I would love to see working exploits of your non-theoretical DoS vectors (plural).

I didn't say any of the three new issues are DoS vectors; I said they are issues that may provide other DoS vectors. I hesitate to call anything that the user has control over a "DoS vector," otherwise you can say the Array constructor is a DoS vector also. Nonetheless, in my previous post I think I explained fairly thoroughly how these new issues can lead to resource exhaustion, which could be considered vectors.

It was never the goal of partitioning to safeguard users from infinitely accepting new work.

Partitioning made this an issue. Previously Node.js would queue up work instead of accepting work infinitely.

the multi-threadpool work

I'm a fan of it and hope it or a variant of it lands, as I think it's a better solution than partitioning. @bnoordhuis has investigated a separate I/O threadpool several times per https://github.com/nodejs/node/issues/11855 and I'd love to get his input on that vs. partitioning. The extent of my own testing was limited to the aio repo I posted earlier, which uses a separate threadpool just for I/O, and it seemed promising.

does not block the event loop

@mcollina I think that was a typo, but it doesn't block the event loop, it blocks one of the worker threads. The API docs right now just discourage use of the *Sync methods, so the messaging I think remains clear. At the risk of another lengthy post that might be misconstrued as argumentative, I can explain why I think this blocking is preferable to partitioning if you want, but it's rooted in the issues I described in the previous post.

uses the thread pool for all sort of operations, including some critical V8 functions

(The reasoning that's not the case right now is described in https://github.com/nodejs/node/pull/14001#issuecomment-321453386. I'd actually consider the threadpool usage "limited" -- just those listed in https://nodejs.org/api/cli.html#cli_uv_threadpool_size_size -- while acknowledging that many HTTP server apps use zlib and crypto.)

jorangreef commented 5 years ago

@zbjornson,

The thing is, even without partitioning, fs.readFile() is never going to read a file "as fast as possible".

If "as fast as possible" is your goal, and it looks like it is, then you should be using:

a cache of open file descriptors to avoid open() calls
static allocation instead of dynamic allocation to avoid gc pressure
guaranteed sector-aligned buffers using posix_memalign or _aligned_malloc to make it easy for the kernel
O_DIRECT or FILE_FLAG_NO_BUFFERING to avoid copying to the kernel's cache
a user-space block cache to reduce cache misses and avoid accessing the kernel's cache

You sacrifice all of these when you use fs.readFile().

For your use case, you should be using fs.read(), as you're doing now.

davisjam commented 5 years ago

I see I'm a bit late to this discussion.

@zbjornson, you're right that #17054 may have a non-trivial performance impact for certain workloads.

My general position is with @jorangreef: if you want to read files as fast as possible, you should write a C++ add-on or use the Node.js syscall wrappers. Indeed you seem to have discovered this :-).

A higher-level API like fs.readFile can reasonably be expected to prioritize factors other than performance. My intent in #17054 was to favor small requests over (potentially pathologically) large ones. In the DoS context the effect of #17054 was to force an attacker to submit many more than ~4 requests in order to trigger appreciable slowdowns, with the goal of shifting the attack from an asymmetric "Event Handler Poisoning" exploit to a standard "Submit lots of annoying jobs" exploit that could be defeated using e.g. rate limiting as @jorangreef pointed out above.

I agree with @mcollina that

Providing an option to set it might be a good call, as moving it to somewhere between 64KB to 512KB might provide better compromises.

If you feel this is a critical problem, then PRs welcome!

jasnell commented 4 years ago

I know this was not fully resolved but does this need to remain open at this point?

nodejs / node

fs.readFile is slower in v10 #25741