artempyanykh opened this issue 3 years ago
I mean, we already know that it is never going to be as fast as using the blocking APIs directly. Did you try with a non-blocking `std::fs::read_dir`?
@Darksonn

> Did you try with a non-blocking `std::fs::read_dir`?

Sorry, not sure what you mean by a non-blocking `std::fs::read_dir`. `std::fs` provides a blocking API.
My setup is described here in detail, with code and perf data.
> I mean, we already know that it is never going to be as fast as using the blocking APIs directly.
There are several things at play here. First there's overhead from async, then from the `tokio::fs` wrappers, but then there is a speed-up from the parallel processing of files in the case of the `async-par` implementation.

In any case, a 25x to 64x slow-down from going to `tokio::fs` + `async` compared to a blocking version is pretty extreme, isn't it? We're talking about a 200ms (feels instant) vs 12s (feels like an eternity) difference.
What I meant to suggest was to replace `std::fs::read_dir` in the linked code with `tokio::fs::read_dir`.
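For anyone following along, the swap looks roughly like this (a minimal sketch with illustrative function names, not code from the linked repo):

```rust
use std::io;
use std::path::{Path, PathBuf};

// Blocking variant: std::fs::read_dir returns an iterator of DirEntry.
fn list_blocking(path: &Path) -> io::Result<Vec<PathBuf>> {
    let mut out = Vec::new();
    for entry in std::fs::read_dir(path)? {
        out.push(entry?.path());
    }
    Ok(out)
}

// Async variant: tokio::fs::read_dir yields entries via next_entry().await.
async fn list_async(path: &Path) -> io::Result<Vec<PathBuf>> {
    let mut out = Vec::new();
    let mut dir = tokio::fs::read_dir(path).await?;
    while let Some(entry) = dir.next_entry().await? {
        out.push(entry.path());
    }
    Ok(out)
}
```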
It is a big slowdown, and there have been several examples of people building really slow benchmarks and finding some trivial change to their code that yields a massive speedup, but those were all for reading the contents of the files. I think ultimately you are just running into a lot of back-and-forth between a bunch of threads, and that is just expensive.
@Darksonn let me try to clarify. As I explained in the README.md, there are several branches, each with its own implementation:

1. `sync` uses blocking `std::fs`; it can be considered a baseline.
2. `async-seq` uses `tokio::fs` (incl. `tokio::fs::read_dir` and `tokio::fs::symlink_metadata`) and does processing sequentially (so option 1, but with `tokio::fs` and `.await` when necessary). This is 64x slower than option 1. Numbers are pretty much the same for both single- and multi-threaded runtimes. The number of context switches is huge for both runtimes too.
3. `async-par` also uses `tokio::fs`, but does as many things concurrently as possible by utilising `FuturesUnordered` and `select!`.

If there is a trivial change to my code that can make, say, the `async-seq` version perform at least within a 2x margin of the `sync` version, I'd be more than happy to learn what it is 🙂
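For context, the sequential traversal in the `async-seq` branch is roughly shaped like the following sketch (simplified and illustrative, not the exact code from the repo):

```rust
use std::io;
use std::path::PathBuf;

// Simplified sketch of a sequential tokio::fs-based walk: every directory
// entry is awaited one at a time, so each fs call round-trips through
// tokio's blocking thread pool.
async fn dir_size(root: PathBuf) -> io::Result<u64> {
    let mut total = 0u64;
    let mut stack = vec![root];
    while let Some(dir) = stack.pop() {
        let mut entries = tokio::fs::read_dir(&dir).await?;
        while let Some(entry) = entries.next_entry().await? {
            let meta = tokio::fs::symlink_metadata(entry.path()).await?;
            if meta.is_dir() {
                stack.push(entry.path());
            } else {
                total += meta.len();
            }
        }
    }
    Ok(total)
}
```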
I've done more testing on other platforms:
This means that the issue is either Linux specific (unlikely) or WSL2 specific (seems more likely). I don't have a native Linux box at hand to test this out right now.
I also tried different versions of `rustc` (1.49, 1.50, 1.51) but observed similar behaviour.
I tried running it on my laptop, which is a native Linux box, but `async-par` failed with "too many open files". Here are the others:
Benchmark #1: du -hs ~/src
Time (mean ± σ): 813.7 ms ± 21.5 ms [User: 249.6 ms, System: 557.6 ms]
Range (min … max): 785.1 ms … 853.3 ms 10 runs
Benchmark #2: builds/sync ~/src
Time (mean ± σ): 884.7 ms ± 8.9 ms [User: 239.9 ms, System: 638.6 ms]
Range (min … max): 871.0 ms … 896.5 ms 10 runs
Benchmark #3: builds/async-seq ~/src
Time (mean ± σ): 5.603 s ± 0.059 s [User: 2.810 s, System: 4.733 s]
Range (min … max): 5.537 s … 5.735 s 10 runs
These were all built with `--release`, of course.
Great, so `async-seq` is 6.3x slower, but not 64x, that's reassuring! 🙂

Could you try increasing the nofile limit and running `async-par` again (e.g. `ulimit -S -n 4096` may help)?
Sure.
Benchmark #1: builds/async-par ~/src
Time (mean ± σ): 4.462 s ± 1.566 s [User: 5.288 s, System: 7.233 s]
Range (min … max): 2.740 s … 7.184 s 10 runs
Thank you! `async-par` performs better, but not to the extent I hoped. Both async versions are quite slow (good that it's not 60x, but 6x is still a considerable slowdown).
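As an aside, a possible way to avoid the "too many open files" failure without raising the nofile limit is to bound the number of in-flight fs operations with a `tokio::sync::Semaphore`. A rough sketch (illustrative only, not what the `async-par` branch currently does):

```rust
use std::io;
use std::path::PathBuf;
use std::sync::Arc;
use tokio::sync::Semaphore;

// Bound the number of concurrent fs operations (and thus open descriptors)
// so a parallel walker stays under the process nofile limit.
async fn stat_many(paths: Vec<PathBuf>) -> io::Result<u64> {
    let limit = Arc::new(Semaphore::new(256)); // comfortably below a 1024 nofile limit
    let mut tasks = Vec::new();
    for path in paths {
        // Wait for a free slot before spawning another fs task.
        let permit = limit.clone().acquire_owned().await.expect("semaphore closed");
        tasks.push(tokio::spawn(async move {
            let meta = tokio::fs::symlink_metadata(&path).await;
            drop(permit); // release the slot as soon as the syscall is done
            meta.map(|m| m.len())
        }));
    }
    let mut total = 0;
    for task in tasks {
        total += task.await.expect("task panicked")?;
    }
    Ok(total)
}
```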
I'm tempted to set up native Linux on my PC over the weekend and run it on the same set of files on Windows, WSL2 and native Linux to get an apples-to-apples comparison.
My main opinion on issues like this one is that if someone submits a PR that improves the speed of filesystem operations, I am happy to add those improvements (#3518 is an example), but it is not a sufficiently large priority for me to spend time looking for fixes myself. People who need speedups for their fs ops can already get them now by moving the operation into a single `spawn_blocking` call.
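Concretely, that suggestion amounts to something like the following sketch (function names are illustrative; the blocking walk stands in for whatever the `sync` branch does):

```rust
use std::io;
use std::path::{Path, PathBuf};

// Plain blocking walk using std::fs, as in the sync baseline.
fn dir_size_blocking(root: &Path) -> io::Result<u64> {
    let mut total = 0;
    let mut stack = vec![root.to_path_buf()];
    while let Some(dir) = stack.pop() {
        for entry in std::fs::read_dir(&dir)? {
            let entry = entry?;
            let meta = entry.metadata()?;
            if meta.is_dir() {
                stack.push(entry.path());
            } else {
                total += meta.len();
            }
        }
    }
    Ok(total)
}

// From async code: hand the whole traversal to the blocking pool in one
// spawn_blocking call instead of awaiting each individual tokio::fs operation.
async fn dir_size_via_spawn_blocking(root: PathBuf) -> io::Result<u64> {
    tokio::task::spawn_blocking(move || dir_size_blocking(&root))
        .await
        .expect("blocking task panicked")
}
```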
@Darksonn that’s fair. To be clear, I don’t expect you to spend time diagnosing the issue and coming up with a fix; we all have different priorities and that’s fine.
The way I see it, these perf characteristics are surprising at the very least, so creating an issue is like putting a stick in the ground to say “We’re aware of this”, and then maybe someone will pick it up later.
However, I can also see that these types of issues may be seen as not directly actionable, which is totally fair. If this is the case for the tokio project, I’d be fine with closing the issue.

And in any case, I apologise for the inconvenience if I missed something about this in the guidelines.
Updated benchmarks https://github.com/artempyanykh/rdu:
On Windows the perf profile is very different from Linux; the naive async version is ~2.2x slower, which is kind of acceptable. On native Linux with a warm disk cache the naive async version is 9x slower, and on WSL2 it's 55x slower.
This is referred to in the talk "Java and Rust" by Yishai Galatzer. They used Tokio async fs operations (in a benchmark) and compared that with Java NIO. IMO, it unfairly pitches Rust as being too slow compared to Java, which is of course not really true.
Version: 1.4.0
Platform: 64-bit WSL2 Linux: Linux 4.19.104-microsoft-standard #1 SMP x86_64 x86_64 x86_64 GNU/Linux
Description: The code is in this repo. The setup is explained in the README.
TL;DR:
- I implemented a simplified analog of `du -hs` with blocking and async APIs.
- The blocking version based on `std::fs` is about 35% slower than `du`, not bad.
- The async version that uses `tokio::fs` but processes files sequentially is 64x (!) slower than the blocking version.
- The async version that uses `FuturesUnordered` and `select!` to process files concurrently is 2.5x faster than the sequential version, but still 25x slower than a simple blocking version.

I understand that `tokio::fs` uses `std::fs` under the hood, that there's no non-blocking system API for the FS (modulo io-uring, but 🤷‍♂️), and that `async` has inherent overhead, especially if the disk cache is hot and there's not much waiting on blocking calls.

However, 25x (not saying 64x) just feels like too extreme a slowdown, so I wonder if the `tokio::fs` code needs tuning/optimization,