tokio-rs / tokio

A runtime for writing reliable asynchronous applications with Rust. Provides I/O, networking, scheduling, timers, ...
https://tokio.rs
MIT License

`tokio::fs` + async is 1-2 orders of magnitude slower than a blocking version #3664

Open artempyanykh opened 3 years ago

artempyanykh commented 3 years ago

Version 1.4.0

Platform 64-bit WSL2 Linux: Linux 4.19.104-microsoft-standard #1 SMP x86_64 x86_64 x86_64 GNU/Linux

Description The code is in this repo. The setup is explained in the README.

TL;DR:

I understand that tokio::fs uses std::fs under the hood, that there's no non-blocking system API for filesystem operations (modulo io-uring, but 🤷‍♂️), and that async has inherent overhead, especially when the disk cache is hot and there's little actual waiting on blocking calls.

However, 25x (not even counting the 64x case) feels like too extreme a slowdown, so I wonder

  1. if this is actually expected,
  2. or some tokio::fs code needs tuning/optimization,
  3. or something else entirely (wrong setup?)
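
My mental model of what every tokio::fs call roughly does is the sketch below (a simplified model of the general pattern, not Tokio's actual internals; the function name is illustrative). Each call pays a round trip to the blocking thread pool, which is where the per-call overhead comes from:

```rust
use std::path::{Path, PathBuf};

// Simplified model: ship the corresponding std::fs call to the blocking
// thread pool and await the result. Every metadata/read_dir call pays
// this thread-pool round trip.
async fn symlink_metadata_like(path: &Path) -> std::io::Result<std::fs::Metadata> {
    let path: PathBuf = path.to_owned();
    tokio::task::spawn_blocking(move || std::fs::symlink_metadata(path))
        .await
        .expect("blocking task panicked")
}
```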
Darksonn commented 3 years ago

I mean, we already know that it is never going to be as fast as using the blocking APIs directly. Did you try with a non-blocking std::fs::read_dir?

artempyanykh commented 3 years ago

@Darksonn

Did you try with a non-blocking std::fs::read_dir?

Sorry, not sure what you mean by non-blocking std::fs::read_dir. std::fs only provides a blocking API.

My setup is described here in detail, with code and perf data.

I mean, we already know that it is never going to be as fast as using the blocking APIs directly.

There are several things at play here. First, there's overhead from async, then from the tokio::fs wrappers, but there's also a speed-up from processing files in parallel in the case of the async-par implementation.

In any case, a 25x to 64x slow-down from going to tokio::fs + async compared to a blocking version is pretty extreme, isn't it? We're talking about the difference between 200ms (feels instant) and 12s (feels like an eternity).

Darksonn commented 3 years ago

What I meant to suggest was to replace std::fs::read_dir in the linked code with tokio::fs::read_dir.

It is a big slowdown, and there have been several examples of people building really slow benchmarks and finding some trivial change to their code that yields a massive speedup, but those were all for reading the contents of the files. I think ultimately you are just running into a lot of back-and-forth between a bunch of threads, and that is just expensive.

artempyanykh commented 3 years ago

@Darksonn let me try to clarify. As I explained in the README.md, there are several branches, each with its own implementation:

  1. sync branch uses std::fs; it can be considered a baseline,
  2. async-seq branch uses tokio::fs (incl. tokio::fs::read_dir and tokio::fs::symlink_metadata) and does processing sequentially (so option 1, but with tokio::fs and .await where necessary; see the sketch at the end of this comment). This is 64x slower than option 1. Numbers are pretty much the same for both single- and multi-threaded runtimes. The number of context switches is huge for both runtimes too.
  3. async-par branch uses tokio::fs, but also does as many things concurrently as possible by utilising FuturesUnordered and select!.

If there is a trivial change to my code that can make, say, the async-seq version perform at least within a 2x margin of the sync version, I'd be more than happy to learn what it is 🙂
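
For reference, here's roughly what the async-seq approach looks like (a simplified sketch, not the exact code from the repo):

```rust
use std::path::PathBuf;

// Sequential async directory walk in the spirit of the async-seq branch.
// Every read_dir/next_entry/symlink_metadata call awaits a blocking-pool
// round trip, which is where the context switches pile up.
async fn dir_size(root: PathBuf) -> std::io::Result<u64> {
    let mut total = 0u64;
    // Explicit stack instead of recursion, since async recursion needs boxing.
    let mut stack = vec![root];
    while let Some(dir) = stack.pop() {
        let mut entries = tokio::fs::read_dir(&dir).await?;
        while let Some(entry) = entries.next_entry().await? {
            // symlink_metadata does not follow symlinks, like `du`'s default.
            let meta = tokio::fs::symlink_metadata(entry.path()).await?;
            if meta.is_dir() {
                stack.push(entry.path());
            } else {
                total += meta.len();
            }
        }
    }
    Ok(total)
}
```

The async-par branch replaces these sequential awaits with many futures in flight via FuturesUnordered, which is also how it can run into the nofile limit.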

artempyanykh commented 3 years ago

I've done more testing on other platforms:

This means that the issue is either Linux-specific (unlikely) or WSL2-specific (seems more likely). I don't have a native Linux box at hand to test this out right now.

I also tried different versions of rustc (1.49, 1.50, 1.51) but observed similar behaviour.

Darksonn commented 3 years ago

I tried running it on my laptop which is a native Linux box, but async-par failed with "too many open files". Here are the others:

Benchmark #1: du -hs ~/src
  Time (mean ± σ):     813.7 ms ±  21.5 ms    [User: 249.6 ms, System: 557.6 ms]
  Range (min … max):   785.1 ms … 853.3 ms    10 runs

Benchmark #2: builds/sync ~/src
  Time (mean ± σ):     884.7 ms ±   8.9 ms    [User: 239.9 ms, System: 638.6 ms]
  Range (min … max):   871.0 ms … 896.5 ms    10 runs

Benchmark #3: builds/async-seq ~/src
  Time (mean ± σ):      5.603 s ±  0.059 s    [User: 2.810 s, System: 4.733 s]
  Range (min … max):    5.537 s …  5.735 s    10 runs

These were all built with --release, of course.

artempyanykh commented 3 years ago

Great, so async-seq is 6.3x slower, but not 64x, that's reassuring! 🙂

Could you try to increase the nofile limit and try async-par again (e.g. ulimit -S -n 4096 may help)?

Darksonn commented 3 years ago

Sure.

Benchmark #1: builds/async-par ~/src
  Time (mean ± σ):      4.462 s ±  1.566 s    [User: 5.288 s, System: 7.233 s]
  Range (min … max):    2.740 s …  7.184 s    10 runs

artempyanykh commented 3 years ago

Thank you! async-par performs better, but not to the extent I hoped. Both async versions are quite slow (good that it's not 60x, but 6x is still a considerable slowdown). I'm tempted to set up native Linux on my PC over the weekend and run it on the same set of files on Windows, WSL2, and native Linux to have an apples-to-apples comparison.

Darksonn commented 3 years ago

My main opinion on issues like this one is that if someone submits a PR that improves the speed of filesystem operations, I am happy to add those improvements (#3518 is an example), but it is not a sufficiently large priority for me to spend time looking for fixes myself. People who need speedups for their fs ops can already get them by moving the whole operation into a single spawn_blocking call.
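
A minimal sketch of that pattern (assuming a du-style walk; `walk` and `dir_size` are illustrative names, only spawn_blocking itself is Tokio API):

```rust
use std::path::PathBuf;

// Do the entire blocking traversal inside one spawn_blocking call instead
// of paying a thread-pool round trip per tokio::fs operation.
fn walk(dir: PathBuf) -> std::io::Result<u64> {
    let mut total = 0;
    for entry in std::fs::read_dir(dir)? {
        let entry = entry?;
        let meta = entry.path().symlink_metadata()?;
        if meta.is_dir() {
            total += walk(entry.path())?;
        } else {
            total += meta.len();
        }
    }
    Ok(total)
}

async fn dir_size(root: PathBuf) -> std::io::Result<u64> {
    // One hop to the blocking pool for the whole walk; tokio converts the
    // JoinError into std::io::Error via its From impl, so `?` works here.
    tokio::task::spawn_blocking(move || walk(root)).await?
}
```

This trades per-call round trips for a single handoff, at the cost of occupying one blocking-pool thread for the duration of the walk.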

artempyanykh commented 3 years ago

@Darksonn that’s fair. To be clear, I don’t expect you to spend time diagnosing the issue and coming up with a fix; we all have different priorities and that’s fine.

The way I see it, these perf characteristics are surprising at the very least, so creating an issue is like putting a stake in the ground to say “we’re aware of this”, and then maybe

  1. there will be an improvement PR: either someone stumbles upon this issue and comes up with an improvement, or I will dig deeper when I have spare time,
  2. or we will confirm that for this sort of workload the perf hit is just inherent and there's nothing fishy going on. In that case the good outcome would probably be a section in the docs, so that new users are at least aware of it.

However, I can also see that these types of issues may be seen as not directly actionable, which is totally fair. If that is the case for the tokio project, I'd be fine with closing the issue.

And in any case, I apologise for the inconvenience if I missed something about this in the guidelines.

artempyanykh commented 3 years ago

Updated benchmarks https://github.com/artempyanykh/rdu:

On Windows the perf profile is very different from Linux: the naive async version is ~2.2x slower, which is kind of acceptable. On native Linux with a warm disk cache the naive async version is 9x slower, and on WSL2 it's 55x slower.

ahmedriza commented 1 year ago

This is referred to in the talk "Java and Rust" by Yishai Galatzer. They used Tokio async fs operations in a benchmark and compared that with Java NIO.

IMO, it unfairly paints Rust as being too slow compared to Java, which is of course not really true.