tokio-rs / tokio

A runtime for writing reliable asynchronous applications with Rust. Provides I/O, networking, scheduling, timers, ...
https://tokio.rs
MIT License

`tokio::fs` + async is 1-2 orders of magnitude slower than a blocking version #3664

Open artempyanykh opened 3 years ago

artempyanykh commented 3 years ago

Version 1.4.0

Platform 64-bit WSL2 Linux: Linux 4.19.104-microsoft-standard #1 SMP x86_64 x86_64 x86_64 GNU/Linux

Description The code is in this repo. The setup is explained in the README.

TL;DR:

I understand that tokio::fs uses std::fs under the hood, that there's no non-blocking system API for filesystem operations (modulo io-uring, but 🤷‍♂️), and that async has inherent overhead, especially when the disk cache is hot and there's little actual waiting on blocking calls.

However, 25x (not even counting the 64x case) feels like too extreme a slowdown, so I wonder

  1. if this is actually expected,
  2. or some tokio::fs code needs tuning/optimization,
  3. or something else entirely (wrong setup?)
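
My mental model of what every tokio::fs call roughly does is the sketch below (a simplified model of the general pattern, not Tokio's actual internals; the function name is illustrative). Each call pays a round trip to the blocking thread pool, which is where the per-call overhead comes from:

```rust
use std::path::{Path, PathBuf};

// Simplified model: ship the corresponding std::fs call to the blocking
// thread pool and await the result. Every metadata/read_dir call pays
// this thread-pool round trip.
async fn symlink_metadata_like(path: &Path) -> std::io::Result<std::fs::Metadata> {
    let path: PathBuf = path.to_owned();
    tokio::task::spawn_blocking(move || std::fs::symlink_metadata(path))
        .await
        .expect("blocking task panicked")
}
```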
Darksonn commented 3 years ago

I mean, we already know that it is never going to be as fast as using the blocking APIs directly. Did you try with a non-blocking std::fs::read_dir?

artempyanykh commented 3 years ago

@Darksonn

Did you try with a non-blocking std::fs::read_dir?

Sorry, not sure what you mean by non-blocking std::fs::read_dir. std::fs only provides a blocking API.

My setup is described here in detail, with code and perf data.

I mean, we already know that it is never going to be as fast as using the blocking APIs directly.

There are several things at play here. First, there's overhead from async, then from the tokio::fs wrappers, but there's also a speed-up from processing files in parallel in the case of the async-par implementation.

In any case, a 25x to 64x slow-down from going to tokio::fs + async compared to a blocking version is pretty extreme, isn't it? We're talking about the difference between 200ms (feels instant) and 12s (feels like an eternity).

Darksonn commented 3 years ago

What I meant to suggest was to replace std::fs::read_dir in the linked code with tokio::fs::read_dir.

It is a big slowdown, and there have been several examples of people building really slow benchmarks and finding some trivial change to their code that yields a massive speedup, but those were all for reading the contents of the files. I think ultimately you are just running into a lot of back-and-forth between a bunch of threads, and that is just expensive.

artempyanykh commented 3 years ago

@Darksonn let me try to clarify. As I explained in the README.md, there are several branches, each with its own implementation:

  1. sync branch uses std::fs; it can be considered a baseline,
  2. async-seq branch uses tokio::fs (incl. tokio::fs::read_dir and tokio::fs::symlink_metadata) and does processing sequentially (so option 1, but with tokio::fs and .await where necessary; see the sketch at the end of this comment). This is 64x slower than option 1. Numbers are pretty much the same for both single- and multi-threaded runtimes. The number of context switches is huge for both runtimes too.
  3. async-par branch uses tokio::fs, but also does as many things concurrently as possible by utilising FuturesUnordered and select!.

If there is a trivial change to my code that can make, say, the async-seq version perform at least within a 2x margin of the sync version, I'd be more than happy to learn what it is 🙂
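
For reference, here's roughly what the async-seq approach looks like (a simplified sketch, not the exact code from the repo):

```rust
use std::path::PathBuf;

// Sequential async directory walk in the spirit of the async-seq branch.
// Every read_dir/next_entry/symlink_metadata call awaits a blocking-pool
// round trip, which is where the context switches pile up.
async fn dir_size(root: PathBuf) -> std::io::Result<u64> {
    let mut total = 0u64;
    // Explicit stack instead of recursion, since async recursion needs boxing.
    let mut stack = vec![root];
    while let Some(dir) = stack.pop() {
        let mut entries = tokio::fs::read_dir(&dir).await?;
        while let Some(entry) = entries.next_entry().await? {
            // symlink_metadata does not follow symlinks, like `du`'s default.
            let meta = tokio::fs::symlink_metadata(entry.path()).await?;
            if meta.is_dir() {
                stack.push(entry.path());
            } else {
                total += meta.len();
            }
        }
    }
    Ok(total)
}
```

The async-par branch replaces these sequential awaits with many futures in flight via FuturesUnordered, which is also how it can run into the nofile limit.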

artempyanykh commented 3 years ago

I've done more testing on other platforms:

This means that the issue is either Linux-specific (unlikely) or WSL2-specific (seems more likely). I don't have a native Linux box at hand to test this out right now.

I also tried different versions of rustc (1.49, 1.50, 1.51) but observed similar behaviour.

Darksonn commented 3 years ago

I tried running it on my laptop which is a native Linux box, but async-par failed with "too many open files". Here are the others:

Benchmark #1: du -hs ~/src
  Time (mean ± σ):     813.7 ms ±  21.5 ms    [User: 249.6 ms, System: 557.6 ms]
  Range (min … max):   785.1 ms … 853.3 ms    10 runs

Benchmark #2: builds/sync ~/src
  Time (mean ± σ):     884.7 ms ±   8.9 ms    [User: 239.9 ms, System: 638.6 ms]
  Range (min … max):   871.0 ms … 896.5 ms    10 runs

Benchmark #3: builds/async-seq ~/src
  Time (mean ± σ):      5.603 s ±  0.059 s    [User: 2.810 s, System: 4.733 s]
  Range (min … max):    5.537 s …  5.735 s    10 runs

These were all built with --release, of course.

artempyanykh commented 3 years ago

Great, so async-seq is 6.3x slower, but not 64x, that's reassuring! 🙂

Could you try to increase the nofile limit and try async-par again (e.g. ulimit -S -n 4096 may help)?

Darksonn commented 3 years ago

Sure.

Benchmark #1: builds/async-par ~/src
  Time (mean ± σ):      4.462 s ±  1.566 s    [User: 5.288 s, System: 7.233 s]
  Range (min … max):    2.740 s …  7.184 s    10 runs

artempyanykh commented 3 years ago

Thank you! async-par performs better, but not to the extent I hoped. Both async versions are quite slow (good that it's not 60x, but 6x is still a considerable slowdown). I'm tempted to set up native Linux on my PC over the weekend and run it on the same set of files on Windows, WSL2, and native Linux to have an apples-to-apples comparison.

Darksonn commented 3 years ago

My main opinion on issues like this one is that if someone submits a PR that improves the speed of filesystem operations, I am happy to add those improvements (#3518 is an example), but it is not a sufficiently large priority for me to spend time looking for fixes myself. People who need speedups for their fs ops can already get them by moving the whole operation into a single spawn_blocking call.
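
A minimal sketch of that pattern (assuming a du-style walk; `walk` and `dir_size` are illustrative names, only spawn_blocking itself is Tokio API):

```rust
use std::path::PathBuf;

// Do the entire blocking traversal inside one spawn_blocking call instead
// of paying a thread-pool round trip per tokio::fs operation.
fn walk(dir: PathBuf) -> std::io::Result<u64> {
    let mut total = 0;
    for entry in std::fs::read_dir(dir)? {
        let entry = entry?;
        let meta = entry.path().symlink_metadata()?;
        if meta.is_dir() {
            total += walk(entry.path())?;
        } else {
            total += meta.len();
        }
    }
    Ok(total)
}

async fn dir_size(root: PathBuf) -> std::io::Result<u64> {
    // One hop to the blocking pool for the whole walk; tokio converts the
    // JoinError into std::io::Error via its From impl, so `?` works here.
    tokio::task::spawn_blocking(move || walk(root)).await?
}
```

This trades per-call round trips for a single handoff, at the cost of occupying one blocking-pool thread for the duration of the walk.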

artempyanykh commented 3 years ago

@Darksonn that’s fair. To be clear, I don’t expect you to spend time diagnosing the issue and coming up with a fix; we all have different priorities and that’s fine.

The way I see it, these perf characteristics are surprising at the very least, so creating an issue is like putting a stake in the ground to say “we’re aware of this”, and then maybe

  1. there will be an improvement PR: either someone stumbles upon this issue and comes up with an improvement, or I will dig deeper when I have spare time,
  2. or we will confirm that for this sort of workload the perf hit is just inherent and there's nothing fishy going on. In that case the good outcome would probably be a section in the docs, so that new users are at least aware of it.

However, I can also see that these types of issues may be seen as not directly actionable, which is totally fair. If that is the case for the tokio project, I'd be fine with closing the issue.

And in any case, I apologise for the inconvenience if I missed something about this in the guidelines.

artempyanykh commented 3 years ago

Updated benchmarks https://github.com/artempyanykh/rdu:

On Windows the perf profile is very different from Linux: the naive async version is ~2.2x slower, which is kind of acceptable. On native Linux with a warm disk cache the naive async version is 9x slower, and on WSL2 it's 55x slower.

ahmedriza commented 1 year ago

This is referred to in the talk "Java and Rust" by Yishai Galatzer. They used Tokio async fs operations in a benchmark and compared that with Java NIO.

IMO, it unfairly paints Rust as being too slow compared to Java, which is of course not really true.