HadrienG2 opened this issue 10 months ago
Thanks for the report!
Huh, that's interesting -- how big are stdout and stderr for your tests? The default Linux x86_64 pipe size is 65536 bytes. Regardless, 700ms+ just in write and read is a lot and is definitely worth fixing.
To be honest, I can't reproduce this. On a Ryzen 7950X with Linux kernel 6.5.6 and Rust 1.73.0, with this test that writes roughly 500 MiB, I get:
* `cargo test`: 0.903 seconds
* `cargo nextest run`: 0.551 seconds

Replacing `eprintln!` with `std::io::stderr().write_all()` (which isn't intercepted by libtest by default) doesn't change things either -- `cargo nextest run` takes the same amount of time, while `cargo test 2>test-file` takes 0.740 seconds. In both cases, `cargo nextest run` is faster than `cargo test`.
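A write-heavy test along these lines can be sketched as follows. The thread's actual test isn't shown, so the chunk size and total here are assumptions, scaled down from the ~500 MiB used for the measurements:

```rust
use std::io::Write;

/// Stream `total` bytes to `out` in `chunk`-sized writes, returning the
/// number of bytes written. Many smallish writes like this are what a
/// chatty test (or repeated backtrace printing) sends down the capture pipe.
fn write_chunks(out: &mut impl Write, total: usize, chunk: usize) -> usize {
    let buf = vec![b'x'; chunk];
    let mut written = 0;
    while written < total {
        out.write_all(&buf).expect("write failed");
        written += chunk;
    }
    written
}

fn main() {
    // 1 MiB in 64 KiB chunks here; the measurement above used ~500 MiB.
    let written = write_chunks(&mut std::io::stderr().lock(), 1 << 20, 64 << 10);
    assert_eq!(written, 1 << 20);
}
```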
I do believe the flamegraph, I'm just trying to figure out what might be going on.
I do see from `perf` that nextest is spending most of its time reading from the pipe, that's true.
I also tried applying this patch (Linux only) to make pipe buffers bigger, and it didn't appear to make things faster or slower -- I suspect that's not the limit.
This post suggests that tweaking buffer sizes etc. will not make a difference, and that the main way to make things faster than `read` and `write` is to use `vmsplice` -- which wouldn't seem to help, because `vmsplice` is equivalent to `read` when transferring data from a pipe to user memory.
Unless there are some easy wins elsewhere I don't think we can make this better unfortunately.
Does nextest use a single-threaded tokio runtime, and does it spawn the wait tasks so that they can run in parallel?
Nextest uses a multithreaded runtime, and each test is its own task. However, reading from stdout and stderr is not split into separate tasks. Splitting those reads up might be beneficial if a test writes a lot to both stdout and stderr.
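The idea of draining both pipes concurrently can be illustrated with plain threads (nextest itself uses tokio tasks; the `sh` child here is an arbitrary stand-in and assumes a Unix shell):

```rust
use std::io::Read;
use std::process::{Command, Stdio};
use std::thread;

fn main() {
    // Hypothetical child that writes to both streams; a real runner
    // would spawn the test binary instead.
    let mut child = Command::new("sh")
        .args(["-c", "echo out; echo err 1>&2"])
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()
        .expect("failed to spawn child");

    let mut out_pipe = child.stdout.take().unwrap();
    let mut err_pipe = child.stderr.take().unwrap();

    // One reader per pipe: neither stream can fill its 64 KiB pipe
    // buffer and stall the child while the other is being drained.
    let out_thread = thread::spawn(move || {
        let mut buf = Vec::new();
        out_pipe.read_to_end(&mut buf).expect("read stdout failed");
        buf
    });
    let err_thread = thread::spawn(move || {
        let mut buf = Vec::new();
        err_pipe.read_to_end(&mut buf).expect("read stderr failed");
        buf
    });

    let stdout = out_thread.join().unwrap();
    let stderr = err_thread.join().unwrap();
    child.wait().expect("child did not exit");
    assert_eq!(stdout, b"out\n");
    assert_eq!(stderr, b"err\n");
}
```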
First things first: for a reproducer you can...
...then run the following commands:
```sh
# Only observable with backtraces, which greatly increase the stdout/stderr output
export RUST_BACKTRACE=1
# Takes 2.3s on an AMD R5 5600G
cargo test ffi::int::tests -- --test-threads=1
# Takes 18.3s on the same CPU as above
cargo nextest run -j1 -E 'test(/ffi::int::tests/)'
```
I'm using only 1 test thread here because it makes timings more reproducible and reduces the perf workload during profiling (which enables going to higher sampling rates).
To be clear, the slowdown from enabling backtraces is already massive with cargo test (the same test run takes 0.3s without them), and although it has implications that I don't like when running tests over multiple threads, replacing the standard panic hook with a no-op while running those tests is an acceptable workaround.
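The no-op panic hook workaround mentioned above amounts to just a couple of lines (this is a sketch of the general technique, not the author's exact code):

```rust
use std::panic;

fn main() {
    // Install a hook that prints nothing, so panicking tests stop
    // flooding stderr with panic messages and backtraces.
    let default_hook = panic::take_hook();
    panic::set_hook(Box::new(|_info| {}));

    // A stand-in for a test that asserts a panic occurs.
    let result = panic::catch_unwind(|| {
        let x: Option<u32> = None;
        x.unwrap(); // panics, but the hook prints nothing
    });
    assert!(result.is_err());

    // Restore the default hook afterwards.
    panic::set_hook(default_hook);
}
```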
What I find suspicious here is that the slowdown is almost an order of magnitude higher with nextest than with cargo test, even though in this particular scenario neither of them should be doing anything with the test output, they should be just collecting it in some buffer in order to display it in case of failure. Somehow, cargo test must use a different test output collection strategy that's a lot more efficient.
Trying to investigate, a `perf` profile of the `cargo test` run (which unfortunately has much lower-quality data than my former profile, probably because I'm missing some debug info I can't identify on this machine) shows that `cargo` has almost no CPU activity during this test run.
I wonder how they manage to avoid the pipe overhead, assuming that's the only issue here. Perhaps they redirect the test output to some tempfile instead of a pipe so the kernel can handle all the data transfers without needing round trips to user space? Or perhaps they manage to have libtest do the buffering in-process somehow, and only print the output when a failure occurs?
To answer your first question, `cargo test ffi::int::tests:: -- --test-threads=1 --nocapture 2>&1 | wc -c` tells me that all these tests combined produce a total of 173 MiB of output over stdout + stderr. Yikes, I know. But this is testing an integer newtype, and unfortunately integers use panics as a failure mode in a lot of different operations...
Also, knowing that modern RAM can transfer data at tens of GB/s on consumer hardware does put the ~10 MiB/s data transfer rate implied by taking 18 s to fill up a ~180 MB buffer (173 MiB / 18.3 s ≈ 9.5 MiB/s) in a less favorable light.
> To be honest, I can't reproduce this. [...] In both cases, `cargo nextest run` is faster than `cargo test`. I do believe the flamegraph, I'm just trying to figure out what might be going on.
Perhaps the difference is that the std::backtrace printer in the panic hook does finer-grained writes? Because stderr is unbuffered, its performance can vary greatly depending on write granularity.
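The effect of write granularity is easy to see with a buffered wrapper -- a sketch comparing many small writes against the same bytes going through a `BufWriter` (the file path and byte counts are arbitrary):

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

fn main() {
    let path = std::env::temp_dir().join("granularity-demo.txt");

    // Unbuffered: every 10-byte write! hits the file descriptor directly,
    // just like fine-grained panic-hook writes hit unbuffered stderr.
    let mut raw = File::create(&path).expect("create failed");
    for i in 0..10_000 {
        write!(raw, "{:>10}", i).expect("write failed");
    }

    // Buffered: the same writes are coalesced into large chunks,
    // drastically reducing the number of underlying write syscalls.
    let mut buffered = BufWriter::new(File::create(&path).expect("create failed"));
    for i in 0..10_000 {
        write!(buffered, "{:>10}", i).expect("write failed");
    }
    buffered.flush().expect("flush failed");

    // Both variants produce identical bytes; only the syscall count differs.
    let len = std::fs::metadata(&path).expect("stat failed").len();
    assert_eq!(len, 100_000); // 10_000 writes x 10 bytes each
    std::fs::remove_file(&path).ok();
}
```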
> Somehow, cargo test must use a different test output collection strategy that's a lot more efficient.
`cargo test` does its output collection in-process, without using OS pipes. That might explain the difference.
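The in-process strategy can be approximated in plain std: route panic output into a shared buffer via a custom hook, and only surface it if the test fails. (libtest's actual mechanism relies on unstable internals, so this is only an illustration of the idea.)

```rust
use std::panic;
use std::sync::{Arc, Mutex};

fn main() {
    // Shared in-process buffer standing in for libtest's capture buffer.
    let captured = Arc::new(Mutex::new(String::new()));

    let sink = Arc::clone(&captured);
    panic::set_hook(Box::new(move |info| {
        // Append to the buffer instead of writing to stderr:
        // no pipe traffic unless we later decide to print it.
        sink.lock().unwrap().push_str(&info.to_string());
    }));

    let result = panic::catch_unwind(|| panic!("invalid input rejected"));
    let _ = panic::take_hook(); // drop our hook again

    // Only surface the captured output when the "test" actually failed.
    if result.is_err() {
        let output = captured.lock().unwrap();
        assert!(output.contains("invalid input rejected"));
    }
}
```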
Your point about stderr being unbuffered is well-taken (and is easy to verify by looking at https://doc.rust-lang.org/src/std/io/stdio.rs.html -- `Stdout` has a `LineWriter` in it, while `Stderr` does not). Maybe we need to add some buffering to std::backtrace in upstream Rust.
> cargo test does its output collection in-process, without using OS pipes. That might explain the difference.
Sounds pretty neat, is there a way nextest could leverage the same infrastructure? Or if not, is there a way the infrastructure could be modified to make it happen?
> Maybe we need to add some buffering to std::backtrace in upstream Rust.
This could also be a fallback plan if the above, more general optimization can't be made to work. I'm wondering whether this would fit the design constraints of std, though: the other day I saw a rust-lang issue proposing to turn OOM errors into panics, and if that's the plan, you probably don't want to allocate when printing the OOM backtrace...
Of course, this is assuming all other allocations can be removed from the backtrace computation, which is perhaps a little optimistic. And maybe a statically allocated or stack-allocated buffer could be made to work?
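A stack-allocated, truncating buffer is workable in principle -- a minimal `fmt::Write` sketch (the names, sizes, and truncation policy here are made up for illustration):

```rust
use std::fmt::{self, Write};

/// A fixed-capacity writer backed by caller-provided storage: no heap
/// allocation, and writes past capacity are silently truncated instead of
/// failing, which is what you'd want in an OOM panic path.
/// ASCII-only demo; real code must truncate at a UTF-8 char boundary.
struct FixedBuf<'a> {
    buf: &'a mut [u8],
    len: usize,
}

impl fmt::Write for FixedBuf<'_> {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        let room = self.buf.len() - self.len;
        let n = s.len().min(room);
        self.buf[self.len..self.len + n].copy_from_slice(&s.as_bytes()[..n]);
        self.len += n;
        Ok(())
    }
}

fn main() {
    let mut storage = [0u8; 16]; // stack storage; could be a static in real use
    let mut out = FixedBuf { buf: &mut storage, len: 0 };
    write!(out, "frame #{}: {}", 0, "a_very_long_symbol_name").unwrap();
    // The overlong symbol name was truncated to fit the 16-byte buffer.
    assert_eq!(std::str::from_utf8(&storage).unwrap(), "frame #0: a_very");
}
```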
> Sounds pretty neat, is there a way nextest could leverage the same infrastructure?
Hmm, so a thing I realized is that nextest currently always passes in `--nocapture`, whether or not it is interested in output when the test succeeds, even though most of the time users aren't interested in success output. A thing nextest could do is to not pass in `--nocapture` when it isn't interested in a test's output on success.
While writing this, I sadly realized that `--nocapture` is pretty load-bearing for tests that time out -- we want to show whatever output the test produced by the time it was killed, which means instructing libtest not to capture any output.
I think the only way to resolve this is to add buffering upstream (though maybe that's out of the question). There's nothing that can be done on nextest's side as far as I can tell.
How would you feel about an upstream feature that dumps the stdio buffer when a certain signal (say, SIGUSR1) is received? I think it is the minimal upstream change that would preserve optimal I/O performance in the common case, while allowing for the buffer to be dumped in case of timeout.
What if a test sets a SIGUSR1 handler for itself? I don't know how common that is, but it is a thing that's possible.
To be honest, I still don't see how the effort would be worth it. This seems to be an edge case that is addressed by setting your own panic handler. We can document this, and maybe even detect it if we're getting a ton of small reads.
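Detecting "a ton of small reads" could be done with a counting wrapper around the pipe reader -- a hypothetical sketch (the names and thresholds are made up, not nextest code):

```rust
use std::io::{self, Read};

/// Wraps a reader and tracks how many reads happened and how many bytes
/// they returned, so a runner could warn when the average read is tiny.
struct CountingReader<R> {
    inner: R,
    reads: u64,
    bytes: u64,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.reads += 1;
        self.bytes += n as u64;
        Ok(n)
    }
}

impl<R> CountingReader<R> {
    fn new(inner: R) -> Self {
        CountingReader { inner, reads: 0, bytes: 0 }
    }

    /// Hypothetical heuristic: many reads that each return little data.
    fn looks_chatty(&self) -> bool {
        self.reads >= 100 && self.bytes / self.reads < 128
    }
}

fn main() {
    // Simulate a chatty producer: 200 reads of 16 bytes each.
    let data = vec![b'x'; 200 * 16];
    let mut reader = CountingReader::new(&data[..]);
    let mut tiny = [0u8; 16];
    while reader.read(&mut tiny).unwrap() > 0 {}
    assert!(reader.looks_chatty());
}
```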
I would obviously be fine with a "wontfix, please silence your panics" resolution, since I already have a panic handler around that does this ;)
I have some tests around which assert that panics occur when invalid inputs are sent to certain functions. These tests run several times slower under nextest than under cargo test when `RUST_BACKTRACE=1`, and a quick perf profile suggests that they spend most of their time waiting for nextest to read out the data that they have sent down the stdout/stderr pipe.

The most likely reason why these tests are the only ones affected is that the default Rust panic handler sends lots of text down the pipe when backtraces are enabled, so these are probably my most chatty tests by a large margin.
In this particular situation, I can sidestep the problem entirely by replacing the Rust panic handler with a no-op during execution of these tests, which eliminates the constant stdio traffic from backtrace printing. But it might be worth investigating whether nextest can handle heavy stdio traffic more efficiently, e.g. by using larger pipe buffer sizes.