HadrienG2 opened this issue 10 months ago
Thanks for the report!
Huh, that's interesting -- how big are stdout and stderr for your tests? The default Linux x86_64 pipe size is 65536 bytes. Regardless, 700ms+ just in write and read is a lot and is definitely worth fixing.
To be honest, I can't reproduce this. On a Ryzen 7950X with Linux kernel 6.5.6 and Rust 1.73.0, with this test that writes roughly 500 MiB, I get:
* `cargo test`: 0.903 seconds
* `cargo nextest run`: 0.551 seconds

Replacing `eprintln!` with `std::io::stderr().write_all()` (which isn't intercepted by libtest by default) doesn't change things either -- `cargo nextest run` takes the same amount of time, while `cargo test 2>test-file` takes 0.740 seconds. In both cases, `cargo nextest run` is faster than `cargo test`.
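A write-heavy test along these lines can be sketched as follows. The thread's actual test isn't shown, so the chunk size and total here are assumptions, scaled down from the ~500 MiB used for the measurements:

```rust
use std::io::Write;

/// Stream `total` bytes to `out` in `chunk`-sized writes, returning the
/// number of bytes written. Many smallish writes like this are what a
/// chatty test (or repeated backtrace printing) sends down the capture pipe.
fn write_chunks(out: &mut impl Write, total: usize, chunk: usize) -> usize {
    let buf = vec![b'x'; chunk];
    let mut written = 0;
    while written < total {
        out.write_all(&buf).expect("write failed");
        written += chunk;
    }
    written
}

fn main() {
    // 1 MiB in 64 KiB chunks here; the measurement above used ~500 MiB.
    let written = write_chunks(&mut std::io::stderr().lock(), 1 << 20, 64 << 10);
    assert_eq!(written, 1 << 20);
}
```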
I do believe the flamegraph, I'm just trying to figure out what might be going on.
I do see from `perf` that nextest is spending most of its time reading from the pipe, that's true.
I also tried applying this patch (Linux only) to make pipe buffers bigger, and it didn't appear to make things faster or slower -- I suspect that's not the limit.
This post suggests that tweaking buffer sizes etc. will not make a difference, and that the main way to make things faster than `read` and `write` is to use `vmsplice` -- which wouldn't seem to help, because `vmsplice` is equivalent to `read` when transferring data from a pipe to user memory.
Unless there are some easy wins elsewhere I don't think we can make this better unfortunately.
Does nextest use a single-threaded tokio runtime, and does it spawn the wait tasks so that they can run in parallel?
Nextest uses a multithreaded runtime, and each test is its own task. However, reading from stdout and stderr is not split into separate tasks. Splitting those reads up might be beneficial if a test writes a lot to both stdout and stderr.
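The idea of draining both pipes concurrently can be illustrated with plain threads (nextest itself uses tokio tasks; the `sh` child here is an arbitrary stand-in and assumes a Unix shell):

```rust
use std::io::Read;
use std::process::{Command, Stdio};
use std::thread;

fn main() {
    // Hypothetical child that writes to both streams; a real runner
    // would spawn the test binary instead.
    let mut child = Command::new("sh")
        .args(["-c", "echo out; echo err 1>&2"])
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()
        .expect("failed to spawn child");

    let mut out_pipe = child.stdout.take().unwrap();
    let mut err_pipe = child.stderr.take().unwrap();

    // One reader per pipe: neither stream can fill its 64 KiB pipe
    // buffer and stall the child while the other is being drained.
    let out_thread = thread::spawn(move || {
        let mut buf = Vec::new();
        out_pipe.read_to_end(&mut buf).expect("read stdout failed");
        buf
    });
    let err_thread = thread::spawn(move || {
        let mut buf = Vec::new();
        err_pipe.read_to_end(&mut buf).expect("read stderr failed");
        buf
    });

    let stdout = out_thread.join().unwrap();
    let stderr = err_thread.join().unwrap();
    child.wait().expect("child did not exit");
    assert_eq!(stdout, b"out\n");
    assert_eq!(stderr, b"err\n");
}
```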
First things first: for a reproducer you can...
...then run the following commands:
```sh
# Only observable with backtraces, which greatly increase the stdout/stderr output
export RUST_BACKTRACE=1
# Takes 2.3s on an AMD R5 5600G
cargo test ffi::int::tests -- --test-threads=1
# Takes 18.3s on the same CPU as above
cargo nextest run -j1 -E 'test(/ffi::int::tests/)'
```
I'm using only 1 test thread here because it makes timings more reproducible and reduces the perf workload during profiling (which enables going to higher sampling rates).
To be clear, the slowdown from enabling backtraces is already massive with cargo test (the same test run takes 0.3s without them), and although it has implications that I don't like when running tests over multiple threads, replacing the standard panic hook with a no-op while running those tests is an acceptable workaround.
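The no-op panic hook workaround mentioned above amounts to just a couple of lines (this is a sketch of the general technique, not the author's exact code):

```rust
use std::panic;

fn main() {
    // Install a hook that prints nothing, so panicking tests stop
    // flooding stderr with panic messages and backtraces.
    let default_hook = panic::take_hook();
    panic::set_hook(Box::new(|_info| {}));

    // A stand-in for a test that asserts a panic occurs.
    let result = panic::catch_unwind(|| {
        let x: Option<u32> = None;
        x.unwrap(); // panics, but the hook prints nothing
    });
    assert!(result.is_err());

    // Restore the default hook afterwards.
    panic::set_hook(default_hook);
}
```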
What I find suspicious here is that the slowdown is almost an order of magnitude higher with nextest than with cargo test, even though in this particular scenario neither of them should be doing anything with the test output, they should be just collecting it in some buffer in order to display it in case of failure. Somehow, cargo test must use a different test output collection strategy that's a lot more efficient.
Trying to investigate, a `perf` profile of the `cargo test` run (which unfortunately has much lower-quality data than my former profile, probably because I'm missing some debug info I can't identify on this machine) shows that `cargo` has almost no CPU activity during this test run.
I wonder how they manage to avoid the pipe overhead, assuming that's the only issue here. Perhaps they redirect the test output to some tempfile instead of a pipe so the kernel can handle all the data transfers without needing round trips to user space? Or perhaps they manage to have libtest do the buffering in-process somehow, and only print the output when a failure occurs?
To answer your first question, `cargo test ffi::int::tests:: -- --test-threads=1 --nocapture 2>&1 | wc -c` tells me that all these tests combined produce a total of 173 MiB of output over stdout + stderr. Yikes, I know. But this is testing an integer newtype, and unfortunately integers use panics as a failure mode in a lot of different operations...
Also, knowing that modern RAM can transfer data at tens of GB/s on consumer hardware does put the ~10 MiB/s data transfer rate implied by taking 18 s to fill up a ~180 MB buffer (173 MiB / 18.3 s ≈ 9.5 MiB/s) in a less favorable light.
> To be honest, I can't reproduce this. [...] In both cases, `cargo nextest run` is faster than `cargo test`. I do believe the flamegraph, I'm just trying to figure out what might be going on.
Perhaps the difference is that the std::backtrace printer in the panic hook does finer-grained writes? Because stderr is unbuffered, its performance can vary greatly depending on write granularity.
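The effect of write granularity is easy to see with a buffered wrapper -- a sketch comparing many small writes against the same bytes going through a `BufWriter` (the file path and byte counts are arbitrary):

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

fn main() {
    let path = std::env::temp_dir().join("granularity-demo.txt");

    // Unbuffered: every 10-byte write! hits the file descriptor directly,
    // just like fine-grained panic-hook writes hit unbuffered stderr.
    let mut raw = File::create(&path).expect("create failed");
    for i in 0..10_000 {
        write!(raw, "{:>10}", i).expect("write failed");
    }

    // Buffered: the same writes are coalesced into large chunks,
    // drastically reducing the number of underlying write syscalls.
    let mut buffered = BufWriter::new(File::create(&path).expect("create failed"));
    for i in 0..10_000 {
        write!(buffered, "{:>10}", i).expect("write failed");
    }
    buffered.flush().expect("flush failed");

    // Both variants produce identical bytes; only the syscall count differs.
    let len = std::fs::metadata(&path).expect("stat failed").len();
    assert_eq!(len, 100_000); // 10_000 writes x 10 bytes each
    std::fs::remove_file(&path).ok();
}
```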
> Somehow, cargo test must use a different test output collection strategy that's a lot more efficient.
`cargo test` does its output collection in-process, without using OS pipes. That might explain the difference.
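The in-process strategy can be approximated in plain std: route panic output into a shared buffer via a custom hook, and only surface it if the test fails. (libtest's actual mechanism relies on unstable internals, so this is only an illustration of the idea.)

```rust
use std::panic;
use std::sync::{Arc, Mutex};

fn main() {
    // Shared in-process buffer standing in for libtest's capture buffer.
    let captured = Arc::new(Mutex::new(String::new()));

    let sink = Arc::clone(&captured);
    panic::set_hook(Box::new(move |info| {
        // Append to the buffer instead of writing to stderr:
        // no pipe traffic unless we later decide to print it.
        sink.lock().unwrap().push_str(&info.to_string());
    }));

    let result = panic::catch_unwind(|| panic!("invalid input rejected"));
    let _ = panic::take_hook(); // drop our hook again

    // Only surface the captured output when the "test" actually failed.
    if result.is_err() {
        let output = captured.lock().unwrap();
        assert!(output.contains("invalid input rejected"));
    }
}
```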
Your point about stderr being unbuffered is well-taken (and is easy to verify by looking at https://doc.rust-lang.org/src/std/io/stdio.rs.html -- `Stdout` has a `LineWriter` in it, while `Stderr` does not). Maybe we need to add some buffering to std::backtrace in upstream Rust.
> cargo test does its output collection in-process, without using OS pipes. That might explain the difference.
Sounds pretty neat, is there a way nextest could leverage the same infrastructure? Or if not, is there a way the infrastructure could be modified to make it happen?
> Maybe we need to add some buffering to std::backtrace in upstream Rust.
This could also be a fallback plan if the above, more general optimization can't be made to work. I'm wondering whether this would fit the design constraints of std, though: the other day I saw a rust-lang issue proposing to turn OOM errors into panics, and if that's the plan, you probably don't want to allocate when printing the OOM backtrace...
Of course, this is assuming all other allocations can be removed from the backtrace computation, which is perhaps a little optimistic. And maybe a statically allocated or stack-allocated buffer could be made to work?
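A stack-allocated, truncating buffer is workable in principle -- a minimal `fmt::Write` sketch (the names, sizes, and truncation policy here are made up for illustration):

```rust
use std::fmt::{self, Write};

/// A fixed-capacity writer backed by caller-provided storage: no heap
/// allocation, and writes past capacity are silently truncated instead of
/// failing, which is what you'd want in an OOM panic path.
/// ASCII-only demo; real code must truncate at a UTF-8 char boundary.
struct FixedBuf<'a> {
    buf: &'a mut [u8],
    len: usize,
}

impl fmt::Write for FixedBuf<'_> {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        let room = self.buf.len() - self.len;
        let n = s.len().min(room);
        self.buf[self.len..self.len + n].copy_from_slice(&s.as_bytes()[..n]);
        self.len += n;
        Ok(())
    }
}

fn main() {
    let mut storage = [0u8; 16]; // stack storage; could be a static in real use
    let mut out = FixedBuf { buf: &mut storage, len: 0 };
    write!(out, "frame #{}: {}", 0, "a_very_long_symbol_name").unwrap();
    // The overlong symbol name was truncated to fit the 16-byte buffer.
    assert_eq!(std::str::from_utf8(&storage).unwrap(), "frame #0: a_very");
}
```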
> Sounds pretty neat, is there a way nextest could leverage the same infrastructure?
Hmm, so a thing I realized is that nextest currently always passes in `--nocapture`, whether or not it is interested in output when the test succeeds, even though most of the time users aren't interested in success output. A thing nextest could do is to not pass in `--nocapture` when it isn't interested in a test's output on success.
While writing this, I sadly realized that `--nocapture` is pretty load-bearing for tests that time out -- we want to show whatever output the test produced by the time it was killed, which means instructing libtest not to capture any output.
I think the only way to resolve this is to add buffering upstream (though maybe that's out of the question). There's nothing that can be done on nextest's side as far as I can tell.
How would you feel about an upstream feature that dumps the stdio buffer when a certain signal (say, SIGUSR1) is received? I think it is the minimal upstream change that would preserve optimal I/O performance in the common case, while allowing for the buffer to be dumped in case of timeout.
What if a test sets a SIGUSR1 handler for itself? I don't know how common that is, but it is a thing that's possible.
To be honest, I still don't see how the effort would be worth it. This seems to be an edge case that is addressed by setting your own panic handler. We can document this, and maybe even detect it if we're getting a ton of small reads.
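Detecting "a ton of small reads" could be done with a counting wrapper around the pipe reader -- a hypothetical sketch (the names and thresholds are made up, not nextest code):

```rust
use std::io::{self, Read};

/// Wraps a reader and tracks how many reads happened and how many bytes
/// they returned, so a runner could warn when the average read is tiny.
struct CountingReader<R> {
    inner: R,
    reads: u64,
    bytes: u64,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.reads += 1;
        self.bytes += n as u64;
        Ok(n)
    }
}

impl<R> CountingReader<R> {
    fn new(inner: R) -> Self {
        CountingReader { inner, reads: 0, bytes: 0 }
    }

    /// Hypothetical heuristic: many reads that each return little data.
    fn looks_chatty(&self) -> bool {
        self.reads >= 100 && self.bytes / self.reads < 128
    }
}

fn main() {
    // Simulate a chatty producer: 200 reads of 16 bytes each.
    let data = vec![b'x'; 200 * 16];
    let mut reader = CountingReader::new(&data[..]);
    let mut tiny = [0u8; 16];
    while reader.read(&mut tiny).unwrap() > 0 {}
    assert!(reader.looks_chatty());
}
```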
I would obviously be fine with a "wontfix, please silence your panics" resolution, since I already have a panic handler around that does this ;)
I have some tests around which assert that panics occur when invalid inputs are sent to certain functions. These tests run several times slower under nextest than under cargo test when `RUST_BACKTRACE=1`, and a quick perf profile suggests that they spend most of their time waiting for nextest to read out the data that they have sent down the stdout/stderr pipe.

The most likely reason why these tests are the only ones affected is that the default Rust panic handler sends lots of text down the pipe when backtraces are enabled, so these are probably my most chatty tests by a large margin.
In this particular situation, I can sidestep the problem entirely by replacing the Rust panic handler with a no-op during execution of these tests, which eliminates the constant stdio traffic from backtrace printing. But it might be worth investigating whether nextest can handle heavy stdio traffic more efficiently, e.g. by using larger pipe buffer sizes.