Tracking issue for eRFC 2318, Custom test frameworks

Centril commented 6 years ago

This is a tracking issue for the eRFC "Custom test frameworks" (rust-lang/rfcs#2318).

Documentation:

Steps:

[ ] Implement the RFC (cc @rust-lang/dev-tools @rust-lang/cargo -- can anyone write up mentoring instructions?)
[ ] Adjust documentation (see instructions on forge)
[ ] Stabilization PR (see instructions on forge)

Unresolved questions:

Notes:

Destabilization of #[bench]: https://github.com/rust-lang/rust/issues/63798#issuecomment-526593504

Implementation history:

Initial implementation: #53410

CAD97 commented 6 years ago

On stdout/err capture (after doing some research around the proof-of-concept #50457):

The current state is sub-optimal, to say the least. Both stdout and stderr exist as thread-local handles that only the stdlib has access to, which print family of macros use, as does panic.

Relevant libstd snippets

https://github.com/rust-lang/rust/blob/935a2f12b2a3ce44cbcbd0f7de1b388949cccca2/src/libstd/io/stdio.rs#L23-L27 https://github.com/rust-lang/rust/blob/935a2f12b2a3ce44cbcbd0f7de1b388949cccca2/src/libstd/panicking.rs#L38-L42 https://github.com/rust-lang/rust/blob/935a2f12b2a3ce44cbcbd0f7de1b388949cccca2/src/libstd/io/stdio.rs#L614-L711

The existing stdout/err capture for the test harness works by setting the thread-local, which, if set, the `print` family of macros will use, otherwise falling back to `io::stdout()`. This setup is problematic because it's fairly trivial to escape this (#12309) and 99.9% of logging frameworks will escape accidentally (#40298). The reason for this is that any solution that's generic over `io::Write` is going to use `io::stdout()` as the sink to hit stdout, and that doesn't go through the tread-local plumbing, so hits the true stdout directly. The way I see it, there are three ways to resolve this quirk: - Create a new stdlib `io::Write` sink that uses the thread-local plumbing that the `print` macros do today, and implement the `print` macros as writing to that sink. That's the approach I proved was possible with #50457. - Create a new `io::Write` sink that uses (`e`)`print!` behind the scenes. This is possible entirely in an external crate:

(E)PrintWriter - feel free to use these, I am the author and release them under MIT/Unlicense

```rust /// IO sink that uses the `print` macro, for test stdout capture. pub struct PrintWriter; impl std::io::Write for PrintWriter { fn write(&mut self, buf: &[u8]) -> std::io::Result { print!("{}", String::from_utf8_lossy(buf)); Ok(buf.len()) } fn flush(&mut self) -> std::io::Result<()> { std::io::stdout().flush() } } /// IO sink that uses the `eprint` macro, for test stdout capture. pub struct EPrintWriter; impl std::io::Write for EPrintWriter { fn write(&mut self, buf: &[u8]) -> std::io::Result { eprint!("{}", String::from_utf8_lossy(buf)); Ok(buf.len()) } fn flush(&mut self) -> std::io::Result<()> { std::io::stdout().flush() } } ```

Unfortunately, this suffers from a large number of drawbacks. Because we only have access to the `print` macro, we have to interpret whatever sequence of bytes the sink gets as a UTF-8 string, even though that's potentially not true, etc. etc. not zero-cost when it's not, and so on. It's a decent solution for hooking up logging in tests, but is very much not a general solution that could be used outside tests. That takes us to the third option, which is almost certainly the best, but also the most complicated: - Actually change stdout/err It sounds simple, but it's a lot simpler said than done with the _current_ test harness setup. `io::Stderr` boils down to a handle to a write stream available to the process, and is a _process-local_ resource -- there's only _one_ stdout per process. The test harness currently runs tests parallel on different threads within the same process, thus the current thread-local replacement. The other issue is that the test harness doesn't _just_ want to capture stdout of the tests, it wants to capture stdout of the tests _separately_ for each test, so it can print just the stdout of the failing test(s) and not have the other tests' output interspersed with it. The only solution I see at this time that maintains current behavior of separate captures per test run is to run them all in their own process, and capture stdout that way. That way each process can pipe its stdout/err to the test runner, which can then synchronize them and decide which ones to output to the true stdout/err if desired. A sketch of the plumbing required to make an out-of-tree libtest that works with this behavior: - `#[test] fn` are all collected into a list - A `fn main` is generated that takes an argument of one of the paths, and runs that test - (The existing `fn main` is hidden) - As normal, successful execution is success, and a panic is a failure exit code - Our runner calls the generated binary with each of the test paths that match any filtering done as separate processes, captures their stdout/err and reports them on failure Note that this is unfortunately stateful: we need to fold over all `#[test]` fn _before_ we can emit the proper `main`. I'm not knowledgeable enough around compiler internals to be much help implementing the machinery right now (though I'd love to learn my way around and help out!), but give me an API to build against and I'll definitely help with any test frameworks built against the plumbing the eRFC has us set to experiment with.

CAD97 commented 6 years ago

TL;DR:

To have sane capture behavior for stdout and stderr for print-family macros, panic, and io::{stdout, stderr}, running each test in its own (child) process is probably required. I'd love to help out wherever I can, but lack confidence.

retep998 commented 6 years ago

The only solution I see at this time that maintains current behavior of separate captures per test run is to run them all in their own process, and capture stdout that way.

This would be significantly slower on certain platforms where process creation is slow, notably Windows. Especially when the tests are all very quick this will cause severe regressions in total time.

CAD97 commented 6 years ago

Another possibility for process separation of tests is to spawn a process for each unit of parallelism, but each process is reused for multiple tests.

Of course, we could also just promote the thread-local replaceable stdout/err to io::Stdout/Err, but that would be suboptimal itself.

The issue is that stdout is, as far as I can tell, a process-local resource. To capture io::Stdout we either need our own stdout replacement machinery on top or to separate processes.

SoniEx2 commented 6 years ago

if I want to argue about this do I argue about it here?

so like I made hexchat-plugin crate. the problem is you can't test hexchat side without hexchat. so I'd need some sort of #[hexchat_test].

but I still want the users to be able to use plain old vanilla rust #[test]s. I shouldn't have to reimplement/replace the whole testing system just to be able to produce some hexchat-based unit tests (in addition to normal tests).

(disclaimer: I don't fully understand the RFC, but it seems to require either replacing everything, or not having tests)

jonhoo commented 6 years ago

@SoniEx2 this RFC is about how we'd like to change the underlying mechanism that is used to arrange testing in Rust so that authors can write new test frameworks (e.g., quickcheck) that have access to the same mechanisms as existing built-in tests ( able to access private functions, can find all tagged functions, etc.). The current proposal is relatively wholesale (you change the test framework for a whole crate), but this is something that may be improved upon. We decided against more sophisticated designs in the initial implementation phase (this is just an "experimental" RFC) so that we don't bite off more than we can chew, but I think down the line we'd want more fine-grained test framework selection too.

As for your specific example, you don't actually need custom frameworks for that feature :) It seems like you want to only run a test if hexchat isn't running? To do that, you could either mark the test with #[allow_fail], or you could have the test just return if it fails to connect. If you wanted a more fine-grained approach, you could write a procedural macro that re-writes any function annotated with #[hexchat_test] into a #[test] that returns if hexchat isn't running.

SoniEx2 commented 6 years ago

no you don't understand, they have to run inside hexchat. loaded with dlopen (or equivalent). after plugin registration.

djrenren commented 6 years ago

Hey all, I've come up with a simplified proposal and I have an implementation now (though it needs a little cleaning up before a PR).

https://blog.jrenner.net/rust/testing/2018/08/06/custom-test-framework-prop.html

Lemme know what you think!

CAD97 commented 6 years ago

I've thought some more about stdin/stderr capture. Here's a draft of a system that should be able to "perfectly" capture stdin/stdout, though not without O(1) (once per test suite) overhead:

The main process is started by cargo test. It does no actual testing; its job is to coordinate child processes to do the actual tests.
At startup of the main process, it spawns a child process for each unit of parallelism in the test runner. This is a one-time cost, as these child processes are reused for multiple tests.
With each child process spawned, a communication channel is created between the processes distinct of stdin/out/err. This is used to instruct the child process which test to run next, and for the child process to report back the success/failure of the test.
We avoid talking to the child process on stdin because the test could listen to stdin, and we want to avoid accidentally polluting that. (Also, we need a back-channel that's not stdout to communicate test results anyway.)
The stdout/stderr of the child processes are piped into communication channels into the main process, which handles reporting those if the test failed.
Tests are run off the main thread in child processes, so that the child process doesn't die to a panic and that it can (or the main thread can instruct it to) terminate runaway tests early.
After the main process has handed out all tests to the child processes, it shuts them down as it receives the final reports, and then outputs the final report to the user before terminating.

jonhoo commented 6 years ago

@djrenren finally had a chance to read your proposal. I'm excited by the progress on this, though your write-up did raise some questions for me:

Early on, you say that "Annotated items must be a nameable const, static, or fn.". Further down, when you discuss the use-case of a framework that supports test suites, you show a #[test_suite] annotation on a mod with #[suite_member] on the nested tests. It'd be good to show how that expansion works. Does a proc macro turn #[test_suite] into a const with an element for each #[suite_member]? How does it find all of them? Are the #[suite_member]s all turned into #[test_case]? How do you avoid double-executing those test cases (once as a "child" and once due to the #[test-case] annotation)?
For #![test_runner] you say "The type of the function must be Fn(&[&mut Foo]) -> impl Termination for some Foo which is the test type", yet further down you write a Fn(&[&dyn Foo]). I assume the latter is the correct version?
Is TestDescAndFn sufficient for our purposes here? I don't think that can (currently) wrap a struct or const in a meaningful way?
I find the "test runner" and "test format" examples a little backwards. To me, "formatting" is about output, and "running" is about the structure of the tests themselves (and how they're run). We can bikeshed names later though.
The "editor wants to integrate with testing" example makes me sad. It seems really sad for the crate-under-test to have to be modified to work with my editor. One of the earlier RFCs had a section on separating the test harness from the test executor, with the former controlling output and the latter how tests are written and executed. And crucially, the former should be possible to swap without changing the source code of the c-u-t. That doesn't solve the case of an editor wanting to control test execution as well, but I think ideally we'd be able to fold that into the harness somehow too.

CAD97 commented 6 years ago

I think in most cases the CUT won't need to change itself to integrate with the IDE test runner, so long as the IDE understands the test runner you're working with. As I understand how JetBrains' IntelliJ IDEA integrates with both JUnit and Rust's libtest is that it communicates using the standard API and stdin/out. JUnit's integration is obviously tighter, as IDEA was a Java IDE first, but the integration with libtest tests in Rust is already quite good even without any specific integration required.

Obviously the IDE needs to understand how the test runner works or the test runner needs to understand how the IDE works, but under most cases of using a popular test runner, the CUT shouldn't need to change to be tested with in-IDE functionality.

SoniEx2 commented 6 years ago

I'm gonna assume this isn't for me, as a crate author, but for IDE developers.

Let me know when crate authors are part of the plan.

djrenren commented 6 years ago

@jonhoo

Early on, you say that...

I don't want to get too bikesheddy on how that particular macro would work but basically we'd just produce a single const that had the #[test_case] annotation. The children would be aggregated inside it.

I assume the latter is the correct version?

Fn(&[&Foo]) -> () is a subtype Fn(&[&mut Foo]) -> impl Termination. It's not required that Foo be a Trait but if it is, then you need dyn, so those two lines aren't in contradiction.

Is TestDescAndFn sufficient for our purposes here?

I agree TestDescAndFn is not sufficient, though it could successfully be constructed from all sorts of things with the use of proc macros. I think for libtest we should just make a trait that essentially mirrors TestDescAndFn and accept that.

One of the earlier RFCs had a section on separating the test harness from the test executor, with the former controlling output and the latter how tests are written and executed.

This is still entirely possible under the current design, it just requires the test runner to be designed to be pluggable.

djrenren commented 6 years ago

@SoniEx2 It is for crate authors! 😄

For most crate authors (unless you're authoring test machinery), you'll just import any macros you want for declaring tests or benchmarks, and then you'll pick a test runner. For most cases the built-in one should be fine though.

SoniEx2 commented 6 years ago

but how do you run tests inside a hexchat instance, considering the whole hexchat plugin loading system and stuff?

SoniEx2 commented 6 years ago

I mean you can't (easily) run hexchat (a GUI-based networked program) in cargo, can you?

SoniEx2 commented 6 years ago

can I assume this won't help test integrations (how your plugin behaves running inside a host program) and is more geared towards testing your plugin's independent parts?

djrenren commented 6 years ago

For those worried about stdout capture, I've just demonstrated how this proposal can allow for @CAD97's style of multiprocess stdout capturing: https://github.com/djrenren/rust-test-frameworks/blob/master/multiprocess_runner/src/lib.rs

This code compiles and runs under my PR: https://github.com/rust-lang/rust/pull/53410

The output of cargo test is available in the same directory and is called output.txt

CAD97 commented 6 years ago

And just for clarity, the example does spawn a process for each test, which is suboptimal on OSes where process creation is slow, e.g. Windows. A real runner would implement a scheme like I described earlier to avoid spawning more processes than necessary. (And actually parallelize the running of tests.)

japaric commented 6 years ago

Whoops, I'm (very) late to the party.

I have read the blog post but not looked at the rustc implementation and I'm concerned with the presence of impl Termination and String (see Testable.name) in the proposal.

How will this proposal work with the version of #![no_std] binaries that's being stabilized for the edition release? Those #![no_std] binaries use #![no_main] to avoid the unstable start (i.e. fn main() interface) and termination (i.e. Termination trait) lang items. Without start if rustc --test injects a main function into the crate it will end up being discarded as these binaries use a custom entry point. Without termination the Termination trait doesn't exist (it is defined in std after all).

Also it would be best if using a custom test runner didn't require an allocator as that requires quite a bit of boilerplate (e.g. #[alloc_error_handler]) and uses up precious Flash memory.

Of course I'm not going to block any PR since this is in eRFC state and the idea is to experiment; i just want to make sure that the embedded use case is not forgotten and that what we end up stabilizing does support it.

cc @Manishearth I thought cargo fuzz used #![no_main] and that it had similar needs as the embedded use case, but perhaps the fuzzed programs do link to std so they are not that similar?.

djrenren commented 6 years ago

@japaric : support for the embedded use case has been a goal for me from day one, though I will confess I'm not intimately familiar with it.

The Testable trait in the proposal is defined by the testing framework, not the language, so it can definitely be designed without the need for allocation.

Test runner's definitely don't require dynamic allocation now because the slice that's passed is static.

As for the #![no_main] problem I'm not very familiar with it so I'll have to review a bit, but I see the issue... Might have to revise a bit so the test runner can be called from a generated start instead of main.

Manishearth commented 6 years ago

Yeah, cargo-fuzz uses no_main but I thought we had dropped the use case from the original eRFC? (or had considered it low priority).

cargo-fuzz can be changed to be non-no_main with some libfuzzer edits.

djrenren commented 6 years ago

Hey all,

As I'm wrapping up my internship and will have less time to work on this, I figured I'd leave some notes for the community on the state of things merged so far and going forward. (I'll still be around, but a full-time job will definitely slow my contributions).

Current Status:

#[test_case] and #![test_runner] are merged
No cargo integration currently (though they work just fine with cargo test)

Work to be done:

Migrate test::test_main_static to accepting a trait instead of a struct so it can be compatible with other test data formats beyond TestDescAndFn
Work with various test crate authors to evaluate the ergonomics of the system. (criterion or wasm-bindgen is probably a good place to start).
Experiment with abstractions on top of these tools. It may turn out that the primitives are too primitive

Having presented this once at Mozilla and at a meetup, I've gotten a bit of feedback that's worth mentioning. There's concern about only having a single test runner for the crate. @SergioBenitez recommended having #[test_case] be parameterized with what kind of test it's annotating, and building separate binaries for different kinds of tests.

I think this adds a reasonably heavy requirement on configuration and build-tools so I'm not sure I agree but the point is well-taken that cooperation between tests and runners could be tenuous under the current circumstances. I believe that having a default Testable trait accessible in the compiler could act as defacto standard for compatibility.

I'm sure he could say more about it and offer a stronger argument, but either way this is something we should sort out. Luckily these things could be implemented on top of the current primitives so experimentation should be doable.

SergioBenitez commented 6 years ago

Happy to expand on that, @djrenren.

The Problem

The current implementation makes it impossible, without workarounds or test runner intervention, to write tests that are intended to be run by different runners in a single crate. For example, consider a crate that would like to use both quickcheck and criterion:

#[quickcheck]
fn identity(a: i32) -> bool { ... }

#[criterion_bench]
fn benchmark(c: &mut Criterion) { ... }

Under the current plan, this would expand to:

#[test_case]
fn identity(a: i32) -> bool { ... }

#[test_case]
fn benchmark(c: &mut Criterion) { ... }

To run these under the current proposal, a user would need to provide a single runner that can execute both of these test cases. Of course, that's clearly not the intention here; one runner (quickcheck) is responsible for identity, while a different runner (criterion) is responsible for benchmark. Alas, there is no way to express this.

Solutions

I see two potential solutions to this problem: one puts the onus on the test runner and cargo while the other on the std. lib and cargo. The first requires test runners to emit a cfg when they emit a test_case to ensure that annotated items are only compiled when that runner is requested. With this approach, criterion and quickcheck would be modified so that the original code would instead expand to something like:

#[test_case]
#[cfg(target_runner = quickcheck::run)]
fn identity(a: i32) -> bool { ... }

#[test_case]
#[cfg(target_runner = criterion::bench)]
fn benchmark(c: &mut Criterion) { ... }

When a runner foo::bar is selected, cargo would set the target_runner = foo::bar flag automatically.

The second solution is almost exactly like the first but moves the responsibility for emitting the cfg attribute from the test runner crate to the standard-library. Under this approach, the test_case attribute takes a single parameter: the path to the runner intended to be used for the given test case. With this approach, criterion and quickcheck would expand the original code to something like:

#[test_case(quickcheck::run)]
fn identity(a: i32) -> bool { ... }

#[test_case(criterion::bench)]
fn benchmark(c: &mut Criterion) { ... }

Then, a test_case(runner) is only compiled when runner is the selected runner; how this happens is an implementation detail, but I presume it would work similar to the cfg approach above.

Summary

I believe this second approach is the better approach because it is less error prone, more succinct, and hides implementation details. With the first approach, all crates must follow the appropriate convention or risk breaking consumer crate: if even a single crate does not cfg a test_case, it again becomes impossible to write test cases for multiple runners in a single crate. What's more, the first approach results in adding boilerplate to every test runner crate and requires cargo to recognize a new cfg that must be stabilized.

I think this adds a reasonably heavy requirement on configuration and build-tools...

I disagree. If cargo is going to be aware of test runners anyway, having it emit a cfg that recognizes which runner was selected is a trivial addition.

Addendum

Why is the user restricted to selecting a single test runner? It would be nice to compile one binary that executes test cases from multiple runners.

CAD97 commented 6 years ago

I like the idea of #[test_case(quickcheck::run)]. Just to strawman out an idea that would allow for running multiple test runners in one go:

fn main() { // entry generated by `cargo test`
    let libtest_tests = &[ /* collected tests */ ];
    let quickcheck_tests = &[ /* collected tests */ ];
    let criterion_tests = &[ /* collected tests */ ];

    ::test::run_tests(libtest_tests);
    ::quickcheck::run(quickcheck_tests);
    ::criterion::bench(criterion_tests);
}

cargo-test would allow running specific tests filtering as normal, but also allow filtering by runner if you, say, only wanted to run the criterion benchmarks, or wanted to exclude them from your normal test. (For that matter, maybe the same treatment should be given to cargo-bench, add a #[bench_case] attribute, so that benchmark crates don't have to pretend to be test runners.)

djrenren commented 6 years ago

@SergioBenitez - My rebuttal:

I think your view of test runners is different from what I envisioned and that's causing some confusion but firstly, I'd like to point out that the multiple test runner case works under the current implementation and looks roughly like so:

#![cfg_attr(quickcheck, test_runner(quickcheck::runner))]
#![cfg_attr(criterion, test_runner(criterion::runner))]

#[cfg(quickcheck)]
mod quickcheck_tests {
    #[quickcheck]
    fn identity(a: i32) -> bool { ... }
}

#[cfg(criterion)]
mod criterion_benches {
    #[criterion_bench]
    fn benchmark(c: &mut Criterion) { ... }
}

This misrepresents the relationship between a test and its runner though. In general, tests should know how to run themselves. In fact, this is how #[test] works today.

#[test]
fn foo(){}

expands to:

fn foo(){}

#[test_case]
const foo_gensym: TestDescAndFn = TestDescAndFn {
    desc: TestDesc {
        name: TestName::Static("foo"),
        ignore: false,
        should_panic: ShouldPanic::No
    },
    testfn: TestFn::StaticTestFn(|| {
         assert_test_result(foo())
    }),
}

Where testfn captures how to run the test. It enforces the interface that tests panic if they return a non-zero termination.

Test runners are concerned with structural information about the test. Anything needed for reporting, filtering, or dispatching. This is why quickcheck can reduce to a #[test] function with a simple () -> () interface (and transitively into a TestDescAndFn.

It's reasonable for a minimal test interface to look like:

trait Testable {
     // panic on failure
     fn run(&self);
     fn name(&self) -> String;
     fn is_bench(&self) -> bool;
}

Then your example:

#[quickcheck]
fn identity(a: i32) -> bool { ... }

#[criterion_bench]
fn benchmark(c: &mut Criterion) { ... }

Would expand to:

fn identity(a: i32) -> bool { ... }

#[test_case]
const identity_test: SimpleTest = SimpleTest {
   test_fn: || { quickcheck_run(identity) },
   test_name: "identity",
   is_bench: false
}

fn benchmark(c: &mut Criterion) { ... }

#[test_case]
const benchmark_test: CriterionTest = CriterionTest {
   bench_fn: benchmark,
   test_name: "benchmark"
}

And there would be respective impls for the rest runner:

impl Testable for SimpleTest {
   fn run(&self) {
      self.test_fn()
   }

  fn name(&self) -> String {
     self.name.to_string()
  }

  fn is_bench(&self) -> bool {
     self.is_bench
  }
}

impl Testable for CriterionTest {
  fn run(&self) {
    self.bench_fn(criterion::global_instance())
  }

  fn name(&self) -> String {
    self.name.to_string()
  }

  fn is_bench() -> bool {
    true
  }

Explictly: there should not be a 1:1 relationship between test runners and test declaration styles

This allows the test runner to manage the execution of tests that were not created with it in mind.

Under the proposed addition of #[test_case(some_runner)] then the test declaration-style author (for example, quickcheck) would make the call of the test runner.

Why is the user restricted to selecting a single test runner? It would be nice to compile one binary that executes test cases from multiple runners.

A test runner is essentially a main function. It doesn't make sense to have two for the same reasons that it doesn't make sense to have two main functions. Because a test runner defines the command line interface for a test binary, having multiple would result in contention on this interfaces.

For example, @CAD97's suggestion is actually what I implemented at first. The issue here is that if ::test::run_tests and ::quickcheck::run allow for filtering by test name. What happens when I run:

$ ./test_bin mod_a::mod_b::test_name

You'd need a more complex mediator that did something like:

$ ./test_bin quickcheck mod_a::mod_b::test_name

If we allow multiple test runner then this mediation would have to be provided by the compiler, which puts us back into the realm of being opinionated about how test binaries should work. Such a setup could be implemented in user-space with the current implementation.

SergioBenitez commented 6 years ago

In general, tests should know how to run themselves. In fact, this is how #[test] works today.

I'm not sure what you could mean by "tests should know how to run themselves". A #[test] clearly doesn't know how to run itself: it requires the testing library to parse things like should_panic, to execute the function, and to catch panics. A #[quickcheck] test also clearly doesn't know how to run itself: it requires quickcheck to run the function against a myriad of inputs and then prune to find a minimal failure, which again, are foreign to the test itself.

Explictly: there should not be a 1:1 relationship between test runners and test declaration styles

Sure, but I don't see how my proposal enjoins this.

Under the proposed addition of #[test_case(some_runner)] then the test declaration-style author (for example, quickcheck) would make the call of the test runner.

I think this is the desire in almost every case. I understand the principal behind this plug-and-play, swappable runner, trait-based testing, but I don't see it getting use in practice. To think about what such use would mean: one crate would propose a testing interface, and some other crate would say "Oh, sure, I can run that!". It won't happen serendipitously, since the interface will become part of the public API, so one crate is making a decision to allow other runners to run "its" tests. If that's the case, then that crate can simply take in a runner in its own attribute:

#[quickcheck(some::runner)]

A test runner is essentially a main function. It doesn't make sense to have two for the same reasons that it doesn't make sense to have two main functions.

This is only true because you're defining it that way.

Because a test runner defines the command line interface for a test binary, having multiple would result in contention on this interfaces.

I think this is an interesting design point: should a test runner be able to define the CLI for the binary? I can see arguments both ways. If not, then this is a non-issue, but if so, I wouldn't be opposed to passing test-runner specific arguments, as you suggest, with something like:

$ ./test_bin --quickcheck quickcheck args here --criterion some other args

This would mean that either the binary can't accept any -- args, or the top-level CLI parser is clever enough to group things appropriately.

Overall, I think your design optimizes for the wrong thing while making a common thing very difficult and tedious to accomplish. I don't see a compelling argument for swappable runners, and making it cumbersome to have tests for multiple runners in a single crate, a situation I postulate will be very common (something my code does now, for instance), is unfortunate.

jonhoo commented 6 years ago

I'd also like to push back a little bit here @djrenren: it's not clear to me that a "clean" division of "test runners are just main functions that deal with CLI + output" and "tests that know how to run themselves" is sufficient. It could be, but it'd be nice to see some evidence for it (similarly, @SergioBenitez, can you think of any examples where that separation is not sufficient?). There are definitely cases where a test runner should have more knowledge about the exact tests, such as "run at most this many tests in parallel within this test suite", or "this suite of tests share the following set-up, which should only be run once". It could be that these can all be dealt with using shared state among the generated tests + mutexes, but some discussion around it would be good :)

shepmaster commented 6 years ago

tests for multiple runners in a single crate, a situation I postulate will be very common

Could you share some other places where this is common?

For example, my experience has not included any language where a project has multiple test runners in a single "place". The closest I can think of is multi-language projects. For example, the Rust Playground has Rust unit tests, JS unit tests, and Ruby integration tests, each with their own test runners. Thus I postulate it is not common 😇

Even in a Rust-only project, I think I'd prefer to place benchmarks in a completely separate file (a.k.a. a separate crate) from unit tests. I never want to run tests and benchmarks together and the information reported for each is drastically different.

I understand the principal behind this plug-and-play, swappable runner, trait-based testing, but I don't see it getting use in practice

I agree that it seems unlikely that there will be One True Trait for defining a test. I'd expect there to be multiple test runners, each with their own highly-specific trait.

What I would expect is to see adapters for tools like proptest to connect to popular test runners, leveraging each runners strengths. Thus, the crate proptest-stainless would provide the macro support needed to define and connect a proptest-based test with the stainless test runner. Ideally, these macros would have high overlap with other adapters like proptest-builtin and proptest-some-new-cool-runner, but they would also allow tying into runner-specific functionality (e.g. the test output can show the number of runs or the failing inputs in a better way).

It's possible that over time some basic commonality can be discovered (perhaps something along the lines of TAP?), but there will always be something special a test runner will want to provide.

SergioBenitez commented 6 years ago

@jonhoo I tried to give a couple of example indicative of this in my previous comment, but perhaps we have different notions of what "separation" means in this case. Could you perhaps expand on the question?

@shepmaster In Rocket, I'd love to have the following in a single crate:

CLI Tests
HTTP Client/Server Integration Tests
Quickcheck Type Tests
Fuzzers, with a time limit
Regular #[test] tests

I'm totally okay with benchmarks being in another crate. The first four are currently regular #[test] tests, since custom test frameworks don't exist, but I can see a nice interface for all four given this new functionality. It'd be odd to split them up into four (or five, if #[test] uses #[test_case]) crates, and having to manually specify a cfg to run the particular tests means not only more code, but also that with high probability, users won't run them before they submit a PR.

I agree that it seems unlikely that there will be One True Trait for defining a test. I'd expect there to be multiple test runners, each with their own highly-specific trait.

Note that my proposal continues to make this possible, albeit by making it explicit in the code.

mathstuf commented 5 years ago

Apologies if this is the wrong place for this, but it seems to be the furthest along the road I've traveled in search of skipping tests by detecting support for what is being tested at runtime. Basically something like #[ignore(lacks_support)] where fn lacks_support() -> bool (or maybe -> Result<(), String> to give a reason support is missing) is a function within the relevant scope. My use case is that I'm testing features tightly tied to the kernel that is being run (this is for the keyutils crate) and I'd like to have the tests indicate that they are not actually giving useful information rather than just bailing out and saying "yep, totally works".

This was mentioned in https://internals.rust-lang.org/t/pre-rfc-make-test-and-bench-more-flexible/1328, but that discussion has died down.

So, is this something that is basically under this RFC impl (as something like an additional method on the trait for whether the test should be ignored or not) or is it something that should go into a new RFC? Thanks.

shepmaster commented 5 years ago

@mathstuf I think Is there a simple way to conditionally enable or ignore entire test suites in Rust? and Is it possible to conditionally enable an attribute like derive? cover your needs if it's statically determinable, like a feature flag.

You can run a build script to test your system and set those feature flags, as well.

mathstuf commented 5 years ago

Using a build.rs to determine it is probably sufficient for now (though the Rust bindings for performing these checks is what is being built…so that seems awkward…unless tests can have their own build.rs?), but I think other things also fit under this umbrella. This also has precedent in other test frameworks (e.g., pytest has support for skipping based on runtime checks).

mathstuf commented 5 years ago

Actually, build.rs would completely break for cross-compiling because what matters is the target machine's feature set, not the host machine's, so runtime-determined skipping still seems very relevant to me in the wider ecosystem.

e-oz commented 5 years ago

Thank you very much, now I can't compile my project using Rust 1.38.0-beta.1:

"use of unstable library feature 'test': bench is a part of custom test frameworks which are unstable"

And it's not even inside my code (library I use have this error), so I can't change it.

idubrov commented 5 years ago

In general, tests should know how to run themselves. In fact, this is how #[test] works today.

I'll give a counter-example. In https://crates.io/crates/datatest, each "test" is a template for tests which is "rendered" by the test runner, either based on scanning the filesystem for files matching the pattern or by scanning given YAML file for test cases.

So, there are no "tests" unless test runner renders them based on external data.

I could instead run all the test cases in a single "#[test]", but large part of datatest UX is that each individual test case is treated as a separate test (you can run it individually, for example, the same way you can run individual test with cargo test -- <test>).

It works well with nightly features #[test_case] and #[test_runner(datatest::runner)] (though, I have to explicitly support "regular" #[test]-tests). Stable is a different story... 😅

killercup commented 5 years ago

@idubrov, just discussed this in the dev tools meeting. As far as we could tell there was not much progress on this over the last months. We do want this to pick up speed again, though! If you'd like to help out from the rustc side too, let us know! It's always good to have actual use cases to work towards :)

@gnzlbg, it seems you worked on libtest/extracting libtest? Can you maybe give a quick summary of the state of this?

idubrov commented 5 years ago

I'm happy to help, but I'm a bit confused about the scope of the problem. I've read RFC, I have some basic understanding of other pieces, however, the big picture is something I'm still not clear about.

To me, the whole testing framework seems like a combination of the following aspects:

Test runners. Things like scheduling threads, capturing input and output, collecting statistics (counts of failed tests, and so on). Seems like requirements could very vary across different test runners:
1. Benchmark-like tests (criterion) would perhaps want to do all the scheduling themselves. I'm not very familiar with benchmarking, but I would expect they don't want multiple random threads unnecessarily as that might affect results?
2. Fuzzing tests could run in parallel, but they are essentially infinitely long, so they, perhaps, need their own scheduling, too?
3. "Datatest" tests want to 100% piggy-back on the standand test runner, though, as they are like regular tests, but "rendered" at runtime from test declaration pattern.
4. Not sure about "quickcheck", but maybe it falls in the same category as "datatest"?
API to tests execution. This would be cargo-based command-line API as well as IDE discovering tests (which, perhaps, should work via cargo anyways). Given the previous item, it seems like different tests should have different entry points. However, at the same time, if custom test kinds are allowed and, essentially, unlimited, there still should be a single entry point to the whole bag of test runners somehow? Otherwise, how would you run them? Create cargo-xyz-test for each test framework?

Another reason to allow everything in one crate would be so you don't have to compile same binary for each test runner. On the other hand, might be necessary for some types of tests (like, benchmarking tests requiring "release" build)?
Declaring tests. In my opinion, ideal scenario would be if you can mix and match tests of all various kinds in the same crate (in other words, you should not have to select only one single test runner for your crate).

My personal preference would be something like https://docs.rs/linkme/0.1.3/linkme/ or https://internals.rust-lang.org/t/idea-global-static-variables-extendable-at-compile-time/9879. I think, it would be a very nice generalization of what compiler currently achieves with #[test_case].

But then again, I don't know if that would work, for example, for fuzzing tests? They might want to do compilation their own way?
Rendering test "instances" based on their static declaration in code. This is to take test declarations from 3 and generate input for test runners in 1.

I make this very artificial separation between test "declaration" (what you see in code) and actual test instance (what you see in the output of cargo test -- --list). However, this separation would server "datatest" the best. Currently I override standard test runner and "render" tests before delegating to it, but in fact, I don't really want to intercept test runner. Even more, I would love to be able to support other test runners (say, criterion) by "expanding" to their test declarations from my test "patterns".

I guess, for "datatest" in particular, this might be generalized as simple as saying that these test declarations should be lists of functions that return test declarations rather than list of declarations themselves (and then the format of these declarations, like https://doc.rust-lang.org/test/struct.TestDescAndFn.html, should be stable public API).

djrenren commented 5 years ago

Hey @idubrov, I can answer a lot of these questions.

1. Test Runners

Every executable needs a single entrypoint, so when we build a test executable there needs to be some main and that main needs access to all the tests it's going to run. This inherently couples three problems: What is a test? How do I run it? How do I report it?

The test runner exists to answer both questions. For libtest a test is a TestDescAndFn, it runs them by invoking the wrapped function, and it reports it using its built-in reporting infrastructure.

This is okay but as you've noted it leaves out a lot of possibilities, for example, this is a very flat structure, if you want test suites with setup and teardown doing it in this model is awkward. Or if you want to test without enable panic unwinding you're out of luck (this is actually still a problem with the current test_runner setup but it should be fixable).

So the test runner has to answer these questions, but it can delegate them to the test through the use of some trait. Imagine it had trait like MyTestTrait with the method report(). Then this test runner becomes flexible.

2. Command line API

In order to allow CTF's to work in all environments some of the logic should be in rustc. You raise a valid point that building for test shouldn't imply dev.

Right now, rustc can accept a test_runner parameter the same way it receives any other crate attribute. By making rustc flexible enough, we can allow Cargo to orchestrate these options and shape convention. For this reason we left the cargo command-line interface mostly out of scope. But note, that test_runner can be used even in the Cargo integration test files.

3. Declaring tests

As stated above, there needs to a single entrypoint to the executable so "multiple test runners" essentially means "one super generic test runner" which you can build today. If there ends up being a right™ way for such a thing to work, we could consider stabilizing it. For now, the tools to build it exist.

The other option is compiling different executables for each test runner. This is also doable today using #[cfg(..)] so if you put #[cfg(quickcheck)] over each test that's meant to run with quickcheck and you put #![cfg_attr(quickcheck, test_runner(quickcheck::runner)] in your lib.rs then you're off to the races. The former can be hidden macros and the latter could eventually be orchestrated by Cargo profiles.

Solutions like linkme struggle from not being particularly cross-platform, whereas test_case works by manipulating the rust AST ensuring that its aggregation works everywhere rustc works. When I was designing the CTF approach, people were rightfully concerned about creating a truly generic aggregation technique and opening pandora's box to all sorts of weird shenanigans in the ecosystem. Also the affects of aggregation on incremental compilation were (and probably still are?) not well understood. #[test_case] works well because it could eventually be reimplemented in terms of such a primitive (should it ever come to exist) in much the same way #[test] is now built with #[test_case]

4. Test Manifests

All that cargo test -- --list does is build the test executables and pass in --test to them. This means you could build any similar behavior into a test runner today. Case and point: libtest has it. It could be implemented for a more generic interface than TestDescAndFn but ultimately that's all there is to it.

P.S. While I've covered a lot of it here, my user-centric writeup of the current implementation is here: https://blog.jrenner.net/rust/testing/2018/08/06/custom-test-framework-prop.html

idubrov commented 5 years ago

@djrenren how this proposal relates to the whole RFC? Is it one possible direction?

I still can't make proper connections in my head; your proposal seems to be at least partially implemented in nightly Rust (and custom test runner is what I use in my "datatest" on nightly).

However, RFC itself is very confusing.

djrenren commented 5 years ago

Ah yeah, the RFC is very wishy-washy (by nature of being an eRFC). It basically says something along the lines of "find use cases, make them work".

The test_runner approach is fully implemented in nightly rustc (though, there's some weirdness around no_std to work out). The current approach in nightly is one possible direction, but developed with a good deal of input from the tools team. From a design approach it represents a "maximally generic, minimal surface area" approach. In my opinion, it represents the best path forward, but at the very least, it provides a really powerful base on which to experiment with different APIs if you have different ideas.

The RFC really struggles from a lack scoping. For example, there will always be some kinds of testing we don't support, but the RFC doesn't draw this line. For my approach, I ended up ruling out doctests and compiletests, because these rely on concepts above the language the level. (Though a a variation of each could be expressed)

The purpose of the current implementation was to provide a grounding for a specific RFC (which I arguably should have written, but then grad school caught up with me). Through implementation we were able to learn a lot about what's possible, what's advisable, etc. that wasn't known when the eRFC was written. So that's the state of things.

Does that make sense?

mathstuf commented 5 years ago

Ah yeah, the RFC is very wishy-washy (by nature of being an eRFC). It basically says something along the lines of "find use cases, make them work".

Does my above comment count as something which counts as a "found use case" that can be made to work in an official RFC? Would it help if support for such a thing were added to your experimental branch?

e-oz commented 5 years ago

So you just don't care that you brought breaking change into language? Don't break your own promise, call it Rust 2.0 then.

djrenren commented 5 years ago

@mathstuf, for sure! Here's a sketch of how you'd solve your problem using the currently tools available in nightly:

Create a type to describe tests that allows them to report a reason for skipping.
Create a test macro that produces a #[test_case] const .... value that generates values of this type.
Create a test runner that accepts these and knows how to report reason for skipping.

Right now this is a little painful if you're migrating from existing libtest stuff. Ideally you'd just declare a new trait to describe your SkippableTest type and implement it for TestDescAndFn and then you'd have backwards compatibility with regular #[test] functions. Unfortunately, we can't do this because TestDescAndFn isn't stable. This is one of the reasons @gnzlbg's work on extracting libtest is so important. If libtest is just a regular crate you can add to your dependencies then you can rely TestDescAndFn from that crate which has far-weaker stability requirements than anything that ships with rustc.

shepmaster commented 5 years ago

@e-oz please keep discussion on-topic and civil. Nothing being discussed here involves breaking changes as far as I can see. If there are breaking changes, we appreciate them being pointed out as we do not wish to break backwards compatibility.

djrenren commented 5 years ago

@e-oz it sounds like your issue is the result of a library you consume taking advantage of unstable rust features, which is not allowed outside of nightly. This is a long-standing policy of the compiler to have nightly-only features to allow for experimentation with APIs before they are stabilized.

If #[bench] was working on stable or beta rust then it was a bug, and it got fixed. If this happened because you moved from nightly to beta then it is expected behavior. You can solve your problem by going back to nightly, though in general I don't recommend depending on unstable nightly features as they are subject sudden change.

idubrov commented 5 years ago

Does my above comment count as something which counts as a "found use case" that can be made to work in an official RFC? Would it help if support for such a thing were added to your experimental branch?

Yes, this is something that is already supported with nightly features / @djrenren proposal. The whole "datatest" thing I'm doing is basically generating test instances at runtime. Ignoring at runtime works, too (here I ignore by path, but as well I can ignore by anything else): https://github.com/commure/datatest/blob/114ea40c983ae108f6d7b70e6bb6fc5f3de56e0f/tests/tests/mod.rs#L74-L86

One thing, though, is that this requires overriding test runner. I wonder if lot of use-cases could be covered by allowing list of tests to be fn() -> &[TestDescriptor] instead of TestDescriptor. Or any other way (compatible with no_std) of allowing to enumerate test descriptions at runtime rather than providing them directly?

mathstuf commented 5 years ago

OK. If we have docs for doing that, that sounds fine to me. I really just want the test report to explicitly say "skipped because $reasons" rather than:

not showing up at all (compile-time decisions)
spurious passing (because nothing was actually tested)
spurious failure (really annoying)

Right now, my solution is to have a "sentinel" test whose output is basically "this module's tests are going to fail because the features aren't supported".

Thanks; I'll take a look at nightly when I get a chance to see what I can do there.

djrenren commented 5 years ago

Oh and just in case you aren't aware, if you can determine that you want to ignore the test at compile time, you can use the ignore attribute either with attr_if or your own macro to conditionally insert it.

On Aug 21, 2019, at 11:45, Ben Boeckel notifications@github.com wrote:

OK. If we have docs for doing that, that sounds fine to me. I really just want the test report to explicitly say "skipped because $reasons" rather than:

not showing up at all (compile-time decisions) spurious passing (because nothing was actually tested) spurious failure (really annoying) Right now, my solution is to have a "sentinel" test whose output is basically "this module's tests are going to fail because the features aren't supported".

Thanks; I'll take a look at nightly when I get a chance to see what I can do there.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

mathstuf commented 5 years ago

I'm aware; it's just that for cross-compiling, compile-time detection is kind of useless for detecting kernel features available :) .

rust-lang / rust