
Track which tests are run in CI #95995

Open · jyn514 opened this issue 2 years ago

jyn514 commented 2 years ago

Currently, it's pretty easy to accidentally miss tests in CI - e.g. I recently found that compiler doctests haven't been enforced in a while: https://github.com/rust-lang/rust/issues/95994. It would be great to track which tests are being run.

@mark-simulacrum did you already have some ideas in mind for this? How were you expecting the generated output to look, and what sort of granularity did you want? I think just tracking Step runs should be pretty easy, but maybe not as useful as tracking cargo invocations. Having our own build system gives us some flexibility here.

Originally posted by @Mark-Simulacrum in https://github.com/rust-lang/rust/issues/95994#issuecomment-1097489474
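For illustration, a minimal sketch of what tracking Step runs could look like: `Builder::ensure` and the `Step` trait are real rustbuild types, but the logging hook and the `BOOTSTRAP_STEP_LOG` environment variable below are invented.

```rust
use std::fs::OpenOptions;
use std::io::Write;

/// Hypothetical hook: append one line per executed Step so CI can
/// upload the log afterwards. `BOOTSTRAP_STEP_LOG` is an invented
/// environment variable, not something bootstrap reads today.
fn log_step_run(step_debug: &str) {
    if let Ok(path) = std::env::var("BOOTSTRAP_STEP_LOG") {
        if let Ok(mut f) = OpenOptions::new().create(true).append(true).open(path) {
            let _ = writeln!(f, "{step_debug}");
        }
    }
}
```

Something like `log_step_run(&format!("{:?}", step))` called from `Builder::ensure` before dispatching to `Step::run` would then record every Step a CI runner executes.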

jyn514 commented 2 years ago

@rustbot label +A-testsuite +A-rustbuild

Mark-Simulacrum commented 2 years ago

I think there have been several offhand comments but nothing super well documented. Likely https://github.com/rust-lang/rust/pull/93717 or something like it would be used to prepare the files, though for this we'd want to integrate with compiletest and/or libtest to get per-test results.

Generally my thinking is that each test would get an identifier of some kind, and then we'd just upload files to the CI bucket recording "ran these tests". After that, something like toolstate - though maybe running externally rather than on master CI in this repo - would do a diff and file issues/alerts if there's a negative delta.

Shouldn't be too hard but a good bit of plumbing to do for sure.
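A minimal sketch of that negative-delta check, assuming the uploaded artifacts are just newline-separated test identifiers (the file format is an assumption, not an existing artifact):

```rust
use std::collections::HashSet;
use std::fs;

/// Return tests that ran on the previous build but not on the current
/// one. Both files are assumed to hold one test identifier per line.
fn negative_delta(prev: &str, curr: &str) -> std::io::Result<Vec<String>> {
    let read = |p: &str| -> std::io::Result<HashSet<String>> {
        Ok(fs::read_to_string(p)?.lines().map(str::to_owned).collect())
    };
    let prev = read(prev)?;
    let curr = read(curr)?;
    let mut missing: Vec<String> = prev.difference(&curr).cloned().collect();
    missing.sort();
    Ok(missing) // a non-empty result would trigger an issue/alert
}
```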

A second aspect would be for the negative-diff tool to have some asserts - e.g., I would like to know that tier 2 targets are actually getting built and distributed, tier 1 targets tested, etc.
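Those asserts could start out as a hardcoded table checked against the collected data; the target triples below are real, but the struct and table are invented for illustration:

```rust
/// Invented illustration of per-target asserts; not part of any
/// existing tool.
struct Expectation {
    target: &'static str,
    must_build: bool, // e.g. tier 2: must be built and distributed
    must_test: bool,  // e.g. tier 1: full test suite must run
}

const EXPECTATIONS: &[Expectation] = &[
    Expectation { target: "x86_64-unknown-linux-gnu", must_build: true, must_test: true },
    Expectation { target: "aarch64-unknown-linux-musl", must_build: true, must_test: false },
];
```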

A third aspect would be a dry run of some kind to give us a complete list of possible tests - ideally pretty "simplistic" (like find) to validate that we're running all tests at least once; we've had e.g. gdb version requirements that are never satisfied, or the missing rustdoc invocations you noted before. This is probably hard, and hopefully negative diffs help catch most of these - at least in the regression sense - but they don't help as much when adding tests.
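A find-style sketch of that dry run, using only the standard library; the assumption is that every .rs file under a test root counts as a candidate test, which is exactly the kind of crude upper bound described above:

```rust
use std::collections::BTreeSet;
use std::fs;
use std::io;
use std::path::Path;

/// Recursively collect every .rs file under a test root, as a crude
/// upper bound on "tests that exist on disk". The resulting set could
/// then be diffed against the tests CI actually reported running.
fn collect_test_files(dir: &Path, out: &mut BTreeSet<String>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            collect_test_files(&path, out)?;
        } else if path.extension().map_or(false, |ext| ext == "rs") {
            out.insert(path.display().to_string());
        }
    }
    Ok(())
}
```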

jyn514 commented 2 years ago

Ooh, https://github.com/rust-lang/rust/pull/93717 looks really cool :) I'll keep an eye on that.

Oh wow, per-test granularity! That would be awesome if it worked :D What would we compare it to? Is the idea to make sure that all CI runners are running the same tests? More tests are being added constantly, so we couldn't use a golden reference or anything like that.

Mark-Simulacrum commented 2 years ago

(updated comment).

jyn514 commented 2 years ago

> Generally my thinking is that each test would get an identifier of some kind, and then we'd just upload files to the CI bucket recording "ran these tests". After that, something like toolstate - though maybe running externally rather than on master CI in this repo - would do a diff and file issues/alerts if there's a negative delta.

Got it, makes sense :) I am worried, though, about what happens when tests are renamed - git rename detection is pretty imprecise, and it's not terribly rare to rename tests (e.g. https://github.com/rust-lang/rust/pull/91037).

Doctests are also likely to cause trouble, although we probably don't need to know about each individual code block being run, just whether the test runner itself is built and run.

> A third aspect would be a dry run of some kind to give us a complete list of possible tests - ideally pretty "simplistic" (like find) to validate that we're running all tests at least once; we've had e.g. gdb version requirements that are never satisfied, or the missing rustdoc invocations you noted before.

Just to clarify - is this meant to be "all tests that we could possibly support" or "all tests that x.py currently supports"? We can currently get a pretty coarse idea of the latter from the help paths (`x test --help -v`). Unfortunately, neither "all current tests" nor "all paths and crates" fully contains the other - e.g. things like html-checker and expand-yaml-anchors, which use cargo run instead of cargo test, will be pretty hard to detect automatically, I think. Maybe those are less important than the stuff in src/test, though?

An example of "possible tests not currently supported" is the codegen_xxx crates: https://github.com/rust-lang/rust/issues/95518

Mark-Simulacrum commented 2 years ago

My hope is that libtest integration (probably an environment variable or so, plus dumping JSON into a file) is enough to cover all of compiletest, rustdoc, and direct libtest usage - which would work out great for us in terms of ease of deployment. (Those files would probably be consumed and repackaged by rustbuild to associate them with target and other metadata.)
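For reference, libtest already has an unstable line-delimited JSON output (`--format json -Z unstable-options`); a sketch of pulling test names out of a stream of those events, with serde_json as an assumed dependency and the event shape taken from that unstable format:

```rust
use serde_json::Value;
use std::io::{self, BufRead};

/// Read libtest's line-delimited JSON events from stdin and print the
/// name of every test that started running.
fn main() -> io::Result<()> {
    for line in io::stdin().lock().lines() {
        let line = line?;
        if let Ok(event) = serde_json::from_str::<Value>(&line) {
            // Event shape assumed from libtest's unstable JSON output,
            // e.g. {"type":"test","event":"started","name":"foo::bar"}.
            if event["type"] == "test" && event["event"] == "started" {
                if let Some(name) = event["name"].as_str() {
                    println!("{name}");
                }
            }
        }
    }
    Ok(())
}
```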

I'm not particularly worried about renames. If we get a false positive, it just means closing the issue - no big deal for an obvious PR. If the rate is high enough we can explore how to reduce it, but I suspect this is less common than you might think.

For the third aspect, I would say best effort - the more tests we pick up, the better, of course. Note that this is intended to be very granular: e.g., each individual ui test, not ui tests as a concept. (And maybe with manually added compare-mode matrices...?)
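A sketch of what a fully qualified identifier at that granularity might contain; none of these field names exist anywhere yet:

```rust
/// Invented sketch of a per-test identifier: one entry per test, per
/// target, per compare-mode.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct TestId {
    suite: String,                // e.g. "ui", "codegen", "rustdoc"
    path: String,                 // e.g. "src/test/ui/lint/some-test.rs"
    target: String,               // e.g. "x86_64-unknown-linux-gnu"
    compare_mode: Option<String>, // e.g. Some("nll"), if any
}
```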

My hope is that once the underlying infra is in place - even with the simplest of tests - it will be easy for contributors to help plug gaps, and going forward we can try to require PRs to plug gaps by linting against unknown tests that were run.

bjorn3 commented 2 years ago

Maybe we could have a FUSE filesystem which checks that all files are accessed at least once and, if not, prints a list of those that weren't? That wouldn't help with doctests, but it may help with other kinds of tests, and with dead files in general.

jyn514 commented 2 years ago

@bjorn3 FUSE sounds fun, but it seems like strictly more work, since we already have pretty intimate access to compiletest and libtest. Also, I think FUSE only works on Linux, and one of the goals here is to compare tests between different OSes.

jyn514 commented 1 year ago

@pietroalbini mentions that since https://github.com/rust-lang/rust/pull/111936 we track individual test names in build-metrics, which should make this a lot easier :)
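The schema of build/metrics.json isn't documented here, so the sketch below deliberately walks the JSON generically, treating any object with both "name" and "outcome" keys as a test entry; that heuristic is an assumption about the format:

```rust
use serde_json::Value;
use std::collections::BTreeSet;
use std::fs;

/// Walk the whole metrics file and collect the "name" of anything that
/// looks like a test entry. The shape of the file is assumed, so this
/// stays loose rather than deserializing into fixed structs.
fn collect_test_names(node: &Value, out: &mut BTreeSet<String>) {
    match node {
        Value::Object(map) => {
            if map.contains_key("outcome") {
                if let Some(name) = map.get("name").and_then(Value::as_str) {
                    out.insert(name.to_owned());
                }
            }
            for v in map.values() {
                collect_test_names(v, out);
            }
        }
        Value::Array(arr) => arr.iter().for_each(|v| collect_test_names(v, out)),
        _ => {}
    }
}

fn main() -> std::io::Result<()> {
    let raw = fs::read_to_string("build/metrics.json")?;
    let root: Value = serde_json::from_str(&raw).expect("valid JSON");
    let mut names = BTreeSet::new();
    collect_test_names(&root, &mut names);
    for name in &names {
        println!("{name}");
    }
    Ok(())
}
```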