pre-release testing strategy

Mark-Simulacrum commented 2 years ago

@joshtriplett suggested in the 1.55 thread:

Also, I think we may want to change the recommended test procedure for pre-releases. Rather than having people do an early upgrade of their stable toolchain, I think we should have people do rustup toolchain add 1.55.0 and then test with cargo +1.55.0 or similar. That way, they still have their normal stable toolchain until the actual stable release, and if they forget to update again right away, they don't end up continuing to run the pre-release for a while.

Wanted to open this issue so we're not trying to discuss this on internals threads on every release, when the issue is somewhat decoupled in practice.

Mark-Simulacrum commented 2 years ago

The main worry I have is that if we're not modifying the default channel, then I think we lose out on testing of e.g. clippy, etc. that will happen over the course of the few days (tuesday/wednesday) in between having dev-static out and shipping the release itself. I think it's unlikely people will switch to using a versioned release in editor configurations, etc.

That could be mitigated with a rustup default 1.55.0, but I think that would be something folks are somewhat unwilling to do and moves them away from rustup update as an upgrade path (needing a default change back to the regular train).

That way, they still have their normal stable toolchain until the actual stable release, and if they forget to update again right away, they don't end up continuing to run the pre-release for a while.

I am curious on this aspect -- in theory, the dev-static artifacts are mostly equivalent to the real stable release. So I'm not sure that it's actually that bad to be using them. Perhaps I am weighing this trade-off of daily usage against probability of finding problems differently from you, though?

ckaran commented 2 years ago

I agree with the need for testing, but I also agree that something like rustup default 1.55.0 is a bad idea for the reasons @Mark-Simulacrum outlined above.

How about setting an environment variable within a particular terminal that all tools respect? Once you close your terminal, you drop that environment variable which resets everything back to the normal default state. It would also let you run multiple terminals side by side, each with different toolchains running for comparison purposes.

joshtriplett commented 2 years ago

@ckaran Setting RUSTUP_TOOLCHAIN=1.xy would have that effect, and that seems like a reasonable recommendation for slightly more "sticky" testing.

Even without that, I do think people who already run clippy are likely to test the new release with cargo clippy in addition to cargo check and cargo build and cargo test; adding +1.xy to that doesn't seem onerous.

I'd prefer to see people intentionally testing, and knowing that was what they were doing, rather than accidentally testing without having "I'm trying a pre-release that I need to report bugs about but I can always revert if there's an issue" at the top of their mind. I'd prefer that even at the expense of less testing, because it seems important that people's use of "stable" match the latest stable we've actually released.

I agree that generally our pre-releases end up matching the final release in practice, but we ask for testing to catch last-minute bugs, which if found would be one case where they might not match.

Mark-Simulacrum commented 2 years ago

It might be helpful to take a step back and describe the goals for the pre-release builds. As I see them, given the time window, there's several goals:

Give folks an opportunity to make sure there tools are somewhat present and working (e.g., if we accidentally stop shipping a tier-1 target's clippy or something like that, or ship a broken one). This is largely covered by intentional testing as you suggest.
Give distros a possibility of early-kicking their own pre-release builds off, and potentially noting any problems with the index.
Make sure infrastructure is working on the artifacts we plan to release (e.g., if binaries got too large or some other awkward thing happened, fail with a few days leeway).

I am not sure either of these are sufficiently worthwhile for the test to make much sense, as it's currently placed. It may be better for us to just make a post, say, Monday after the release reminding users to update to latest beta release -- that may be better placed in terms of getting us testing, and removes the need for the release team to coordinate dev-static (which does not seem prominently useful, as the two goals above are ~largely covered by beta releases already, IMO).

Are there other goals we might have in mind for the dev-static release?

ckaran commented 2 years ago

Setting RUSTUP_TOOLCHAIN=1.xy would have that effect, and that seems like a reasonable recommendation for slightly more "sticky" testing.

Sounds like the problem is (mostly) solved then! If there is good enough documentation for this, then I think everyone will be happy.

joshtriplett commented 2 years ago

@Mark-Simulacrum Reminding people to install/update the beta channel and to test their code with it might well be more productive than doing a last-minute "please try the stable pre-release".

Have we, historically, gotten critical bug reports out of the pre-release testing that we haven't gotten from beta?

Mark-Simulacrum commented 2 years ago

Did a survey of the internals posts that I could quickly find:

"Potentially critical" breakage:

1.22
- Broken cross-compilation to 32-bit macOS cargo#4699
- target_feature broken with system LLVM rust#46181
1.41.1
- Not feedback on thread: incomplete fix for miscompilation, additional backport rust#69444
1.36
- HashMap changed implicit UnwindSafe impl rust#62301

Non-critical breakage:

1.17
- Breakage for self-bootstrap (i.e., building 1.x with 1.x). Not release-blocking, somewhat expected, generally fixed as a class of "bugs" as of the last ~year:
1.26
- fn main() -> impl Trait interaction rust#50595
1.38
- Upstream distros using system libgit2 for Cargo did not work right; official artifacts were fine
1.50
- Builds with out of tree LLVM with llvm-dwp fail
1.53
- Clippy ICE
- rustfmt stops working on modules in inline modules

No 'bad' feedback before the release, modulo non-per-release discussion and release note change suggestions:

1.7
1.8
1.9
1.10
1.11
1.12
1.13
1.14
1.15
1.16
1.18
1.19
1.20
1.21
1.23
1.24
1.25
1.26.1
1.27
1.27.1
1.28
1.29
1.29.2
1.30.1
1.31
1.32
1.33
1.34
1.35
1.37
1.39
1.40
1.41
1.42
1.43
1.43.1
1.45.1
1.46
1.47
1.48
1.49
1.51
1.52
1.54
1.55
1.56
1.57 (optimistically, perhaps :)

So we've had 3/58 releases with 'critical' feedback on the internals thread, and 8/58 releases have had some feedback given. I could not find threads for 1.0-1.6 (internals search is failing me), but optimistically am assuming all was largely OK.

Not sure what conclusions to draw yet, but it's likely also instructive to look at point releases, which are likely a superset of critical breakage / "missed" problems, primarily adding in rare security fixes.

Going by the blog to try and find all these point releases. If there's a fix that was previously "unknowable" (e.g., security point release), I'll call that out, as that means it is not really relevant for the discussion here (since prerelease artifact testing can't have found it). Issues are X/Y, where Y is the total number of issues "fixed", and X is the number of critical issues (i.e., not just incidental backports due to the ocurrence of the release), as quickly assessed by me in hindsight.

1.12.1 - 9/9 issues fixed (all due to move to MIR)
1.15.1 - 2/2 issues fixed (API signature, fPIC missing on 32-bit)
1.22.1 - 1/1 issue fixed (Cargo macOS breakage)
1.24.1 - 1/4 issues fixed (revert ffi-unwind = abort, linking fix for non-ASCII paths on Windows, couple smaller bits)
1.26.1 - 3/6 issues fixed (...)
1.26.2 - 1/1 issue fixed (match ergonomics unsoundness)
1.27.1 - 2/2 issue fixed (match ergonomics unsoundness, non-beta: rustdoc security)
1.27.2 - 1/1 issue fixed (match ergonomics unsoundness)
1.29.1 - 1/1 issue fixed (non-beta: str::repeat unsoundness)
1.29.2 - 1/1 issue fixed (LLVM miscompilation, missing rls artifacts for windows-gnu)
1.30.1 - 1/1 issue fixed (rustdoc ICE)
1.31.1 - 2/3 issues fixed (powerpc-unknown-netbsd build failure, 2 RLS bugs)
1.34.1 - 3/3 issues fixed (false positives in clippy lints)
1.34.2 - 1/1 issue fixed (non-beta-ish: Error::typeid unsoundness)
1.41.1 - 3/3 issues fixed ('static soundness holes, miscompiled bounds checks)
1.43.1 - 1/3 issues fixed (cpu feature detection, package listing, OpenSSL upgrade)
1.44.1 - 2/4 issues fixed (tool fixes)
1.45.1 - 1/4 issues fixed (1 unsoundness in const prop, 3 tool fixes)
1.45.2 - 2/2 issues fixed (only in prerelease: revert of 1.45.1 patch, track_caller on trait objects miscompiles)
1.52.1 - 1/1 issue fixed (disabling incremental)
1.56.1 - 1/1 issue fixed (non-beta: bidi lints)

Overall it seems like there's basically been one(!) case where a point release was issued for a fix that couldn't have been detected on betas, ignoring cases where we have fairly last minute backports -- and no beta is produced with those commits. My gut feeling is that those commits rarely introduce breakage, so I am not too worried, but it's worth calling out.

In general, this data I think shows that prerelease artifact testing is not particularly useful. We very rarely get reports of breakage on those threads; 2-3 cases at most in the last ~5 years. The point release data also suggests a significantly (~10x) higher rate of bugs which are not found by the prerelease testing we currently do, or at least, not fixed.

Of the 21 point releases since 1.0, it seems like the vast majority of issues could have been detected on beta releases -- and so, it is unclear that having dedicated artifacts is that instrumental. We currently have those artifacts on Tuesday morning if there are no problems, per the release process, as they're produced Monday evening (US). This gives us a roughly 48 maximum window to rebuild those artifacts, which is quite short, considering an issue needs to get found, noticed by someone on the team(s), a fix identified + developed, and finally artifacts kicked off. This involves typically at least 3-4 people, which is a lot to engage in such a short timespan.

It also creates additional overhead for the release team, currently adding ~2 public posts which need to be made (internals + inside Rust) that need to be done ASAP to maximize effectiveness, giving less slack for babysitting artifacts and such.

All that to say: I think we need to revisit the strategy of asking to test these artifacts at the last minute like this. Currently, it is the case that we typically do not have those artifacts until the last minute (we almost always have a final round of beta backports included in the stable artifact PR), so moving this testing earlier essentially amounts to an ask to test beta, IMO.

How we motivate users to test beta has never been all that clear, as it yields no real benefit for most users, beyond helping the Rust project avoid possible regressions; it's not clear whether a regular call to test will be impactful. It might be that the right solution lies closer to a push to get Rust users onto beta for all local development -- through inclusion of asks to switch to beta in release blog posts, maybe rustup suggestions after stable upgrades, maybe some other initiatives.

It's also suggestive that just dropping this additional step from our release process would likely not seriously harm us, in practice, and so may be advisable in that regard -- something to consider, at least.

Open to hearing thoughts, obviously lots to digest here, and I'll be thinking more on this as well.

ckaran commented 2 years ago

Is there a way of setting an environment variable that builds projects using all available toolchains? E.g., I have the nightly, beta, and stable channels on my machine, but I only ever build using stable unless I need something that is only available on nightly. Instead of selecting and switching between the various toolchains actively, what about an opt-in mode (RUST_ALL_TOOLCHAINS=true) that performs the build using each installed toolchain, separating the results into toolchain-specific directories (e.g., instead of target/debug, we get target/nightly-2021-06-03-x86_64-unknown-linux-gnu/debug). With the variable set, all the various commands might automatically rerun with the various toolchains each time they are called. E.g., cargo build automatically does cargo +stable build ; cargo +beta build ; cargo +nightly build. That will guarantee that every toolchain is exercised each time on every volunteer's machine.

As an added sanity measure, we could also print out a warning each time the tools are used when RUST_ALL_TOOLCHAINS is set. That way when end users start looking at their build times they can be reminded that they volunteered to do this, and that the solution is simply to delete RUST_ALL_TOOLCHAINS from their environment.

joshtriplett commented 2 years ago

@ckaran I don't know how easily we could support something like RUST_ALL_TOOLCHAINS; having cargo build do multiple separate builds would be possible but awkward, and it'd get even more problematic for something like cargo run.

However, the concept of "print a warning each time" is a helpful one. We could, without too much difficulty, make rustup support RUST_TOOLCHAIN=warn:1.xy, which would cause rustup to print a banner each time it dispatches to that version ("Warning: testing toolchain 1.xy because RUST_TOOLCHAIN=warn:1.xy"). That would help people not forget they were testing that.

If we wanted to move the call for testing earlier, that could just as easily become RUST_TOOLCHAIN="warn:beta".

ckaran commented 2 years ago

@joshtriplett OK, if RUST_ALL_TOOLCHAINS is problematic, then how about something like the following:

For each tool, create a test driver script (e.g cargo has the script test_cargo).
The test driver script knows how to find the tools without requiring the shell to resolve the path (not sure how to do this part).

When the test version is called, it calls the real tool, but with the appropriate arguments to make the output go to a reasonable directory. E.g., test_cargo <<other options>> might be:

#!/usr/bin/env bash
echo "Warning: Testing toolchain 'nightly' because you called the 'set_test' script in this terminal." 
    && cargo +nightly --target-dir target/nightly/ $@` #my bash is rusty, that might not be the right variable

A set_test script whose job is to set aliases for all the commands. E.g., it will set an alias in your shell so that cargo points to test_cargo.

If you want to test the new tools, call set_test, and all of your tools get the aliased definitions. That means that any scripts you've created will continue to use the new aliases, but only while within that terminal. The original tools don't need to know anything about the testing framework, and the aliases will disappear once you close the terminal.

Notes

I'm being very Linux/Bash specific here, and the idea would need to be made general enough that it supported all platforms that rust wants to support. I suspect that there are other, better ways of handling this than I've outlined, it's just a starting point.
I'm really, really bad at naming things. Someone else should probably be in charge of coming up with good names for the various scripts.

joshtriplett commented 2 years ago

@ckaran Are you just trying to prevent people from having to rebuild their code when they switch back to stable? Because they'll have to rebuild it anyway when they upgrade to the stable release.

ckaran commented 2 years ago

@joshtriplett No, I'm trying to ensure that they don't forget that they're testing and then wonder why their build times are slow, etc. By using a script that sets ephemeral shell variables, as soon as that shell is discarded, you're back to your good old default toolchain. I also think that having a script that you have to call within the shell each time you want to set the variables will help avoid the problems of setting the default toolchain, and then forgetting that you did that.

Maybe instead of all of the above, we need something like rustup default --ephemeral nightly, or something more clever than that.

conradludgate commented 2 years ago

The trouble with environmental variables is that IDEs will not always see them, and won't run the correct language server.

I'd maybe suggest using a marker file rust.prerelease perhaps. That way it's an obvious flag in the workspace about the nature of the rust toolchain being used, and works across shells and IDEs

lukexor commented 2 years ago

I don't interact all that much with pre-release, but has the obvious question been asked: how often people are forgetting they're on a pre-release toolchain? Is this a big problem or pain point for a lot of folks?

cuviper commented 2 years ago

On the original question of having users override their "stable" channel -- What does rustup actually look at when deciding if an update is needed? The manifests like channel-rust-stable.toml include the git_commit_hash and a hash of every downloadable file, so in theory it should be able to tell if the bits you downloaded from dev-static are different than the bits we finally release. In the common case where we don't actually change the build after that call for testing, then there's no observable difference in continuing to use the "pre-release" toolchain.

On the broader question whether this call for testing is useful, I don't know. I do often use that src tarball for a scratch build on Fedora infrastructure, but I don't remember ever having an issue to report back from that. Usually if anything, I just find some small build changes that I need to adapt in my rpm spec. That's nice for me to get ahead of release day, at least. :)

joshtriplett commented 2 years ago

Coming back to this: I'm increasingly finding myself agreeing with @Mark-Simulacrum's assessment that maybe we just don't need to ask people to test the release artifacts in advance.

(If we do, I continue to think we should suggest that people install it by version.)

CAD97 commented 1 year ago

An interesting midway alternative could be a prestable release branch that usually points to the same thing as stable but points to the prerelease artifact for the prerelease period.

But I do generally agree that testing against beta is sufficient to catch anything that would be caught in a prerelease. If it's not otherwise problematic, we could consider having beta point to the prerelease artifacts for the period between cutting the prerelease and cutting the fresh beta. (Having -Vv say "stable" on the beta toolchain for that period might be too weird, though.)

rust-lang / release-team

pre-release testing strategy #16

Notes