Caching-friendly generation of "current" rustdoc JSON

obi1kenobi commented 1 year ago

Describe your use case

For workspaces that semver-check many crates, generating the "current" rustdoc JSON takes a significant portion of time: almost an hour across all of rust-libp2p, for example.

To ensure rustdoc JSON stability and correctness, "current" (also "baseline") rustdoc is generated by:

making a synthetic project with a path dependency on the target crate
generating the rustdoc for that path dependency

This synthetic project is generated at cargo-semver-checks runtime. Because of https://github.com/rust-lang/cargo/issues/6529, cached build artifacts for that project can't be used: cargo will ignore them since their mtime is older than the project. The hack of applying a synthetic mtime in the future to the build artifacts didn't seem to work, for unknown reasons: #399

It would appear that build artifact caching for the current rustdoc JSON is just not possible at the moment. It would be nice to make that a supported use case.

Describe the solution you'd like

A mechanism that allows caching of build artifacts of "current" rustdoc JSON builds
Use of that mechanism by default in the official cargo-semver-checks GitHub Action

Alternatives, if applicable

It may be possible to do this as a step-by-step process:

Generate the current rustdoc separately, e.g. via a separate subcommand or tool:
- Generate the synthetic project with the path dependency, e.g. via the tool.
- Then, say via a manual script, update the mtimes of the generated project.
- Then, either via the tool or via a manual script, generate the needed rustdoc from the synthetic project.
Then use this rustdoc in cargo-semver-checks via the --current-rustdoc <path> CLI option.

This is certainly not as elegant, but may be substantially faster for large repos.

Additional Context

No response

thomaseizinger commented 1 year ago

What is the rationale for the synthetic project?

obi1kenobi commented 1 year ago

It's due to issues like this one: #167

TL;DR: The user's copy of their project is not guaranteed to be in a state amenable to valid rustdoc generation. To get it there we need a fresh lockfile -- but we don't want to overwrite the user's own lockfile, so we make a synthetic project and lock there instead.

It's a really unfortunate but really serious problem, because it can trivially lead to all the flavors of semver-major false-positives one could imagine.

thomaseizinger commented 1 year ago

TL;DR: The user's copy of their project is not guaranteed to be in a state amenable to valid rustdoc generation. To get it there we need a fresh lockfile -- but we don't want to overwrite the user's own lockfile, so we make a synthetic project and lock there instead.

What if we backup the lockfile, run cargo semver-checks and then move it back after?

Like:

mv Cargo.lock Cargo.lock.bkp
cargo semver-checks ...
rm Cargo.lock; mv Cargo.lock.bkp Cargo.lock

thomaseizinger commented 1 year ago

The solution to https://github.com/obi1kenobi/cargo-semver-checks/issues/167 seems a bit excessive. In our workspace, this means recompiling the same dependencies for every crate because the individual synthetic crates don't share a target directory!

thomaseizinger commented 1 year ago

We could also state as a requirement that the repository needs to be versioned with git.

You could then check for uncommitted changes and fail in that case.

Assuming no committed changes, run cargo update and later git reset --hard.

obi1kenobi commented 1 year ago

I totally get the frustration. We definitely need to do better — this issue isn't the only one like that.

In our workspace, this means recompiling the same dependencies for every crate because the individual synthetic crates don't share a target directory!

This can be changed today: the build should still obey cargo configuration. If you specify CARGO_TARGET_DIR (or the equivalent config setting) then you can make all the crates share a target directory.

We could also state as a requirement that the repository needs to be versioned with git.

I'm worried that might get in the way of merging cargo-semver-checks into cargo. cargo publish does check for dirty repo state, but cargo check doesn't and shouldn't — and I think cargo-semver-checks should be closer to the latter: usable with uncommitted state and useful not just when publishing.

It's also a horrifically bad user experience if cargo-semver-checks crashes and leaves the project without a lockfile (and with a Cargo.lock.bkp).

Please let me know if you get a chance to try the shared target directory trick I suggested! I'm curious how much of a difference it might make.

Like I mentioned in the issue summary, I'm open to adding a "secondary" mechanism to generate the rustdoc that does a cargo update and generates the rustdoc in-place. I just don't think it should be the default, because it's more fiddly and easier to accidentally misuse, and we don't have anywhere near the bandwidth in the project to make that approach safe for general use.

thomaseizinger commented 1 year ago

This can be changed today: the build should still obey cargo configuration. If you specify CARGO_TARGET_DIR (or the equivalent config setting) then you can make all the crates share a target directory.

Oh that is an interesting approach. I'll see if copying the target dir makes a difference. It is all compiled already so we'd save that bit of compute by just reusing the existing cache.

thomaseizinger commented 1 year ago

This can be changed today: the build should still obey cargo configuration. If you specify CARGO_TARGET_DIR (or the equivalent config setting) then you can make all the crates share a target directory.

Oh that is an interesting approach. I'll see if copying the target dir makes a difference. It is all compiled already so we'd save that bit of compute by just reusing the existing cache.

Uh, what if I point it to the existing target dir? :exploding_head:

thomaseizinger commented 1 year ago

This can be changed today: the build should still obey cargo configuration. If you specify CARGO_TARGET_DIR (or the equivalent config setting) then you can make all the crates share a target directory.

Oh that is an interesting approach. I'll see if copying the target dir makes a difference. It is all compiled already so we'd save that bit of compute by just reusing the existing cache.

Uh, what if I point it to the existing target dir? exploding_head

I guess that kind of circumvents the fix for #167 :joy:

thomaseizinger commented 1 year ago

Actually, no, because it still uses a different lockfile.

thomaseizinger commented 1 year ago

In our workspace, this means recompiling the same dependencies for every crate because the individual synthetic crates don't share a target directory!

This can be changed today: the build should still obey cargo configuration. If you specify CARGO_TARGET_DIR (or the equivalent config setting) then you can make all the crates share a target directory.

In my local testing, this is not the case. Environment variables are not passed through to the inner cargo invocation.

thomaseizinger commented 1 year ago

I think it doesn't work because we are using the commandline option to specify the target directory: https://github.com/obi1kenobi/cargo-semver-checks/blob/97f8a9bdc671040542e150bfb602409492356f7b/src/rustdoc_cmd.rs#L78-L79

thomaseizinger commented 1 year ago

I think it doesn't work because we are using the commandline option to specify the target directory:

https://github.com/obi1kenobi/cargo-semver-checks/blob/97f8a9bdc671040542e150bfb602409492356f7b/src/rustdoc_cmd.rs#L78-L79

If I remove these two, I can set the target directory via the env variable but then it doesn't find the JSON file because it is probably expecting it somewhere else. My question would be however, whether we can't just always use the original target directory? As long as we have a separate lockfile, #167 should not be a problem?

thomaseizinger commented 1 year ago

It is a bit of a hack but with this patch, I can successfully reuse most of the build artifacts from my usual build:

Index: src/rustdoc_cmd.rs
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/rustdoc_cmd.rs b/src/rustdoc_cmd.rs
--- a/src/rustdoc_cmd.rs    (revision 97f8a9bdc671040542e150bfb602409492356f7b)
+++ b/src/rustdoc_cmd.rs    (date 1677229148044)
@@ -48,14 +48,7 @@
             .manifest_path(manifest_path)
             .no_deps()
             .exec()?;
-        let manifest_target_directory = metadata
-            .target_directory
-            .as_path()
-            .as_std_path()
-            // HACK: Avoid potential errors when mixing toolchains
-            .join(crate::util::SCOPE)
-            .join("target");
-        let target_dir = manifest_target_directory.as_path();
+        let target_dir = metadata.workspace_root.ancestors().nth(2).unwrap().as_std_path();

         let stderr = if self.silence {
             std::process::Stdio::piped()

thomaseizinger commented 1 year ago

With:

a complete debug build of the workspace
the above patch
a cached registry JSON doc

A full run of cargo semver-checks check-release --workspace takes just over 3 minutes!

This is roughly equivalent to what I'd have in CI as I'd be restoring a cache that has most things built already, plus a cache for the registry JSON.

thomaseizinger commented 1 year ago

A full run of cargo semver-checks check-release --workspace takes just over 3 minutes!

This also works if I clean out all the local- directories, i.e. the mtime of the generated project doesn't seem to matter, cargo will happily reuse the existing build artifacts.

obi1kenobi commented 1 year ago

Thanks for doing all these experiments! We'll definitely incorporate the findings into the V2 action that @mgr0dzicki is in the process of building.

Context

IIRC, there were two concerns with target dirs that led to making the separate target dir:

mixing Rust toolchain versions -- this no longer applies, since cargo-semver-checks no longer needs nightly
compiling both baseline and current, both with --all-features -- IIRC we weren't sure if the two compilations might overwrite something in the regular target dir. We didn't want to risk "cargo-semver-checks slows down my iteration loop" due to some unfortunate and unpredictable interaction between features or between the two builds (e.g. with proc macros or something), so we went for the safe route.

Short of doing a deep dive into how cargo build artifacts are generated, I feel that:

Caution-by-default is still preferable, for the benefit of users that use the tool locally. There are too many variables to test everything, and we don't want to adopt a policy where local users have to supply extra config / env vars / CLI options just to have a working local workflow.
The CI environment is much more controlled, and shared target dirs probably make sense there.

Proposal

How about adding a --target-dir CLI option, same as in cargo itself? Then local users get to keep the "careful" setup but CI can reuse the cached target dir.

In fact, CI perhaps wants to use save-if: false on the target dir caching, just in case the baseline wasn't cached (e.g. in the first build after a new release) so that its build doesn't poison the cache. Can you see if save-if: false still results in good cache hits on the current rustdoc generation?

obi1kenobi commented 1 year ago

A full run of cargo semver-checks check-release --workspace takes just over 3 minutes!

If you have it handy, I'd love to see the timing data for this. I'm particularly curious how much time is spent running the lints vs building rustdoc (even with cache hits), so I know roughly how much impact the upcoming Trustfall optimizations API (demoed in my blog post) might have here.

thomaseizinger commented 1 year ago

compiling both baseline and current, both with --all-features -- IIRC we weren't sure if the two compilations might overwrite something in the regular target dir. We didn't want to risk "cargo-semver-checks slows down my iteration loop" due to some unfortunate and unpredictable interaction between features or between the two builds (e.g. with proc macros or something), so we went for the safe route.

In my experience, cargo has become pretty good at separating build output artifacts and then reusing what it needs, even if the same crate is re-compiled with different features.

Even if it were a problem, I am not sure how common it is to run cargo build and cargo semver-checks in an alternating fashion to actually negatively affect the iteration loop. If anything, not reusing the target directory massively slows down iteration because I need to wait for a rebuild of my entire workspace to run cargo semver-checks.

I'd consider any unnecessary re-compilation a cargo bug that needs to be fixed upstream. We are just running cargo doc. That really shouldn't mess with the other compilation outputs.

Proposal

How about adding a --target-dir CLI option, same as in cargo itself? Then local users get to keep the "careful" setup but CI can reuse the cached target dir.

Unless the default is to use the same target dir, I don't think there is much value in another config option. On the contrary, I always dislike when you have to dig into the details of a tool's config options to make it perform well. The default should perform well and if it that causes problems, I'd consider those as bugs.

In fact, CI perhaps wants to use save-if: false on the target dir caching, just in case the baseline wasn't cached (e.g. in the first build after a new release) so that its build doesn't poison the cache. Can you see if save-if: false still results in good cache hits on the current rustdoc generation?

I've only done local testing so far because I'd have to build my fork in CI to use it there and the target dir generated by the current approach is so big for our workspace (> 22GB) that I can't cache it in CI.

thomaseizinger commented 1 year ago

A full run of cargo semver-checks check-release --workspace takes just over 3 minutes!

If you have it handy, I'd love to see the timing data for this. I'm particularly curious how much time is spent running the lints vs building rustdoc (even with cache hits), so I know roughly how much impact the upcoming Trustfall optimizations API (demoed in my blog post) might have here.

I'll see to look into that.

obi1kenobi commented 1 year ago

@epage I'd love your take on this

TL;DR: @thomaseizinger has found that not creating a separate target dir specific to semver-checking results in significant speedup for large workspaces (~hour -> few minutes), and is proposing making that the default behavior. A long time ago, we considered this and felt it was too risky at the time, but the underlying assumptions have changed in the meantime (see above) and perhaps it's time to re-evaluate.

thomaseizinger commented 1 year ago

A full run of cargo semver-checks check-release --workspace takes just over 3 minutes!

If you have it handy, I'd love to see the timing data for this. I'm particularly curious how much time is spent running the lints vs building rustdoc (even with cache hits), so I know roughly how much impact the upcoming Trustfall optimizations API (demoed in my blog post) might have here.

I read the blogpost but I am not sure I understood from it how to gather the timing data.

From memory, it took about 20s to "update index" (sparse registry will hopefully fix that soon!). All checks were skipped across the entire workspace because every crate had received a major version bump already when I ran that. So I think the time is mostly spent in creating files, parsing rustdoc and the like:

180 seconds minus ~20 makes around 160 net tool run time
160 / 33 crates = ~5 seconds per crate

5 seconds is not amazing but an okay time, considering we are launching a couple of processes in the background. If you wanted to analyze that in detail, I'd probably run it through flamegraph to see where it spends its time. tracing might also be a good idea to capture spans and see where we maybe wait for IO to complete or other non-CPU intensive things that don't show up in flamegraphs.

epage commented 1 year ago

@obi1kenobi can you be more specific for what you are wanting to point at in this thread as being resolved?

obi1kenobi commented 1 year ago

@obi1kenobi can you be more specific for what you are wanting to point at in this thread as being resolved?

This is the context + my proposal: https://github.com/obi1kenobi/cargo-semver-checks/issues/401#issuecomment-1444980671

@thomaseizinger in the subsequent reply makes the case that we should reuse the target directory by default, instead of merely supporting it as an option, and that we should trust cargo to not rebuild unnecessarily.

thomaseizinger commented 1 year ago

Friendly bump. Unless I am mistaken, with 0.20 shipped, having to rebuild the current crate on every run is now the dominating factor for the CI runtime of cargo semver-checks.

thomaseizinger commented 1 year ago

What is necessary to push this forward? With unrelated improvements to our CI, we are now spending about 50% of our time in cargo semver-checks of the current crate: https://github.com/libp2p/rust-libp2p/actions/runs/4851553760/jobs/8645517496?pr=3715

In wall-clock time that is only 1min30s but it would still be nice if we could reuse the target dir here and thus benefit from the caching that we already do.

obi1kenobi commented 1 year ago

https://github.com/rust-lang/cargo/issues/12103 suggests that there might be another reason why the cache isn't being hit even if we manually moved the target files there. Going to keep an eye on it, it should make a positive difference or at least make the workaround of just copying in the target files work better.

thomaseizinger commented 1 year ago

We've now migrated to the provided GitHub action and semver-checking takes about 15 minutes for us. See https://github.com/libp2p/rust-libp2p/actions/runs/6068173889/job/16460787944#step:4:210. We have some interoperability tests that also take similarly long but the rest of CI finishes in < 5 minutes. It would be great if cargo semver-checks would be among that.

At the moment, we are completely unnecessarily recompiling the entire workspace, both locally and in CI. This results in high disk usage and slow feedback loops because everything is compiled twice.

Can something like https://github.com/obi1kenobi/cargo-semver-checks/pull/402 please be considered?

obi1kenobi commented 1 year ago

Yes, definitely. I also found git-restore-mtime, a tool that can restore mtimes based on the last time a file was touched in a commit, which may produce additional speedup. I want to look into both and see how we can get the biggest wins at the lowest maintenance costs.

I'm going to be travelling for a few weeks for RustConf and a couple of other conferences, but I look forward to picking thhis back up when I get back.

obi1kenobi / cargo-semver-checks