Open thomaseizinger opened 1 year ago
I'm not familiar with your setup for caching cargo doc
— does it work for only the current state of the package, or does it also cache the cargo doc
run for the baseline (crates.io) version as well? Based on the log, it seems like the actual checking of the rustdoc files finished in 0.044s, so the slow parts are building one or both of the rustdoc JSON files.
cargo install
step.Since the baseline version isn't changing all that often (especially not with every PR), it should be possible to cache the baseline rustdoc JSON file across runs. This is how the GitHub Action builds that file, and in principle it could be improved to cache the file as well. cargo-semver-checks
can be directed to use a pre-generated baseline rustdoc JSON file with the --baseline-rustdoc <file>
option.
After running cargo check --all-features
or cargo test --all-features
, building the rustdoc JSON for the current version should be quite fast since everything is already mostly checked and built. So running cargo-semver-checks
in the same runner where the tests already ran would probably make it much faster.
In general, I believe your cargo-semver-checks
setup is the most sophisticated one in existence (at least that I know of). If you're interested, I'd be thrilled to work with you on building all these capabilities into the cargo-semver-checks GitHub Action itself rather than each of us maintaining a separate workflow.
My workflow doesn't build it for the baseline version, no. I thought that one was downloaded from docs.rs or something?
Does it build the baseline version also locally?
I am happy to collaborate on an/the action. Fitting that one out with some pre-configured caching would be really nice actually :)
docs.rs doesn't yet host the JSON, so for now the baseline is built locally (based on the registry, a gitrev, or a filesystem path) unless the baseline JSON is explicitly provided via the --baseline-rustdoc
flag.
This is what the action does at the moment: https://github.com/obi1kenobi/cargo-semver-checks-action/blob/main/action.yml#L29
In a v2 of the action, we'd want to remove the version-selection logic from the action (since it incorrectly selects pre-releases, unlike the logic already present in cargo-semver-checks
) and make cargo-semver-checks
aware of the possibility of the baseline rustdoc JSON already being generated so it only rebuilds when necessary.
Perhaps something like a --baseline-cache <PATH>
option, where cargo-semver-checks
will move built baseline JSON files into that directory, or use existing files if they are there?
What do you think? Also, looping in @epage in case he has ideas.
make
cargo-semver-checks
aware of the possibility of the baseline rustdoc JSON already being generated so it only rebuilds when necessary.
I think this is the MVP. As long as it is somewhere in the target
directory, CI caching should pick that up. Configuration options to control the path can come later once someone asks for it!
I just saw that there is already target/semver-checks
but that directory is 21GB after a cargo clean
and running cargo semver-checks check-release --workspace
. We can't cache that :sweat_smile:
Perhaps something like a
--baseline-cache <PATH>
option, wherecargo-semver-checks
will move built baseline JSON files into that directory, or use existing files if they are there?
Just separating them from the current compilation output would already be good. That would allow us to clean that target directory of artifacts and cache it. These should never need to change so unless someone messes with the target
directory, I don't think there should be any bugs.
I just saw that there is already
target/semver-checks
but that directory is 21GB after acargo clean
and runningcargo semver-checks check-release --workspace
. We can't cache that 😅
I'm a bit disk-space-limited on the machine I have with me right now, so if you don't mind, I'd love to ask you to check something in that target/semver-checks
directory:
target/semver-checks/target
directory entirely.target/semver-checks
should be prefixed with registry
, right? Those are registry (crates.io) based baseline builds, which each contain both the usual compilation artifacts (debug
directory) and the rustdoc JSON doc
directory.doc
directories? That's the only part we absolutely need, and we can avoid caching everything else.The cache would be compressed with zstd in GitHub Actions, so it would probably be reduced to around one-third or one-fourth the raw JSON files' size.
Just separating them from the current compilation output would already be good. That would allow us to clean that target directory of artifacts and cache it.
We can put the cache somewhere like target/semver-checks/cache
. How does that sound?
- How big are all the JSON files from those
doc
directories? That's the only part we absolutely need, and we can avoid caching everything else.
> stat -c %s target/semver-checks/**/*.json | paste -sd+ | bc
110568535
That is a 110MB!
Just separating them from the current compilation output would already be good. That would allow us to clean that target directory of artifacts and cache it.
We can put the cache somewhere like
target/semver-checks/cache
. How does that sound?
Sounds good to me. I guess to make this really effective, cargo semver-checks
would have to move the JSON files into that cache
directory itself and clean the target directory afterwards otherwise we have a manual step again of cleaning those target directories.
That is a 110MB!
Whew, I'm glad it's that little :) It's also JSON so it should be highly compressible with zstd which GitHub Actions will use by default.
to make this really effective,
cargo semver-checks
would have to move the JSON files into thatcache
directory itself and clean the target directory afterwards
I think cargo-semver-checks should move the JSON files into the cache directory, but I think we should leave the cleaning to its GitHub Action I don't think that cargo-semver-checks itself should clean the target directory, since that might be undesirable for local use.
Let me know if you have any concerns!
That is a 110MB!
Whew, I'm glad it's that little :) It's also JSON so it should be highly compressible with zstd which GitHub Actions will use by default.
I'll test it locally!
to make this really effective,
cargo semver-checks
would have to move the JSON files into thatcache
directory itself and clean the target directory afterwardsI think cargo-semver-checks should move the JSON files into the cache directory, but I think we should leave the cleaning to its GitHub Action I don't think that cargo-semver-checks itself should clean the target directory, since that might be undesirable for local use.
Let me know if you have any concerns!
Hmm, I think that is inconsistent. We are creating those JSON files locally because we can't download from docs.rs.
The fact that we need to build them is an implementation detail IMO. Using as little disk space as possible is also desirable locally, especially when we are talking as much as 22GB for a single repository!
Alternatively, we can also say that the program is completely ignorant of where the data is stored and the action moves around the JSON files and passes them in as baseline etc
That is a 110MB!
Whew, I'm glad it's that little :) It's also JSON so it should be highly compressible with zstd which GitHub Actions will use by default.
I'll test it locally!
Default zstd
yields an archive 13 MB in size :)
to make this really effective,
cargo semver-checks
would have to move the JSON files into thatcache
directory itself and clean the target directory afterwardsI think cargo-semver-checks should move the JSON files into the cache directory, but I think we should leave the cleaning to its GitHub Action I don't think that cargo-semver-checks itself should clean the target directory, since that might be undesirable for local use. Let me know if you have any concerns!
Hmm, I think that is inconsistent. We are creating those JSON files locally because we can't download from docs.rs.
The fact that we need to build them is an implementation detail IMO. Using as little disk space as possible is also desirable locally, especially when we are talking as much as 22GB for a single repository!
Alternatively, we can also say that the program is completely ignorant of where the data is stored and the action moves around the JSON files and passes them in as baseline etc
I am happy to have a go at implementing this if you agree that it should happen within cargo semver-checks
. It currently contributes to about 20% of our CI time which I am keen to optimize :)
That'd be great! I think it's reasonable to handle caching here, and it's also fine to delete the baseline build data from the target dir if caching the output. I'd also be open to adding zstd compression for the cached baseline JSON file if you think that's worth it independently, though perhaps that should be a separate addition just to keep things simple.
How can I help? I'm happy to review draft PRs or whatever else you need.
I'd also be open to adding zstd compression for the cached baseline JSON file if you think that's worth it independently
I am leaning towards that being separate. Compared to other build output, the JSON files are small meaning there isn't much to be gained. The only environment that would benefit from that is local builds and I am not sure how frequent local use is compared to CI runs?
How can I help? I'm happy to review draft PRs or whatever else you need.
I'll open a draft PR once I have some questions and/or an initial design, thanks :)
Having seen the internals now, I have a question: Why are we building the docs for the current revision in a separate directory? Why not just reuse the default location? That would allow users to automatically benefit from caching. We already cache the debug build for our crates, having cargo semver-checks
use the same location for the debug build of its docs would avoid unnecesary rebuilds, wouldn't it?
It'd be something to try. Unsure how well it handles any overhead from the user doing different feature sets than us.
Here's a specific example scenario to illustrate the concern that @epage is pointing toward. In principle, cargo-semver-checks
may need to change the RUSTFLAGS
to resolve situations like this: https://github.com/obi1kenobi/cargo-semver-checks-action/issues/17
In that case, I believe that will both force a full rebuild and wipe the default location's cached data, which would probably make the user quite unhappy.
Not saying we couldn't work around this if needed. Just saying that applying some caution is justified :)
Right, that makes sense. I'll see if I can optimise that on my end by just adding another cache!
I spent today working on a new optimizations API in Trustfall, and while it still needs a bit of polish, I wanted to give you a sneak peek because it's relevant to the objective of making CI faster 😄
The numbers below are the result of implementing just one optimization using the new API. I've intentionally picked some gigantic crates where the perf impact is most obvious — smaller crates will benefit proportionally less, but still noticeably.
crate `aws-sdk-s3`
before:
PASS [ 93.454s] major auto_trait_impl_removed
after:
PASS [ 0.198s] major auto_trait_impl_removed
speedup: 472x
crate `aws-sdk-ec2`
before:
PASS [2421.900s] major auto_trait_impl_removed
after:
PASS [ 1.254s] major auto_trait_impl_removed
speedup: 1931x
Needless to say, I'm excited 🤩
Damn, that is amazing! :)
Looking at our current CI, I just had another idea. At the moment, we are creating and parsing two rustdocs, only to then say 0 checks necessary because we are already doing a minor bump which allows breaking changes.
We could reverse that and avoid parsing the rustdoc altogether?
The version numbers are read from the rustdoc files, which are the ultimate source of truth since it's what Rust and cargo saw and generated. So I don't think that would work, and while I'm sure it could be "wedged in," it doesn't sound like it would be worth the trouble.
I added one more optimization just to play with the new API, and I'm pretty happy with both the ergonomics and the outcomes: on the biggest crate I could find, the most expensive lint took ~1.3s instead of 4887s before (~3759x speedup).
Here's a full perf report. I used the aws-sdk-ec2
crate (~200MB rustdoc JSON for each of baseline and current). The baseline and current rustdocs each take 30-35s to generate on my computer. The queries are all unchanged — the only changes are in trustfall_core
(the new optimizations API draft) and in trustfall-rustdoc-adapter
(adding indexes and applying them to queries).
I need to clean up the code and add quite a few tests before this is ready for prime-time. But this already seems very encouraging, and I am sure that there's even more performance that we could squeeze out here if we chose to.
I did a slightly better job with the index construction, using HashMap
instead and making them point to the item itself rather than to its Id
(eliminating one more pointer hop), and this got down to a nice round number so I'm going leave it here.
The slowest lint in the suite is now taking just a hair over one second to run, and is a nice round 4500x faster than before.
When we start running the lints in parallel with rayon, on most hardware it will take approximately the same time to parse (not generate! just parse!) the rustdoc JSON with serde as to run all the lints.
I'm pretty happy with that.
More details on how all of the above works here: https://predr.ag/blog/speeding-up-rust-semver-checking-by-over-2000x/
Now just a small matter of finishing the implementation (and testing!) of the new API so we can hit these numbers outside of a test setup.
Describe your use case
We are using this in CI and it contributes about ~1-2min to the CI run of each crate despite:
cargo semver-checks
being installedcargo doc
being cachedSee https://github.com/thomaseizinger/rust-libp2p/actions/runs/3653859145/jobs/6173748557 for an example.
Is there a way for how this can be improved? When I run it locally, it is a lot faster (after the first invocation) so I am guessing there is a cache somewhere? Which files do I need to cache?
Describe the solution you'd like
cargo semver-checks
to finish within seconds, not minutes.Alternatives, if applicable
No response
Additional Context
No response