rust-lang / cargo

The Rust package manager
https://doc.rust-lang.org/cargo
Apache License 2.0

(Option to) Fingerprint by file contents instead of mtime #6529

Closed: illicitonion closed this issue 1 month ago

illicitonion commented 5 years ago

Describe the problem you are trying to solve

The primary problem I have is that when building my code on Travis, the actual code in my workspace builds every time, even though much of it hasn't changed and I have target-directory caching on. The reason is that Travis makes a fresh clone of my git repo, which doesn't preserve mtimes. This can add about 5 minutes to every Travis run. My project mixes Rust and non-Rust code, so those 5 minutes are added even to runs where no Rust code has been touched. I started futzing with mtimes, but that seems fragile and doesn't address the root of the problem.

Additionally, edit-undo loops cause re-compilation locally, which is a little annoying.

Describe the solution you'd like

Add a new LocalFingerprint::ContentBased(Digest, PathBuf) variant to https://github.com/rust-lang/cargo/blob/b84e625c4401be3893a4d6d44d5dbac82b33f1c4/src/cargo/core/compiler/fingerprint.rs#L204-L209, which reads the content of the PathBuf, passes it through a SipHasher, and mixes the result into any aggregate fingerprints. Use this instead of LocalFingerprint::MtimeBased.
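
As a rough illustration of the proposed variant (a sketch only: the variant name comes from this proposal, the enum shown here is a stand-in for cargo's real fingerprint types, and std's DefaultHasher stands in for whatever SipHasher cargo actually uses):

```rust
use std::collections::hash_map::DefaultHasher; // stand-in for cargo's SipHasher
use std::fs;
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

// Hypothetical shape of the fingerprint variants discussed above.
#[allow(dead_code)]
enum LocalFingerprint {
    MtimeBased(u64, PathBuf),          // existing: (mtime, path)
    ContentBased(u64, PathBuf),        // proposed: (content digest, path)
}

// Read the file's bytes and reduce them to a digest that can be
// mixed into an aggregate fingerprint.
fn content_digest(path: &Path) -> std::io::Result<u64> {
    let bytes = fs::read(path)?;
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    Ok(hasher.finish())
}
```

The key property is that the digest depends only on file contents, so a fresh `git clone` with new mtimes still produces identical fingerprints for unchanged files.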

Notes

This will probably slow down no-op builds slightly (and in some circumstances, such as large build-script inputs over NFS, significantly), so it may want to be behind a flag (perhaps --fingerprint-strategy={mtime,content}).

This would probably also make the shared caching that people talk about a lot (most recently at https://internals.rust-lang.org/t/idea-cargo-global-binary-cache/9002) easier.

I'd be happy to implement this if the PR is likely to be accepted :)

This would probably also fix

adam-azarchs commented 6 months ago

Most of our team has their home directory mounted over NFS; NFS can have great throughput but will generally have pretty terrible metadata latency. Regardless, even a naïve implementation of sha256 can easily handle 500 MiB/s on a modern system; the hashing time should still be negligible on any reasonably-sized codebase, especially since it's trivially parallelizable.
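
The "trivially parallelizable" point could be sketched like this (a toy sketch with one thread per file; a real implementation would presumably use a bounded thread pool and cargo's own hasher rather than std's DefaultHasher):

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::path::PathBuf;
use std::thread;

// Hash each file's contents on its own thread and collect (path, digest) pairs
// in the original order.
fn hash_files_in_parallel(paths: Vec<PathBuf>) -> Vec<(PathBuf, u64)> {
    let handles: Vec<_> = paths
        .into_iter()
        .map(|p| {
            thread::spawn(move || {
                let bytes = fs::read(&p).unwrap_or_default();
                let mut h = DefaultHasher::new();
                bytes.hash(&mut h);
                (p, h.finish())
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```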

vlovich commented 6 months ago

It's a very odd setup to have a git repo on an NFS share. Even so, you'd expect the source code to be sitting in the page cache, which again invalidates the whole "I/O latency dominates" argument, because you're going to be doing a local memcpy plus hash. The best choice is to design so that performance is optimized for the cases where hash speed could be the bottleneck.

Speaking of hash algorithms, I came across gxhash, which sits on the Pareto frontier: substantially faster than xxh3 at smaller inputs and than ahash at large inputs.

As for whether or not it's embarrassingly parallel, this kind of work can surprisingly be less so, because coordination is required to send work to and receive results from a background thread. Given that most of the checks will come back unchanged, requiring no rebuild, the overhead of that distribution can easily outweigh a fast hash function running inline.

ClementTsang commented 5 months ago

@overlookmotel here's a tool that should have similar logic to the retimer tool, though it's a bit of a bandaid solution: https://crates.io/crates/mtime-travel

overlookmotel commented 5 months ago

@ClementTsang Thank you! Yes, a bandaid solution, but some kind of solution at least...

briansmith commented 4 months ago

We're not really interested in cryptographic hashing here for its security properties, so I think picking any reasonably fast algorithm should be fine (and I think SipHasher is reasonably fast as well).

I do think it is worth doing a complete security analysis. Consider a file foo.rs that contains a security bug, which has hash(foo.rs) = X. Now somebody submits a patch for foo.rs that contains a "fix" for a bug, such that the hash of the patched foo.rs also has the value X. Then when one rebuilds the project to get the security "fix," they silently keep using the broken version. This could be made to happen easily if the security bug is intentional and the developer of the security bug plans for this.

I know such a thing can seem far-fetched, but AFAICT this is why Bazel and other things that rely heavily on (distributed) caching for incremental rebuilds use strong hashes.

bjorn3 commented 4 months ago

A partially mitigating factor for that attack: for non-local dependencies (anything from a registry like crates.io, or a git dependency), rebuilding doesn't happen because the fingerprint changes (in fact the fingerprint isn't checked at all). Instead, the package is rebuilt because its identity changes: a different package version (or, for git dependencies, a different commit) is used, which by definition must differ to even pull in the attacker's changes. So the attack can only be pulled off against files in the project you are building. I can imagine we would still want to check the mtime for equality before checking the hash, to improve performance, which would fully mitigate the attack, since the attacker can't force the mtime to be identical between the original and the new file. It wouldn't be enough for a distributed cache like Bazel uses, though.
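
The mtime-first check described above might look roughly like this (a hypothetical sketch: the function name and the use of std's DefaultHasher are illustrative, not cargo's actual code):

```rust
use std::collections::hash_map::DefaultHasher; // stand-in for cargo's hasher
use std::fs;
use std::hash::{Hash, Hasher};
use std::path::Path;
use std::time::SystemTime;

fn content_hash(path: &Path) -> std::io::Result<u64> {
    let mut h = DefaultHasher::new();
    fs::read(path)?.hash(&mut h);
    Ok(h.finish())
}

// Trust a matching mtime and only fall back to hashing when it differs.
fn needs_rebuild(
    path: &Path,
    recorded_mtime: SystemTime,
    recorded_hash: u64,
) -> std::io::Result<bool> {
    if fs::metadata(path)?.modified()? == recorded_mtime {
        return Ok(false); // mtime unchanged: skip reading the file entirely
    }
    // mtime differs (e.g. a fresh CI checkout): compare contents instead
    Ok(content_hash(path)? != recorded_hash)
}
```

This keeps the fast path (unchanged mtime) metadata-only, while a fresh checkout with new mtimes still avoids a rebuild as long as the contents hash the same.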

adam-azarchs commented 4 months ago

I would like to emphasize once again that if you're using cargo vendor (as our team does, exclusively), cargo is already verifying sha256 sums against .cargo-checksum.json. It just then goes on to ignore that when deciding whether the file has changed.

michaelwoerister commented 4 months ago

For reference, sccache started using BLAKE3 for fingerprinting a while ago and I don't think that causes any performance problems. BLAKE3 should be fast enough for this kind of bulk-hashing scenario for the actual hashing to not factor into performance too much.

adam-azarchs commented 4 months ago

But, it's already doing sha256 hashing. Why would we add additional hashing on top of that?

bjorn3 commented 4 months ago

The .cargo-checksum.json check is done by cargo, while the file-content fingerprinting would be done by rustc. Also, cargo doesn't actually check whether the source files of non-local dependencies (which includes vendored dependencies) are modified, so there is no hashing that would be saved by rustc using sha256 too. For vendored dependencies, when building the first time, both rustc and cargo do their own independent hashing either way; when checking whether a rebuild is necessary, cargo will immediately consider the vendored dependency not modified and check neither .cargo-checksum.json nor the hash calculated by rustc. In other words, it doesn't matter what hash rustc uses: cargo never uses it at all for vendored dependencies.

adam-azarchs commented 4 months ago

Cargo absolutely does verify the checksums of files against .cargo-checksum.json. And I'm pretty sure no one here was actually suggesting adding content hashing to rustc; the request is very specifically to have at least an option for cargo to not use absolute path and mtime as part of the fingerprint that it uses for determining whether rebuilding is required, and use a content hash instead. By the time rustc sees the file in question, it's too late - we've already started the rebuild.

bjorn3 commented 4 months ago

Cargo absolutely does verify the checksums of files against .cargo-checksum.json.

It does when building the first time. It does not check it when the crate has already been built; it will immediately consider the crate as not needing any rebuild, well before it would get to checking .cargo-checksum.json. I actually tried this myself.

And I'm pretty sure no one here was actually suggesting adding content hashing to rustc

The first attempt at implementing this did actually modify rustc to include the source hashes in the dep-info file, and then let cargo hash the files again on later builds to check whether they changed, as this is the only race-free way of implementing it. I haven't seen any suggestion to do it differently.
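
The two-sided flow described here (rustc records source hashes at compile time; cargo re-hashes and compares on later builds) could be sketched like this. Everything below is illustrative: the one-hash-per-line file format, function names, and std's DefaultHasher are assumptions, not the actual dep-info format.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::path::Path;

fn digest(path: &Path) -> std::io::Result<u64> {
    let mut h = DefaultHasher::new();
    fs::read(path)?.hash(&mut h);
    Ok(h.finish())
}

// "rustc" side: record each source file's content hash alongside its path.
fn write_dep_info(dep_info: &Path, sources: &[&Path]) -> std::io::Result<()> {
    let mut out = String::new();
    for src in sources {
        out.push_str(&format!("{} {}\n", digest(src)?, src.display()));
    }
    fs::write(dep_info, out)
}

// "cargo" side, on a later build: re-hash each recorded source and compare.
fn sources_changed(dep_info: &Path) -> std::io::Result<bool> {
    for line in fs::read_to_string(dep_info)?.lines() {
        let (recorded, path) = line.split_once(' ').unwrap();
        if recorded.parse::<u64>().unwrap() != digest(Path::new(path))? {
            return Ok(true);
        }
    }
    Ok(false)
}
```

Since rustc has to read every source file anyway, recording the hashes at compile time is nearly free; the re-hashing cost lands on cargo's rebuild check.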

adam-azarchs commented 4 months ago

It does when building the first time.

Well, yes, but the problem is that cargo's definition of "the first time" includes some situations that it really shouldn't, and then it still uses the mtime + absolute path as the fingerprint for resulting outputs, so it still ends up unnecessarily invalidating downstream caches.

The first attempt at implementing this did actually modify rustc to include the source hashes in the dep info file and then lets cargo hash it again on later builds to check if the file is changed as this is the only race free way of implementing it.

Ok, yes, that makes sense (in general; I'm mostly focused on the CI scenario where such races should be impossible). And rustc is already reading all of the file content, so it's relatively cheap to hash as it does so. But it would still be cargo doing the hashing on subsequent builds to check whether a rebuild was required.

I think the most common use case for this feature request is to be able to do useful caching of the target directory in CI builders, e.g. GitHub Actions. It's essentially pointless to cache any build outputs from within the workspace, because the mtime will be different every time the repo is checked out, so the cache will never be valid. We have some workspaces with dozens of crates, and most of the "core" crates that everything else depends on rarely see any changes, so it's frustrating to see cargo rebuilding them every time.

bjorn3 commented 4 months ago

Well, yes, but the problem is that cargo's definition of "the first time" includes some situations that it really shouldn't, and then it still uses the mtime + absolute path as the fingerprint for resulting outputs, so it still ends up unnecessarily invalidating downstream caches.

Non-local dependencies (including vendored dependencies) are always assumed to be up to date if any compiled artifacts exist. It will never check the mtime, the absolute path, .cargo-checksum.json, the file hashes in the dep-info file, or anything else involving the actual source files. In fact, the fingerprint for non-local dependencies doesn't even contain the list of source files (target/debug/.fingerprint/mycrate-hash/dep-lib-mycrate is empty for non-local dependencies). And if compiled artifacts don't exist, it will check .cargo-checksum.json before building, but will rebuild regardless of whether source files were modified.

Xaeroxe commented 4 months ago

Hi, I've made a tracking issue for this to go along with my two PRs (one for cargo and one for rustc): https://github.com/rust-lang/cargo/issues/14136