rust-lang / cargo

The Rust package manager
https://doc.rust-lang.org/cargo
Apache License 2.0

(Option to) Fingerprint by file contents instead of mtime #6529

Open illicitonion opened 5 years ago

illicitonion commented 5 years ago

Describe the problem you are trying to solve

The primary problem I have is that when building my code on travis, the actual code in my workspace builds every time, even though much of it hasn't changed and I have target directory caching on. The reason is that travis makes a new clone of my git repo, which doesn't preserve mtimes. This can add about 5 minutes to every travis run. My project is mixed rust and non-rust code, so this adds 5 minutes to those runs even if no rust code has been affected. I started futzing with mtimes, but that seems fragile and doesn't address the root of the problem.

Additionally, edit-undo loops cause re-compilation locally, which is a little annoying.

Describe the solution you'd like

Add a new LocalFingerprint::ContentBased(Digest, PathBuf) variant to https://github.com/rust-lang/cargo/blob/b84e625c4401be3893a4d6d44d5dbac82b33f1c4/src/cargo/core/compiler/fingerprint.rs#L204-L209 which reads the content of the PathBuf, passes it through a SipHasher, and mixes that into any aggregate fingerprints. Use this instead of LocalFingerprint::MtimeBased.
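A minimal sketch of what such a variant and its dirtiness check could look like; the Digest newtype, digest_file helper, and is_dirty method below are illustrative assumptions, not cargo's actual fingerprint code:

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::Hasher;
use std::io;
use std::path::{Path, PathBuf};

// Hypothetical digest newtype for illustration; cargo's fingerprint code has its own types.
#[derive(Clone, Copy, PartialEq, Eq)]
struct Digest(u64);

enum LocalFingerprint {
    // ...existing variants such as MtimeBased elided...
    /// Proposed: fingerprint a path by a hash of its contents.
    ContentBased(Digest, PathBuf),
}

/// Read a file and hash its bytes with a fast, non-cryptographic hasher
/// (std's DefaultHasher; the exact algorithm is unspecified by std).
fn digest_file(path: &Path) -> io::Result<Digest> {
    let mut hasher = DefaultHasher::new();
    hasher.write(&fs::read(path)?);
    Ok(Digest(hasher.finish()))
}

impl LocalFingerprint {
    /// A ContentBased entry is stale only if the bytes on disk now hash differently.
    fn is_dirty(&self) -> io::Result<bool> {
        match self {
            LocalFingerprint::ContentBased(stored, path) => Ok(digest_file(path)? != *stored),
        }
    }
}
```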

Notes

This will probably slow down no-op builds slightly (in some circumstances, such as with large build script inputs over NFS, significantly), so it may want to be behind a flag (perhaps --fingerprint-strategy={mtime,content}).

This would probably also make more shared caching (that people talk about a lot, most recently at https://internals.rust-lang.org/t/idea-cargo-global-binary-cache/9002) easier.

I'd be happy to implement this if the PR is likely to be accepted :)

This would probably also fix

alexcrichton commented 5 years ago

Funnily enough this is also highly related to recent discussions on https://github.com/rust-lang/cargo/issues/2426 and https://github.com/rust-lang/cargo/pull/6484 for entirely different reasons (filesystems with second-level mtime granularity and modifications/builds all happen in the same second).

cc @Eh2406, @matklad, this might be a good tracking issue for this!

Eh2406 commented 5 years ago

At the moment I would start by removing the mtime-based system and replacing it with a hash-based one (except for the corner cases where we have to use mtime, as we don't know which files to hash until after the build; cc #5918). Then I would see how big a perf impact this has in reality. If it is small then grand; if not, then we need a hybrid system that uses mtime to decide whether to hash.

I was thinking of trying to impl this soon, but would much prefer to help you do it @illicitonion. I seem to recall someone had an abandoned branch with a start... was it @ehuss?

Do we want to use SipHasher or a different hashing scheme that is designed for fingerprinting? I only ask because of this: "Although the SipHash algorithm is considered to be generally strong, it is not intended for cryptographic purposes. As such, all cryptographic uses of this implementation are strongly discouraged."

alexcrichton commented 5 years ago

We're not really too interested in cryptographic hashing here for its security properties, so I think picking any reasonably fast algorithm should be fine (and I think SipHasher is reasonably fast as well). I would personally prefer we not wholesale switch away from mtimes just yet, but having this as at least a runtime option to choose from is a great start!

dwijnand commented 5 years ago

Would be interesting to compare it to using FxHash.

ehuss commented 5 years ago

I'm not sure what I did with my old branch, so I'm just going off memory. My original attempt tried to only hash the contents if the mtime was near the start of compile time. However, since it is not known which files to hash until after compilation, it didn't work out too well. I don't know how to solve that problem. I guess one option is to pre-hash every file in the package directory and maybe hash anything that was included from outside after compilation? But that sounds like it could be very problematic. Or maybe hash protection is just not available for the very first compile (or in other words the second compile is not guaranteed to be correct). Maybe another option is to run --emit=dep-info first, which seems to be pretty fast, but there might be some races still.

I don't know of any easy answers here.

Eh2406 commented 5 years ago

I just came across a long blog post on the troubles with using mtime in build tools. It discusses using only hashes, points out that Go switched to that, but recommends a hybrid approach, which is surprisingly similar to what we are doing with the fingerprint hash as a database.

fluffysquirrels commented 5 years ago

That blog post [1] is an excellent reference! It describes some of the problems with using mtimes and hashes in build systems and outlines the approach used by the tool redo, roughly:

I found it helpful to consider the sequence number for an output as a logical time that we control, so it should avoid problems with low resolution, mtime going backward, builds being missed.

This wouldn't stop the false-positive builds we see on CI systems that don't preserve mtimes, but to avoid these we could check hashes of source files when their metadata changes. One of the problems claimed in [1] is that hashing large build output files (to confirm downstream should rebuild) is expensive, but we could benchmark this to examine the trade offs. Maybe we could get away with hashing source files only?

[1]: "mtime comparison considered harmful"

gerion0 commented 4 years ago

Please also have a look at chash. This is a mechanism for a meaningful fingerprint (based on the AST of the program). Maybe this can be adapted for cargo as well.

d33tah commented 4 years ago

I was just hit by this bug. Docker doesn't refresh timestamps of already existing files, and to cache dependency builds I had hacked around it with a placeholder echo fn main() {} > src/main.rs. It was pretty painful to debug, and the touch I added as a workaround looks clunky.

Steps to reproduce

  1. Create the following files:

Dockerfile

FROM rust
COPY project/Cargo.toml project/Cargo.toml
RUN cd project && mkdir src && echo 'fn main() { println!("hello1"); }' > src/main.rs && cargo build --release
COPY project/src project/src
RUN cd project && cargo run --release

project/Cargo.toml


[package]
name = "hello"
version = "0.1.0"
authors = ["Jacek Wielemborek"]
edition = "2018"

[dependencies]
flate2 = "*"

project/src/main.rs
```rust
fn main() {
    println!("Hello2");
}
```

  2. Run docker build .

Expected:

Hello2 is printed

Actual result:

hello1 is printed

Workaround

Before building, do touch project/src/main.rs

matklad commented 4 years ago

I think this also has a chance of speeding up some dev workflows around switching branches. Specifically, I often check out branches temporarily to look at PRs (without building them). Sometimes I also run git switch master && git pull --rebase && git switch my-feature-branch && git rebase master. I think both of these workflows lead to busted mtimes for a bunch of files which haven't actually changed between compilations. For sprawling workspaces (like the rust-analyzer one), I hypothesize that this leads to some unneeded crate rebuilds.

gilescope commented 4 years ago

Has anyone got a PR or partial WIP PR on this yet? Having the option to do this would be huge for Docker caching, which becomes more important for enterprises as Rust builds get bigger. Would love to be able to test this out in nightly under a -Z flag.

Eh2406 commented 4 years ago

If there was progress it would have been linked here. As I recall I thought it could be done straightforwardly, but the fact that Eric's branch did not work suggests I was Dunning-Krugering myself. My memory of the approach was to change the code that reads the mtime to return a new type (mtime, Option<hash>). If the -Z flag is set, fill in the hash. Thread that type all the way through. When we get to the compare step, check the -Z flag for which to use. If you are offering to try, that is where I would start.
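A rough sketch of that shape (the FileStamp type, the stamp/unchanged helpers, and the flag plumbing are made up for illustration; the real change would thread this through cargo's fingerprint code):

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::Hasher;
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// What the mtime-reading code would return instead of a bare mtime.
struct FileStamp {
    mtime: SystemTime,
    /// Only populated when the hypothetical -Z content-fingerprint flag is set.
    content_hash: Option<u64>,
}

fn stamp(path: &Path, hash_contents: bool) -> io::Result<FileStamp> {
    let mtime = fs::metadata(path)?.modified()?;
    let content_hash = if hash_contents {
        let mut h = DefaultHasher::new();
        h.write(&fs::read(path)?);
        Some(h.finish())
    } else {
        None
    };
    Ok(FileStamp { mtime, content_hash })
}

/// At the compare step, prefer the hash when both sides have one; otherwise fall back to mtime.
fn unchanged(old: &FileStamp, new: &FileStamp) -> bool {
    match (old.content_hash, new.content_hash) {
        (Some(a), Some(b)) => a == b,
        _ => old.mtime == new.mtime,
    }
}
```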

gilescope commented 4 years ago

Was just checking there's no PR out there to build on. If the mtime's not the same, we could cheaply check whether the file size is the same before doing an expensive hash of the contents - that could cut the perf costs down a bit.
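Something like this cheap pre-filter, as a sketch with assumed recorded_mtime/recorded_size fields from the previous build:

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

enum Verdict {
    Unchanged,    // mtime matches the recorded one; assume no change
    Changed,      // size differs, so the content must have changed
    HashToDecide, // mtime differs but size matches: only a content hash can tell
}

/// Cheap checks before an expensive content hash, given what was recorded at the last build.
fn prefilter(path: &Path, recorded_mtime: SystemTime, recorded_size: u64) -> io::Result<Verdict> {
    let meta = fs::metadata(path)?;
    if meta.modified()? == recorded_mtime {
        return Ok(Verdict::Unchanged);
    }
    if meta.len() != recorded_size {
        return Ok(Verdict::Changed);
    }
    Ok(Verdict::HashToDecide)
}
```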

slonopotamus commented 4 years ago

I believe this problem has already been extensively studied in the past, so there's no need to reinvent the solution; we can just pick a good existing one. For example, git's index is known to store mtime+ctime+device+inode+hash.

gilescope commented 4 years ago

ctime, device and inode will thwart trying to do clever things like copying the target dir and putting it back on a different machine/dir. When the mtime check fails, it's probably best to rely on intrinsic properties of the content (size/hash). For a non-cryptographic hash, redox's seahash seems pretty quick.

ibaryshnikov commented 3 years ago

mtime comparison considered harmful linked above is an excellent blog post! I ran into one of the problems described there today:

If you accidentally set your system clock ahead by a day and build some stuff, then set your clock back to the present, all the stuff you built during that time will show up as "in the future" and therefore newer than source files you edit today, preventing all rebuilds

illicitonion commented 3 years ago

I finally got around to taking a look at this, and have put together a draft branch. It's a relatively simple change, and not particularly invasive. The top three commits each have timings of clean builds of cargo (which transitively contains 588 files) at that commit, averaged over 150 runs each, on a 13" MacBook Pro with the source code on an SSD. The tl;dr is that the extra time taken to digest files is in the noise. The top commit is optional, and probably not worth proposing. The second commit is the implementation of this feature.

A quick summary of the approach is: When parsing .d files, digest each file mentioned and include its digest in the generated dep-info file. When a fingerprint is invalid due to mtime, compare the current digest with that in the dep-info file and don't re-build if the digests are the same.
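As a sketch of that flow (the DepEntry type and the naive .d parsing below are illustrative only; real .d files need proper escaping and continuation handling):

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::Hasher;
use std::io;
use std::path::PathBuf;

/// One entry of a hypothetical dep-info record: the path rustc reported in the
/// .d file plus a digest of its contents taken while parsing that .d file.
struct DepEntry {
    path: PathBuf,
    digest: u64,
}

fn digest(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(bytes);
    h.finish()
}

/// Naive parse of a makefile-style `.d` file ("out: dep1 dep2 ..."), digesting each dep.
fn parse_dep_file(dep_file: &str) -> io::Result<Vec<DepEntry>> {
    let text = fs::read_to_string(dep_file)?;
    let mut entries = Vec::new();
    for line in text.lines() {
        let Some((_, deps)) = line.split_once(':') else { continue };
        for dep in deps.split_whitespace() {
            entries.push(DepEntry {
                path: PathBuf::from(dep),
                digest: digest(&fs::read(dep)?),
            });
        }
    }
    Ok(entries)
}

/// Called only when the mtime check says "stale": rebuild only if some digest actually changed.
fn really_stale(entries: &[DepEntry]) -> io::Result<bool> {
    for e in entries {
        if digest(&fs::read(&e.path)?) != e.digest {
            return Ok(true);
        }
    }
    Ok(false)
}
```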

I don't think that branch is ready to propose as-is. There's one issue I think we should work out, a "how to roll out" question, and one corner I think we can leave unfinished but maybe folks have some ideas on how to approach:

The blocking issue

The current idea is that when cargo is parsing the .d files produced by rustc, for each file which is mentioned, cargo will read and digest that file, and store the digest in the dep-info file for the target along with the list of paths. But this causes a race condition - what if the file was changed on-disk after rustc consumed it? cargo currently avoids this race-condition by using the mtime at the start of the build, so that it may get spurious re-builds if the file was changed mid-build but won't miss changes.

I think the correct solution here is to teach rustc how to optionally emit a digest of the file it actually read. It already read the file anyway, so this adds only a trivial cost (my benchmarks showed that time spent reading the files dominated time spent hashing the files by orders of magnitude). There's a question as to how rustc should communicate this digest to cargo though. Some options that come to mind:

How to roll out

The cargo dep-info files currently have a hard-to-migrate schema (they store length-prefixed raw bytes for each path, with no tags or similar). I propose the simplest way forward is to create schema version 2, have a flag to flip between them, and simply look at dep-2-lib-crate_name if the flag is set to 2 and dep-lib-crate_name if using the current behaviour. We won't bother with falling back to old schemas - you wouldn't be able to share cache hits across schema versions.

Alternatively we could start using a more sophisticated schema for this data file (e.g. some existing or novel approach which would allow for optional fields), but it's probably not worth that much thought.
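For illustration only, a "schema version 2" entry could tag each path with an optional digest; this layout is invented here and is not cargo's actual dep-info encoding:

```rust
use std::io::{self, Write};

/// Hypothetical "schema version 2" encoder for one dep-info entry.
fn write_dep_entry_v2<W: Write>(out: &mut W, path: &[u8], digest: Option<u64>) -> io::Result<()> {
    out.write_all(&[2u8])?;                              // schema version tag
    out.write_all(&(path.len() as u32).to_le_bytes())?;  // path length
    out.write_all(path)?;                                // path bytes
    match digest {
        Some(d) => {
            out.write_all(&[1u8])?;                       // "digest present" marker
            out.write_all(&d.to_le_bytes())?;             // content digest
        }
        None => out.write_all(&[0u8])?,                   // no digest recorded
    }
    Ok(())
}
```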

The unfinished corner

Currently for build scripts, we don't digest files emitted as rerun-if-changed - similar to with rustc, there's a race condition where we can't be sure we're digesting the right file. Accordingly, build scripts have the same mtime-based rerun behaviour as they currently do. I don't think we can easily automatically solve this, but I think that fallback is also fine.

In the long-term, build scripts may evolve into something more declarative rather than being so imperative (where they could pre-declare what the build script would need to access, so we could properly track the files), or we could provide some mechanism for build scripts to tell cargo what digest should be used for a file which is rerun-if-changed (which needn't be a pure content digest - e.g. a build script could decide to strip comments out of files it reads before digesting them).

Eh2406 commented 3 years ago

Thank you for the clear explanation of the blocking issue with this approach! I think there was some work towards a "rustc in the loop" version at https://github.com/rust-lang/cargo/pull/8623; pushing that over the finish line would be wonderful!

gilescope commented 3 years ago

@illicitonion maybe we can join forces. I got a bit distracted but got quite far down this line. Have a look at https://github.com/rust-lang/rust/pull/75594 - I will have a look at yours too.

I got a bit bogged down on the cargo side dealing with the rerun-if-changed bits, but it was looking pretty promising and I'm very happy with the rustc changes.

illicitonion commented 3 years ago

Gosh, I'm not sure how I missed that #8623 came into existence - I'll do some catching up on it soon and get back to you, very happy to work together on things if we can!

gilescope commented 3 years ago

I just want it solved; I don't mind how we do it. One thing I did notice though is that to stop rebuild cascades we do need to get the rerun-if-changed stuff right. I don't think there's much we can do about old-style rerun-if-changed, but with the new style I think we can be pretty efficient (and we can then PR-blitz major crates that still use the old style, e.g. https://github.com/rust-lang/libc/pull/2138/files).

adam-azarchs commented 2 years ago

A particular(ly infuriating) edge case for me is when using vendored dependencies, where cargo already has, and verifies, a JSON file with cryptographic (sha256) hashes of every file in the crate. And yet it still rebuilds based on mtime.

cipriancraciun commented 2 years ago

Currently for build scripts, we don't digest files emitted as rerun-if-changed - similar to with rustc, [...]

I got to this issue because I generate some css files from scss in build.rs, so rerun-if-changed using fingerprinting for change detection is absolutely necessary for such a use-case. (I wouldn't mind just outputting something like cargo:rerun-if-changed-checksum=some-hash:7e0f5f3b24ce5f889eb2a27ed00dae49:some-path.)
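For illustration, a build.rs emitting that proposed directive might look like the sketch below; note that cargo:rerun-if-changed-checksum does not exist today, and the scss path is a placeholder:

```rust
// build.rs - illustrating the *proposed* (not existing) directive from the comment above.
// Today cargo only understands cargo:rerun-if-changed=PATH.
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::Hasher;

fn main() {
    let path = "styles/main.scss"; // assumed input of the scss -> css step
    let mut h = DefaultHasher::new();
    h.write(&fs::read(path).expect("read scss input"));
    // What cargo supports today (mtime-based):
    println!("cargo:rerun-if-changed={path}");
    // What is being asked for (content-based), purely hypothetical:
    println!("cargo:rerun-if-changed-checksum=default-hasher:{:016x}:{path}", h.finish());
}
```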

BTW, since #8623 was just closed, I'm assuming there is no progress on this topic anymore?

gilescope commented 2 years ago

I’m tired. The ideas are fine but it really needs chopping up into a few easier to land PRs.


Xanewok commented 2 years ago

@gilescope I could help doing some of that work.

gilescope commented 1 year ago

Feel free to give it a go. I think the first PR is the rustc one - a cut-down PR that just adds in the hash when creating the libraries. I don't think landing that would need a corresponding cargo PR.


strophy commented 1 year ago

I think I'm seeing this issue when trying to restore the cargo cache from S3 to a Docker GitHub Actions runner hosted in EC2. After restoring the cache mount to the buildkit cache context using this approach, cargo unpacks a lot of files into the $CARGO_HOME/registry/src directory. These files are newer by less than half a second relative to the cache age. Is there currently any way I can work around this? Example of the libz-sys dep getting marked as outdated even though it isn't:

#21 1.604 [2023-05-22T11:43:00Z DEBUG cargo::core::compiler::fingerprint] new local fingerprints deps "/usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/libz-sys-1.1.9"
#21 1.604 [2023-05-22T11:43:00Z DEBUG cargo::core::compiler::fingerprint] max output mtime for "/usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/libz-sys-1.1.9" is "/platform/target/release/build/libz-sys-5ecc07734e87e07a/build_script_build-5ecc07734e87e07a" 1684755709.566429337s
#21 1.604 [2023-05-22T11:43:00Z DEBUG cargo::core::compiler::fingerprint] max dep mtime for "/usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/libz-sys-1.1.9" is "/platform/target/release/deps/libpkg_config-aff5b77a7472bb47.rlib" 1684755709.314425018s
#21 1.604 [2023-05-22T11:43:00Z DEBUG cargo::core::compiler::fingerprint] max dep mtime for "/usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/libz-sys-1.1.9" is "/platform/target/release/deps/libcc-afd26af49447c5d9.rlib" 1684755709.418426800s
#21 1.604 [2023-05-22T11:43:00Z DEBUG cargo::core::compiler::fingerprint] all paths up-to-date relative to "/platform/target/release/.fingerprint/libz-sys-5ecc07734e87e07a/dep-build-script-build-script-build" mtime=1684755708.994419534s
#21 1.604 [2023-05-22T11:43:00Z DEBUG cargo::core::compiler::fingerprint] filesystem up-to-date "/usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/libz-sys-1.1.9"
#21 1.604 [2023-05-22T11:43:00Z DEBUG cargo::core::compiler::fingerprint] max output mtime for "/usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/libz-sys-1.1.9" is "/platform/target/release/build/libz-sys-79794549bcb2869b/output" 1684755709.198423031s
#21 1.604 [2023-05-22T11:43:00Z DEBUG cargo::core::compiler::fingerprint] max dep mtime for "/usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/libz-sys-1.1.9" is "/platform/target/release/build/libz-sys-5ecc07734e87e07a/build_script_build-5ecc07734e87e07a" 1684755709.566429337s
#21 1.604 [2023-05-22T11:43:00Z INFO  cargo::core::compiler::fingerprint] dependency on `build_script_build` is newer than we are 1684755709.566429337s > 1684755709.198423031s "/usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/libz-sys-1.1.9"

scottlamb commented 1 year ago

My workaround is an external tool. It's currently in a private repo, but here's the tl;dr:

/// `retimer` saves modification times (mtimes) of source files, and restores them only if the
/// contents match a hash.
///
/// `retimer`'s intended workflow is as follows:
///
/// 1. restore `.retimer-state` and `target` from CI cache, if available
/// 2. `retimer restore`
/// 3. `cargo build`
/// 4. `retimer save`
/// 5. save `.retimer-state` and `target` in CI cache for the next run
///
/// This allows `cargo`'s mtime-based caching to work properly with CI/git.

adam-azarchs commented 1 year ago

My workaround is more aggressive but doesn't require storing a state file:

  1. set all mtimes in source directories to a specific date in the past
  2. set all mtimes for cached files in the target directory to another specific date in the past (1 day later than the first)

This does however require that one be careful about which files are included in the cache, as it can result in cargo not rebuilding things that it should.
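A std-only sketch of that workaround (directory names and dates are placeholders; File::set_modified requires Rust 1.75+):

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Recursively force every file under `root` to a fixed mtime.
fn set_mtimes(root: &Path, mtime: SystemTime) -> io::Result<()> {
    for entry in fs::read_dir(root)? {
        let path = entry?.path();
        if path.is_dir() {
            set_mtimes(&path, mtime)?;
        } else {
            fs::OpenOptions::new().write(true).open(&path)?.set_modified(mtime)?;
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Arbitrary fixed dates; all that matters is that target/ is "newer" than src/.
    let src_time = UNIX_EPOCH + Duration::from_secs(1_600_000_000);
    let target_time = src_time + Duration::from_secs(24 * 60 * 60);
    set_mtimes(Path::new("src"), src_time)?;
    set_mtimes(Path::new("target"), target_time)?;
    Ok(())
}
```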

strophy commented 1 year ago

@scottlamb I read your description of retimer here (and your detailed comments on #2644) with great interest, I am currently struggling with mtimes while restoring cargo cache from S3 into a docker cache mount. Are you planning on publishing this tool, or would you be able to share more details/code so I could try and work it into a GitHub Action I'm working on for this purpose?

obi1kenobi commented 1 year ago

git-restore-mtime might also be of interest and possibly address part of this need: https://github.com/MestreLion/git-tools/blob/main/git-restore-mtime

From its docs:

Change the modification time (mtime) of files in work tree, based on the
date of the most recent commit that modified the file, including renames.
scottlamb commented 1 year ago

@strophy Hmm, it's work done for my employer, so I should do a quick check, but I can probably publish it.

@obi1kenobi I wouldn't recommend git-restore-mtime. Consider building branch A with a cache from branch B. If branch A's max timestamp for a given package is before the cache time, cargo incorrectly considers the build result up-to-date. The build result will be incorrect.

obi1kenobi commented 1 year ago

At least within GitHub Actions, I don't think that can happen: GitHub Actions' caches are organized so that all branches see the cache of the default branch (e.g. main) plus their own branch only, and no other branches' caches. I also believe that which of the main branch's caches are visible involves a merge-base check too. So in order to trigger that bug:

While that could happen, it doesn't seem that likely unless done in adversarial fashion.

In any case @scottlamb, I'd love to use retimer as well! I love using tools built by people that genuinely care about bugs and edge cases, which is also why I love writing Rust 😁

strophy commented 11 months ago

@obi1kenobi I didn't know GitHub Actions restricted visibility of cache like that, but it makes good sense. Unfortunately we run builds in docker containers on EC2, and use S3 buckets as cache for performance reasons, so this visibility restriction doesn't exist. I set up a trivially small repo which also runs git-restore-mtime in CI and can easily reproduce the issue with a simple println!("output"); statement. If I push to branch B with "new output", then push to branch A with "old output", then run CI from branch B again, I get an incorrect build result with "old output".

@scottlamb I'm inexperienced with Rust (I'm in devops) so your comment above with the branch A/B example now makes sense to me after working through it as an example. I think my only way forward is to try to re-implement your retimer concept, but I thought I'd check one more time if you are able to share the tool with us here to tide the community over until cargo supports caching by file hash? If not, do you have any further pointers, is it implemented in Rust or shell script, etc?

alexpyattaev commented 7 months ago

Just to calm everyone concerned about performance - Scons uses content hash to determine if files need to be rebuilt, and it seems to work quite well for them. Modern NVMe disks are crazy fast, and rustc is not exactly fast. So wasting a bit of IO bandwidth should not matter on machines that build large rust codebases.

cipriancraciun commented 7 months ago

Just to calm everyone concerned about performance [...] modern NVMe disks are crazy fast [...]

Not to mention that on most operating systems, the source code most likely lives in the kernel file-system cache, thus the I/O doesn't even reach the actual disk (be it NVMe or rotational).

(For example I use watchexec and rsync with checksumming to copy the source code to a remote server where the actual build is being done, and I don't see any slowdown due to the actual checksumming.)

bjorn3 commented 7 months ago

On Windows reading a lot of small files is really slow even when they are cached in memory.

alexpyattaev commented 7 months ago

Do you have numbers to back this up? I have no idea how fast/slow the Windows NTFS implementation is, but here are the numbers I get when running blake2 on the axum codebase (25K LOC):

HOSTNAME /w/t/axum (main)> pwd
/windows/tmp/axum
HOSTNAME /w/t/axum (main)> mount | grep windows
/dev/nvme0n1p3 on /windows type fuseblk (rw,nosuid,nodev,noatime,user_id=0,group_id=0,default_permissions,allow_other,blksize=4096)
HOSTNAME /w/t/axum (main)> time find -type f -name '*.rs' -exec b2sum '{}' \; > sums.txt

________________________________________________________
Executed in  450.93 millis    fish           external
   usr time   81.04 millis    0.20 millis   80.84 millis
   sys time  258.97 millis    1.19 millis  257.78 millis

TL;DR: 260 ms for a huge crate like axum is literally a rounding error, and this is done in one thread. Granted, if you were using a hard disk it would be a bit slower, but I do not have a system that old.

bjorn3 commented 7 months ago

Isn't the correct number for hashing 450 ms? 260 ms is only the time spent inside the kernel. Also note that the cost scales with the total size of the cargo workspace plus all path dependencies, so even if the crate that needs to be rebuilt gets rebuilt in 50 ms, it may still take several seconds to hash all source files in the entire workspace to determine what should be rebuilt.

An option would be to do the mtime check, and only compare hashes if the mtime check indicates a change. This avoids false positives due to rounding of the timestamps by the fs, while still being pretty cheap in the common case where there would be no false positives in the mtime check.

alexpyattaev commented 7 months ago

Correct, my bad. The point was that even NTFS is not terrible enough for content hashing to be infeasible (at least for .rs files), even on a relatively huge crate that one should never create. Axum is actually made up of 3 smaller crates, and some of the code in that test was in examples. Checking mtime may be a nice way to avoid hashing files unnecessarily, but in this case mtime needs to be checked for change, not for greater-than. For more details I can recommend reading this part of the Scons manual: https://scons.org/doc/production/HTML/scons-user/ch06.html#id1309

adam-azarchs commented 7 months ago

I suspect (though haven't actually tested) that much of the slowness that NTFS experiences with accessing many small files comes down to metadata access, meaning that the correct question is less "how long does it take to hash these files?" and more "how long does it take to hash them compared to simply checking their timestamps?"

polarathene commented 6 months ago

Just chiming in that the Nov 2019 Dockerfile reproduction shared earlier in this issue still reproduces today.

I wanted to understand it a bit better in regards to mtime and found the following:

FROM rust:alpine
WORKDIR /project
COPY Cargo.toml Cargo.toml

RUN <<EOF
  # Get GNU `touch` for supporting relative date input:
  apk add coreutils

  mkdir src
  echo 'fn main() { println!("hello1"); }' > src/main.rs
  touch -d '1 hour ago' src/main.rs
  cargo run --release > /tmp/log.txt
EOF

# Copy over the `src/main.rs` host copy (Hello2)
# 
COPY src/ src/
RUN <<EOF
  # Setting an mtime newer than the previous src/main.rs file is not sufficient;
  # the mtime must be newer than the mtime cached in the target dir when a build
  # was previously invoked (which updated the mtime). It would work if there were
  # no offset to the mtime; this is just to demonstrate that the file mtime alone
  # is not the only context involved.
  touch -d '30 seconds ago' src/main.rs
  cargo run --release >> /tmp/log.txt

  # Find the hello crate `dep-bin-hello` file,
  # then adjust its mtime to be older than src/main.rs
  apk add fd
  fd --hidden dep-bin target/ --exec touch -d '60 seconds ago'
  cargo run --release >> /tmp/log.txt
EOF
CMD ["cat", "/tmp/log.txt"]
$ docker build --tag build-bug .

$ docker run --rm build-bug
hello1
hello1
Hello2

The original comment implied that making the mtime newer than the src/main.rs created earlier during the image build would fix it (since the host's src/main.rs would always have an older mtime when the image's src/main.rs invalidates its RUN instruction). However, what actually matters is the mtime of the dep-bin-hello fingerprint file target/release/.fingerprint/hello-f86c91e176f7e778/dep-bin-hello (the hash in the path varies).

The same can be done for dep-lib-*. There is also invoked.timestamp, but I don't see it having any impact?


Regarding the actual issue, I've cloned a git project again and copied its files over with the newer mtime, but it doesn't seem to make much difference with external deps, while internal ones will compile again due to the above logic with the fingerprint dep-* timestamps being older.


So just to confirm:

The retimer approach improves on git-restore-mtime by saving an extra metadata file to cache which uses a content hash with mtime to restore, and presumably some other logic to avoid that target cache mtime mishap?

adam-azarchs commented 6 months ago

There are other use cases where restoring the mtime is impractical. mtime is also famously unreliable. Hashing, on the other hand, is guaranteed[^1] to tell you correctly whether or not the content has changed.

A compromise is also feasible: to mitigate the performance cost of hashing, one can record the mtimes (at whatever granularity seems practical) when hashes were last computed, and only go to the trouble to recalculate hashes if the mtime changes. This is more or less what git is doing under the hood to avoid needing to check the content of every file in the worktree all the time. But, still use the actual file hashes to determine if a rebuild is necessary, so a simple touch will never cause a rebuild. So you aren't paying a huge cost for hashing except on the first build in a new clone, but you still get stuff rebuilt when (and only when) needed.

[^1]: Yes, hash collisions are technically possible. But if you think that matters then you should probably not trust git, which relies heavily on sha1 checksums for representing file contents.

vlovich commented 4 months ago

I think I'm hitting this quite a bit - as part of interactive rebases fixing up my history, I frequently get compilation errors because cargo is reusing an older crate dependency even though that crate has had modified files. I'm kind of surprised though that mtime would be causing that as the file size should be changing too - is Cargo not even using the file size as a prefilter for mtime logic?

As for hashing, I would recommend XXH3 - the 64-bit version is very good and the 128-bit version is basically the same speed but lets you get even stronger protection if you give up back compat with 64-bit hashes. It's ~20x faster than SipHash.

adam-azarchs commented 4 months ago

The time it takes to hash files is usually going to be dominated by I/O. Faster hash algorithms are fun to talk about but ultimately irrelevant; I would probably go with sha1 simply because it leaves the option open to take shortcuts when using a git repository, since git already knows the hashes of the files. However, in the case of vendored crates, I really want to emphasize that it's already creating a .cargo-checksum.json file with sha256 checksums, and verifying it on build. So, we're already paying the cost of a sha256 hash, but it'll still recompile if the mtime changes.

overlookmotel commented 4 months ago

@scottlamb Can I ask, did you ever publicly release "retimer"? Or would you be able/willing to do so?

bjorn3 commented 4 months ago

I would probably go with sha1 simply because it leaves the option open to take shortcuts when using a git repository, since git already knows the hashes of the files.

I believe git adds a header to blobs before hashing them. While sha1 allows adding trailing bytes with minimal extra work (enabling length-extension attacks), this is not the case for leading bytes. Adding leading bytes requires recomputing from scratch.
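For reference, a sketch of the preimage git actually hashes for a blob (the SHA-1 step itself would come from a hashing crate, omitted here):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Git hashes blobs as sha1("blob <byte-len>\0" ++ contents), so a plain sha1 of
/// the file bytes does not match `git hash-object` and cannot be extended into it.
fn git_blob_preimage(path: &Path) -> io::Result<Vec<u8>> {
    let contents = fs::read(path)?;
    let mut preimage = format!("blob {}\0", contents.len()).into_bytes();
    preimage.extend_from_slice(&contents);
    Ok(preimage) // the SHA-1 of this buffer is what git stores as the blob id
}
```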

vlovich commented 4 months ago

I like the idea of using git, but it only knows the sha1 of tracked & unmodified files. You'd still have to detect dirtied / untracked files & handle them correctly. It's probably wiser to just make sure that the algorithm is stored in the metadata per file so that it can be heterogeneous & swappable (e.g. using git's metadata for the majority of files & falling back to something else for untracked / modified files).
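A sketch of what "algorithm stored in the metadata per file" could mean; the type names are invented for illustration:

```rust
use std::path::PathBuf;

/// Hypothetical per-file digest record that remembers *which* algorithm produced it,
/// so entries sourced from git's index can coexist with freshly computed hashes.
enum FileDigest {
    /// Taken from git for tracked, unmodified files (hex blob id).
    GitSha1(String),
    /// Computed by the build tool for untracked or modified files.
    Fast64(u64),
}

struct TrackedFile {
    path: PathBuf,
    digest: FileDigest,
}

/// Two records only count as "unchanged" when the algorithm matches too;
/// with mixed algorithms the digest has to be recomputed rather than compared.
fn same_content(a: &FileDigest, b: &FileDigest) -> Option<bool> {
    match (a, b) {
        (FileDigest::GitSha1(x), FileDigest::GitSha1(y)) => Some(x == y),
        (FileDigest::Fast64(x), FileDigest::Fast64(y)) => Some(x == y),
        _ => None, // not comparable: fall back to recomputing
    }
}
```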

As for the speed difference being negligible, I'd caution that codebases will often sit within the page cache, so assuming an I/O penalty for reads may be a faulty assumption. But as I said, you'd want to only bother computing a hash when the file size is the same as it was previously.

The best approach is to have a daemon that monitors for change events live so that you don't even need to bother hashing unless the daemon restarts. That's what Bazel and Buck do.

I do wish filesystems just implemented a 64-bit counter that incremented on any mutation of a file. That would solve all these constant problems in a super-cheap way without needing a complicated daemon & not have any performance or correctness footguns (just need to save & check a mapping of inode -> version number).

bjorn3 commented 4 months ago

I do wish filesystems just implemented a 64-bit counter that incremented on any mutation of a file.

On Linux there is a counter for every time a file is recreated, which in practice is what most editors do to atomically write to a file. The combination of the inode id and i_generation is supposed to uniquely identify a single file for the entire lifetime of the filesystem, even across deletes and recreates. Not every filesystem implements it though. It also isn't affected by edits that don't recreate the file.

alexpyattaev commented 4 months ago

On another note, if you look at well-optimized IO logic, such as that found in 1BRC submissions, you can quickly see that any plausible amount of source code can be loaded into RAM in seconds, if that is what is needed. Modern machines with NVMe are not IO bound in any meaningful way, and very few people would build a huge rust codebase on a machine from the 90's.