rust-lang / cargo

The Rust package manager
https://doc.rust-lang.org/cargo
Apache License 2.0
12.73k stars 2.42k forks source link

Non-reproducible -C metadata=hash passed to rustc depending on the compiling OS #8140

Open gendx opened 4 years ago

gendx commented 4 years ago

Problem

Tock is an operating system written in Rust for embedded platforms (e.g. thumbv7em-none-eabi target). We've been trying to make builds reproducible, with good progress (https://github.com/tock/tock/issues/1666). In particular:

With these we've managed to make the builds reproducible across various Linux machines.

However, the builds are different on Linux (nightly-2020-02-03-x86_64-unknown-linux-gnu) and MacOS (nightly-2020-02-03-x86_64-apple-darwin), as evidenced by CI on a GitHub workflow (https://github.com/google/OpenSK/pull/94#issuecomment-616671549).

In particular, the issue seems to stem from a different -C metadata= (and -C extra-filename=) passed to rustc (while the steps that I mentioned above make sure that the same metadata hash is passed across Linux machines for each crate in the project).

These are built with the --verbose parameter passed to cargo, and an example rustc invocation is the following (tock_cells crate of the Tock project).

Steps

  1. git clone https://github.com/tock/tock
  2. cd tock/boards/nordic/nrf52840dk
  3. V=1 make

This will invoke cargo with the following parameters: RUSTFLAGS="-C link-arg=-Tlayout.ld -C linker=rust-lld -C linker-flavor=ld.lld -C relocation-model=dynamic-no-pic -C link-arg=-zmax-page-size=512 -C link-arg=-icf=all --remap-path-prefix=/path/to/tock/= " cargo rustc --verbose --target=thumbv7em-none-eabi --package nrf52840dk --bin nrf52840dk --release -- -C link-arg=-L/path/to/tock/boards/nordic/nrf52840dk. One of the resulting rustc invocations is detailed above. The exact /path/to/tock/ depends on the machine, but this has no impact across the Linux machines we've tested.

More detailed steps are available in the GitHub workflow (https://github.com/google/OpenSK/pull/94/checks?check_run_id=602439620, https://github.com/google/OpenSK/pull/94/checks?check_run_id=602439631)

Possible Solution(s)

The various steps that we've taken have allowed to obtain the same -C metadata=hash parameter across Linux machines, at which points the builds are reproducible across these machines. Given that when running cargo in --verbose mode we observe different metadata hashes, the computation of such hashes is likely the problem.

Notes

See also https://github.com/rust-lang/rust/issues/71361 and https://github.com/google/OpenSK/pull/94

Output of cargo version (GitHub workflow machines with actions-rs/toolchain@v1):

gendx commented 4 years ago

This seems to come from some information about rustc's version being hashed in the metadata:

https://github.com/rust-lang/cargo/blob/3dcfdeff274389823256bb5aa828abc49d79c473/src/cargo/core/compiler/context/compilation_files.rs#L657

In particular:

https://github.com/rust-lang/cargo/blob/3dcfdeff274389823256bb5aa828abc49d79c473/src/cargo/core/compiler/context/compilation_files.rs#L672-L675

I also wonder why the host hashing is skipped in the following cases:

https://github.com/rust-lang/cargo/blob/3dcfdeff274389823256bb5aa828abc49d79c473/src/cargo/core/compiler/context/compilation_files.rs#L658-L664

cc @ehuss who added this hash_rustc_version function recently in #8073.

Given that my tests are with cargo 1.42.0-nightly (9d32b7b01 2020-01-26), it might be that this issue is fixed by the more recent #8073 - so I'll give a try to update to a more recent cargo as well.

ehuss commented 4 years ago

There's more discussion about this in #7873. It will likely be difficult to remove the host from the hash due to proc macros and build dependencies. One possible approach discussed in that PR is to keep the filenames different, but keep the symbols the same. I'm still uncertain if the filenames (for rlibs at least) leak into the resulting binaries.

gendx commented 4 years ago

If I understand correctly, the first concern is the following (https://github.com/rust-lang/cargo/pull/7873#issuecomment-583142337).

I'm a little concerned with how this might affect our longer term efforts to share artifacts between projects. If someone switches between different host compilers, it would overwrite the files, busting the cache. I don't really know how common that is, so it's hard to say if that's a realistic concern.

And the second one (https://github.com/rust-lang/cargo/pull/7873#issuecomment-585351343).

Right. My concern is that without --target when you swap between host compilers (like the example of gnu vs msvc, or gnu vs musl), it will cause a full rebuild if the host is not in the metadata. If it is not in the fingerprint, then it won't recompile at all which I think is the wrong behavior.

In our case:

So if I understood well the concerns, the fact that we provide an explicit --target should allow to find a good middle ground.

But if our use case is too specific, could there at least be a flag to tell Cargo to only hash the nightly-XXXX-YY-ZZ part of the host compiler version into the metadata? This flag could be unstable if needed - and we'd be happy to try it.

alexcrichton commented 4 years ago

FWIW at least from an abstract point of view I believe that we should fix this issue. If you're using the same source code for the entire toolchain (e.g. same rust version, same linker version, etc), then any host should compile the same result for a particular target.

That being said what goes into filenames/metadata/etc is pretty tricky and nuanced, so this may not be an easy "flip the switch" fix. It's a good goal to work towards, but I suspect it will require even more investigation beyond what's already at the top of @ehuss's head right now.

gendx commented 4 years ago

FWIW at least from an abstract point of view I believe that we should fix this issue. If you're using the same source code for the entire toolchain (e.g. same rust version, same linker version, etc), then any host should compile the same result for a particular target.

That being said what goes into filenames/metadata/etc is pretty tricky and nuanced, so this may not be an easy "flip the switch" fix. It's a good goal to work towards, but I suspect it will require even more investigation beyond what's already at the top of @ehuss's head right now.

Definitely makes sense. I was trying to understand what the blockers were, and indeed it seems trickier than I thought.

I'm still wondering whether it would be possible and make sense to have some experimental/unstable flag in Cargo to remove the host from the metadata hash, so that we can experiment with it? For example, it would help understand whether the metadata hash is the only blocker left to have reproducible builds across hosts, or if there are also other things to investigate.

I'm not familiar with developing Cargo, but would be happy to help adding such a flag if it makes sense.

In any case, being able to have reproducible builds for a given host compiler is already a very nice milestone, so thanks to everyone who made it possible!

saurik commented 4 years ago

I'm also staring at this; we compile our project for many architectures via many cross compile toolchains (such as for Android). We use a single "reference" copy of the compiler from a specific revision of clang to target all platforms (we are currently using the one from the Android NDK for this to make it very easy for everyone to download and install it and so that "doctoring it" would require messing with a very large number of downstream developers and the NDK itself is open source so people can get the source code and compile it from scratch), as well as reproducible sysroots (manually extracted by downloading packages from distribution repositories such as MinGW/CentOS or simple "use a specific known copy of Xcode", etc.) in order to ensure that people can reproduce all of our build artifacts.

With respect to this issue, one of the tests that we do--as in, this is actually part of our CI system (something that apparently is also what @gendx was working on)--is to compile the code on Linux and on macOS for the same target and verify that the resulting binary is the same. Since I recently started using a library written in Rust, it was a bit frustrating trying to figure out how to get our build to be reproducible correctly again. (The symptom I was running into was that because all of the symbols had hashes in them, they were being put in random orders, which caused the imports and string constants they caused to end up in slightly different orders.)

FWIW, I did figure out a workaround for the metadata issue that is working just barely enough for me, and maybe it will work for some other people: what I'm doing is setting RUSTC_WRAPPER to a shell script that iterates the command line arguments, finds the metadata=* argument, and replaces it with metadata=path/to/file.rs (which is still needed at minimum as sometimes the same crate is included multiple times). I think this technique could be used for https://github.com/google/OpenSK/pull/94, if they are still trying to go down that road.

#!/bin/bash
set -e

file=
for arg in "$@"; do
    if [[ ${arg} == ${CARGO_HOME}/* && -z ${file} ]]; then
        file=${arg##${CARGO_HOME}}
    fi
done

args=()
for arg in "$@"; do
    if [[ ${arg} == metadata=* ]]; then
        args+=(metadata="${file}")
    else
        args+=("${arg}")
    fi
done

exec "${args[@]}"

It isn't clear to me that this is really the only problem, though... the object files I'm getting out are internally in a different order, with high-level symbol groups staying together but being shuffled. I only happen to care about the reproducibility of the final stripped release production binary, and so this happens to be working well enough for me right now, but other people with different requirements--particularly if you care about the output from Cargo itself, as you are shipping the library you get in a package instead of linking it with a more deterministic linker and shipping that--might see the .a files you get being different as unacceptable.

To explain what I mean by "high-level symbol groups", I'm going to give an example; for reference, the crate I'm trying to use and compile right now is boringtun. One of the smaller files being output in libboringtun.a is boringtun.boringtun.3a1fbbbh-cgu.1.rcgu.o, and the symbols I'm getting when compiled from a Linux machine are clumped into the following order, but when compiled on macOS the symbols are in the order C B E A D. Maybe this might mean that https://github.com/rust-lang/rust/issues/71361 will need to be re-opened?

Group A:

0000000000000000 D anon.5298a2dae53d0c534847a530c196e04d.10.llvm.2086496205581072300
0000000000000000 R anon.5298a2dae53d0c534847a530c196e04d.11.llvm.2086496205581072300
0000000000000000 D anon.5298a2dae53d0c534847a530c196e04d.12.llvm.2086496205581072300
0000000000000000 D anon.5298a2dae53d0c534847a530c196e04d.13.llvm.2086496205581072300
0000000000000000 R anon.5298a2dae53d0c534847a530c196e04d.6.llvm.2086496205581072300
0000000000000000 D anon.5298a2dae53d0c534847a530c196e04d.7.llvm.2086496205581072300
0000000000000000 D anon.5298a2dae53d0c534847a530c196e04d.8.llvm.2086496205581072300
0000000000000000 R anon.5298a2dae53d0c534847a530c196e04d.9.llvm.2086496205581072300
                 U anon.7d68503fbc91ec043faab1e83b6b403a.10.llvm.4076463041423191387
                 U anon.7d68503fbc91ec043faab1e83b6b403a.12.llvm.4076463041423191387
                 U anon.7d68503fbc91ec043faab1e83b6b403a.7.llvm.4076463041423191387
                 U anon.7d68503fbc91ec043faab1e83b6b403a.9.llvm.4076463041423191387

Group B:

0000000000000000 r .Lanon.5298a2dae53d0c534847a530c196e04d.16

Group C:

0000000000000000 r .LCPI4_0
0000000000000010 r .LCPI4_1
0000000000000020 r .LCPI8_0

Group D:

                 U memcpy

Group E:

                 U _ZN3std9panicking11begin_panic17h69c46a66d66d4af1E
                 U _ZN42_$LT$$RF$T$u20$as$u20$core..fmt..Debug$GT$3fmt17h86eee54132a80a7aE
                 U _ZN42_$LT$$RF$T$u20$as$u20$core..fmt..Debug$GT$3fmt17hde65a14d000f637dE
                 U _ZN4core3fmt8builders10DebugTuple5field17h83d3c03943a0923aE
                 U _ZN4core3fmt8builders10DebugTuple6finish17h7f5c64cfea21e040E
                 U _ZN4core3fmt9Formatter11debug_tuple17hf4b58c7fd2af7471E
0000000000000000 t _ZN4core3ptr13drop_in_place17h94a781421553b149E
                 U _ZN4core5slice20slice_index_len_fail17he661f5dd1689ef3bE
                 U _ZN4core5slice22slice_index_order_fail17ha18df75636493875E
                 U _ZN4core9panicking18panic_bounds_check17h111737a547b9c0f6E
                 U _ZN4core9panicking9panic_fmt17h1ac71ad045d55416E
                 U _ZN59_$LT$core..fmt..Arguments$u20$as$u20$core..fmt..Display$GT$3fmt17hc474e0aeb8529411E
0000000000000000 T _ZN69_$LT$boringtun..crypto..blake2s..Vec4$u20$as$u20$core..fmt..Debug$GT$3fmt17h967ed4de1972e314E
0000000000000000 T _ZN80_$LT$boringtun..crypto..blake2s..Vec4$u20$as$u20$core..ops..arith..AddAssign$GT$10add_assign17ha6afe8f9599eec25E
0000000000000000 T _ZN81_$LT$boringtun..crypto..blake2s..Vec4$u20$as$u20$core..ops..bit..BitXorAssign$GT$13bitxor_assign17hee9089db0c5ea243E
0000000000000000 T _ZN89_$LT$boringtun..crypto..blake2s..Vec4$u20$as$u20$core..ops..bit..ShrAssign$LT$u32$GT$$GT$10shr_assign17he77459210b97d943E
0000000000000000 T _ZN9boringtun6crypto7blake2s23constant_time_mac_check17h549413b6064fb613E
0000000000000000 r _ZN9boringtun6crypto7blake2s2IV17h88bd7f391b8c99f6E
0000000000000000 T _ZN9boringtun6crypto7blake2s7Blake2s10hash_block17h70fe7cd1bec36ba5E.llvm.2086496205581072300
0000000000000000 T _ZN9boringtun6crypto7blake2s7Blake2s3new17h51abdb7a6eff04fdE.llvm.2086496205581072300
0000000000000000 T _ZN9boringtun6crypto7blake2s7Blake2s4hash17h0f1357d0f5d2adeeE
0000000000000000 T _ZN9boringtun6crypto7blake2s7Blake2s7new_mac17h79a7087d828ec7d8E
0000000000000000 T _ZN9boringtun6crypto7blake2s7Blake2s8finalize17hc4ef1ce2381c68c3E
0000000000000000 T _ZN9boringtun6crypto7blake2s7Blake2s8new_hash17h4d0af0bf2973a01bE
0000000000000000 T _ZN9boringtun6crypto7blake2s7Blake2s8new_hmac17h2c60c2ae768475abE
gendx commented 4 years ago

Interesting analysis @saurik! Yes I agree that this is worth re-opening rust-lang/rust#71361.

Regarding google/OpenSK#94, for now we've agreed to have two sets of reproduced binaries, those built on Linux and those built on MacOS. Ideally it would be preferable for us to have this issue fixed upstream in Cargo rather than having to fiddle with the compilation scripts (as it may be more brittle).

Also, we only tried to reproduce binaries in a stripped form obtained via https://github.com/tock/elf2tab (no ELF packaging, no symbols, only relevant sections are extracted).

saurik commented 4 years ago

OK, so now I'm failing to replicate the symbol/section shuffling, so maybe I did something unclean there while doing a bunch of local tests... (if so, I'm sorry)? Though I also don't remember what target I was using to test that at the time; I have noticed there is a difference i'm getting with Android (which is what I thought I was using before) as the path of jnienv.rs is ending up inside of libboringtun.a, but that is something I bet I can fix with some remap flags that I've so far not had to use. I thereby actually have thankfully failed so far to find any issue in rust itself (I figured since I had invested some time into this already I'd go and come up with some simple reproduction for the other closed issue thread, even though it wasn't affecting me, but then thankfully failed ;P).

Note that I am getting .o filename changes in the resulting .a files; again, thankfully, this doesn't affect me as I only care about the final linked binary... but this would totally affect other people who are intending to ship the .a file, and having all of these intermediate files be different definitely makes debugging reproduction issues harder (and I will make that same argument about having the host toolchain hashed into extra-filename also: that should just be differentiated by the parent folder under target). Regardless, putting people in the position where they can start trying to find the more subtle issues by removing the host toolchain type from the hash (even if it is only behind some temporary unstable flag) seems very useful.

hudson-ayers commented 2 years ago

It might be nice (for use cases like Tock's) to just have a cargo flag that would disable -C metadata=hash being passed to rustc at all

brson commented 3 months ago

Is this closed by https://github.com/rust-lang/cargo/pull/14107?

brson commented 2 months ago

In my testing, https://github.com/rust-lang/cargo/pull/14107 only partially addresses this issue.

I am still seeing that proc macros and build scripts leak their host-specific metadata into wasm-target metadata. This affects projects that use e.g. wasm-bindgen (via its build script and its wasm-bindgen-macros dependency).

As a naive strawman, this diff excludes such dependencies and results in the same metadata for a crate that depends on wasm-bindgen, when compiled on either macos or linux:

--- a/src/cargo/core/compiler/build_runner/compilation_files.rs
+++ b/src/cargo/core/compiler/build_runner/compilation_files.rs
@@ -596,6 +596,7 @@ fn compute_metadata(
     let mut deps_metadata = build_runner
         .unit_deps(unit)
         .iter()
+        .filter(|dep| unit.kind == dep.unit.kind)
         .map(|dep| metadata_of(&dep.unit, build_runner, metas).meta_hash)
         .collect::<Vec<_>>();
     deps_metadata.sort();

I am sure that is not an acceptable solution as it excludes any information at all about dependencies that deviate from the "current" compilation kind, but if someone could give me guidance on what would be acceptable I am willing to work toward a patch.

cc @ehuss @alexcrichton

Tested with cargo fa646583675d7c140482bd906145c71b7fb4fc2b

alexcrichton commented 2 months ago

I think that might actually be along the lines of what to do here? It's been awhile for me and @ehuss would probably know better, but IIRC there's a decoupling of -Cmetadata and rebuild tracking with fingerprints where for example RUSTFLAGS changes trigger rebuilds but don't change -Cmetadata (or maybe I'm misremembering here). Your patch at least passes tests locally although that doesn't always capture the full matrix of cross-compiles and such.

Regardless I think something like that is the way to go, perhaps with some comments around the comparison and maybe only dropping Target != Host and otherwise including everything else.