rust-lang / crater

Run experiments across parts of the Rust ecosystem!
https://crater.rust-lang.org
629 stars 88 forks source link

Tons of spurious failures due to build scripts failing to run rustc #663

Open RalfJung opened 2 years ago

RalfJung commented 2 years ago

The report at https://crater-reports.s3.amazonaws.com/pr-100043/index.html is 100% spurious regressions. Almost all of them are due to the libc build script failing to detect the rustc version, a bunch are due to other build scripts failing to run rustc, and then there are a few other random strange things (in syn some trait is supposedly not implemented any more, and the rest is handled by https://github.com/rust-lang/crater/pull/664).

ehuss commented 2 years ago

I don't recall exactly how crater/rustwide work, but I believe they use rustup. If so, and if the crater machine has been updated to rustup 1.25, then this is likely caused by https://github.com/rust-lang/rustup/issues/3031.

RalfJung commented 2 years ago

That sounds like a deterministic problem, whereas this build script failure only occurs for a tiny fraction of all crates that use libc. (At least I would be very surprised if only 600 crates used libc.)

Mark-Simulacrum commented 2 years ago

Yes, and I manually verified that libc itself built fine on master and on try, so it's definitely not fully deterministic. It might be worth trying to run these through again to try and re-confirm and see if failures are deterministic for this subset of ~600.

Mark-Simulacrum commented 2 years ago

Interestingly, based on unpacking the all.tar.gz from both #99447 and #100043, and running rg 'Failed to get rustc version' | rg -v 'try#', it seems like only try# builds are affected. We run the start toolchain "check" task on the same worker thread & machine as the end toolchain, and in that order.

One run failed twice in libc (https://crater-reports.s3.amazonaws.com/pr-100043/try%23a8425075041d4b9af9bf32e2a267dfb9db2cfffb/gh/dalloriam.worktimer/log.txt), presumably there's a build-dep on libc and a regular dependency there? But not sure there's much information we can gain from that.

RalfJung commented 2 years ago

In the rerun, the libc issue still occurs a few times, but the majority of crates passes now. So, definitely non-deterministic.

To me this really looks like it just failed to spawn the process due to resource limits...

Mark-Simulacrum commented 2 years ago

We run on a large number of machines and don't regularly collect detailed logs from each one right now, but I'm not sure what kind of resource limit could have caused this to just recently start happening. We are adding new machine types to our build fleet -- but those aren't being oversubscribed to any great extent, at least from my perspective.

I'm going to try and locate which machines are having this error and pull logs, but our machines are ephemeral, so that might be a little hard.

Mark-Simulacrum commented 2 years ago

I located one machine which had a few cases of this, but I don't see any adjacent suspicious events in the syslog (e.g., OOM kill or similar). It's possible it's another kind of resource exhaustion (file descriptor limits or whatever), though.

I also queried the all.tar logs for which workers had problems, and there is a suspicious correlation there, presuming I'm getting reasonable data. (Note that these workers may be on different machines: I'm not sure there's any good way to recover that from the current logs and other data we have, though it seems like a good idea to add some support for that).

worker-9 I think has to be on the larger, oversubscribed, Azure machines -- the GCP machines only have 8 workers. I definitely found 3 cases on one of the GCP machines though, so it's not strictly limited.

In any case, given these results for now I've turned off the Azure capacity -- currently it looks like I can't directly ssh onto those machines, but they are definitely oversubscribed so I'd not be too surprised if there's some resourcing issue there. I don't think it fully explains the problem (we should see errors on both master# and try#, IMO, and have seen a similar rate historically if not lower now since we have more GCP capacity than we did).

$ for file in $(rg 'Failed to get rustc version' -l); do rg -o 'worker-\d+' $file; done | sort | uniq -c
     97 worker-5
   3503 worker-6
   2394 worker-9

Edit: and the worker counts don't seem related to the number of jobs that worker executed, at least for worker-9 -- so I don't think we can explain this with just "that worker is executing more work". worker-6 though is more of an unknown.

$ rg --no-filename -o 'worker-\d+' | sort | uniq -c | sort -nr
 660359 worker-7
 658404 worker-4
 658289 worker-6
 658284 worker-0
 657983 worker-5
 657168 worker-3
 656586 worker-1
 656361 worker-2
  40224 worker-9
  39600 worker-8
  38991 worker-14
  38157 worker-13
  37966 worker-10
  37608 worker-11
  37066 worker-12
Mark-Simulacrum commented 2 years ago

https://github.com/rust-lang/rust/pull/100046#issuecomment-1213518211 is another crater run, mostly run after we shut down the Azure (oversubscribed) instances. It looks like it had ~9 libc issues; I didn't look for other cases.

ehuss commented 2 years ago

My script has a few other patterns that will match other build scripts and found 38 instances in pr-100046:

build-fail/gh/charlag/rusty-bird/5d388381d1f36b4bd3182c3b41ecc030716e07df/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/charlag/rusty-bird/5d388381d1f36b4bd3182c3b41ecc030716e07df/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: panicked at 'unexpected -vV format
build-fail/gh/danpersa/rusty-team/af69f4c11df3e4fe555c31b0eac218414e4d0c73/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/dkastner/vault_to_env/5a4ac46ea4a5a0302a27369d6ab29b9f8b1bedec/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/fenollp/nnrs/0ecc83d6b83c12552ce1a667aaf1cbef213015d2/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/ikruglov/grpcdemo2rust/112b0f17048d2a3a51a7095e74e706d9da24b55c/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/nuts/scale_server_rust/acc37275442877fc69ab9aa802d11f4054828600/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/nuts/scale_server_rust/acc37275442877fc69ab9aa802d11f4054828600/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: panicked at 'unexpected -vV format
build-fail/gh/onur/at-rust/275281522d3eafc747466ea29aac31f990ca05ec/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/portaloffreedom/remote-music/94867f8d5f42ab604c02ab357eec3150b63ae01a/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/portaloffreedom/remote-music/94867f8d5f42ab604c02ab357eec3150b63ae01a/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: panicked at 'unexpected -vV format
build-fail/gh/rrader/EventBaboon/0ba91ee13bbc19ed871f6aea2cc780f8e9538d39/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/rrader/EventBaboon/0ba91ee13bbc19ed871f6aea2cc780f8e9538d39/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: panicked at 'unexpected -vV format
build-fail/gh/rust-los-angeles/stockfighter-skeleton-app-rs/a1a092c937ed0c977fa105f6a0b54facb18be410/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/rust-los-angeles/stockfighter-skeleton-app-rs/a1a092c937ed0c977fa105f6a0b54facb18be410/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: panicked at 'unexpected -vV format
build-fail/gh/schultyy/avm/82ab688d2f0053bf7f4bd7edda7ba1c85b51371e/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/schultyy/avm/82ab688d2f0053bf7f4bd7edda7ba1c85b51371e/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: panicked at 'unexpected -vV format
build-fail/gh/wraithan/prime-challenge/325a0214cc7cc7a7f69db60e58e381f350a90bf5/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/wraithan/prime-challenge/325a0214cc7cc7a7f69db60e58e381f350a90bf5/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: panicked at 'unexpected -vV format
build-fail/gh/zackify/silo/1550cb7cd7ad759b9ab29c0fb3726cc0767ac1ff/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/zackify/silo/1550cb7cd7ad759b9ab29c0fb3726cc0767ac1ff/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: panicked at 'unexpected -vV format
build-fail/gh/zcash-hackworks/pay-to-sudoku/e80e165ab50495606c59edded993e193dc506dc0/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/zcash-hackworks/pay-to-sudoku/e80e165ab50495606c59edded993e193dc506dc0/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: panicked at 'unexpected -vV format
build-fail/reg/provwasm-tutorial/0.1.2/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
build-fail/reg/pwrs/1.1.0/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
build-fail/reg/replicante_agent/0.5.0/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
build-fail/reg/rlp/0.5.1/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
build-fail/reg/safety-rules/0.2.2/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
build-fail/reg/skywalking_rust/0.1.0/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
regressed/gh/a3ch/jureba_bot/343c04dd4b945f728be41506a34335ede871c37f/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
regressed/reg/qecp/0.1.6/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
regressed/reg/rawbson/0.2.1/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
regressed/reg/redislock/1.3.0/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
regressed/reg/rust_ledger/0.8.0/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
regressed/reg/secp256kfun/0.7.1/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
regressed/reg/shopless-types/0.10.0/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
regressed/reg/silver_animation/0.1.0/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
regressed/reg/smart-wallet/0.11.1/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc

Which is on-par with what was seen previously.

Copying some comments from the zulip thread for posterity:


I scanned the old logs. This seems to have been an issue since at least February (the oldest one I checked).

One strange thing I noticed in pr-100043 (Ralf's PR) and pr-99880 is that there is one build script which prints the stderr when it fails to run rustc and it printed:

/opt/rustwide/rustup-home/toolchains/a8425075041d4b9af9bf32e2a267dfb9db2cfffb/bin/rustc: symbol lookup error: /opt/rustwide/rustup-home/toolchains/a8425075041d4b9af9bf32e2a267dfb9db2cfffb/lib/librustc_driver-c25a516c3d8f3eb8.so: undefined symbol: _ZN9hashbrown3raw11Fallibility17capacity_overflow17hf1df634019334dcdE

In all 7 cases it is the same symbol error.

matthiaskrgr was also observing that same exact error at https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/master.20toolchain.20broken.3F/near/291873065 with icemaker.


Number of "failed to launch rustc -V" -like errors over time:

February is just the last one you gave me in the list, I assume it has been happening before then.

Here's the frequency over time:

2022-02-10 31 2022-02-15 22 2022-02-18 30 2022-02-21 34 2022-02-26 23 2022-03-01 15 2022-03-03 20 2022-03-07 29 2022-03-10 18 2022-03-14 37 2022-03-18 25 2022-03-25 21 2022-03-27 9 2022-04-02 32 2022-04-05 26 2022-04-07 16 2022-04-10 21 2022-04-13 24 2022-04-16 25 2022-04-18 16 2022-04-21 19 2022-04-23 22 2022-04-27 28 2022-04-30 29 2022-05-05 34 2022-05-10 19 2022-05-11 21 2022-05-15 17 2022-05-19 22 2022-05-23 30 2022-05-26 23 2022-05-29 25 2022-06-02 32 2022-06-05 34 2022-06-13 29 2022-06-14 28 2022-06-17 29 2022-06-20 29 2022-06-22 28 2022-06-23 31 2022-06-25 27 2022-06-28 24 2022-07-01 34 2022-07-05 31 2022-07-08 22 2022-07-12 13 2022-07-16 37 2022-07-17 16 2022-07-20 31 2022-07-22 13 2022-07-24 9 2022-07-26 8 2022-07-30 27 2022-08-01 114 2022-08-04 26 2022-08-06 388 2022-08-07 30 2022-08-08 22 2022-08-09 965 2022-08-10 53


Raw per-worker occurrence data: https://gist.github.com/ehuss/2a0ebce9fe9bc1440705f6fe86fe3094

Mark-Simulacrum commented 1 year ago

One crate which has incorrect parsing logic: https://gitlab.com/leonhard-llc/ops/-/blob/main/build-data/src/lib.rs#L379 (found in https://crater-reports.s3.amazonaws.com/beta-1.66-1.2/beta-2022-11-01/gh/c0repwn3r.mangrove/log.txt).

RalfJung commented 1 year ago

One unfortunate side-effect of this is that it can mask actual regressions -- e.g. in https://crater-reports.s3.amazonaws.com/pr-104429-1/index.html, skreborn.bluetooth-sys.9d5b3dcba1a98824bb765ca5f69a1a6c789589b4 should be listed as a regular regression but instead its build script failed.