rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

fatal runtime error: assertion failed: output.write(&bytes).is_ok() #125952

Open dpc opened 4 months ago

dpc commented 4 months ago

So I was recompiling a TypeScript Next.js project in a Nix derivation that previously worked (Nix builds are kind of reproducible, so that's very unexpected), and it failed with a weird error:

guardian-ui> @fedimint/types:build: cache miss, executing a873a14720ccc5d2
guardian-ui>  WARNING  passwd database shell="/noshell" which is not executable (ENOENT: No such file or directory), falling back to /bin/sh
guardian-ui> @fedimint/types:build: fatal runtime error: assertion failed: output.write(&bytes).is_ok()
guardian-ui> @fedimint/types:build:
guardian-ui>  Tasks:    0 successful, 1 total
guardian-ui> Cached:    0 cached, 1 total
guardian-ui>   Time:    1.095s

I suspect my kernel version might be different because I just upgraded to NixOS 24.05 recently.

I traced this panic to https://github.com/rust-lang/rust/blob/1689a5a531f1fe404944ed8c3ac6cb85a2cff7e0/library/std/src/sys/pal/unix/process/process_unix.rs#L125

I got a strace output:

guardian-ui> [pid   675] setsid()                    = 675
guardian-ui> [pid   675] ioctl(0, TIOCSCTTY, 0)      = 0
guardian-ui> [pid   675] open("/dev/fd", O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_DIRECTORY) = 17
guardian-ui> [pid   675] fcntl(17, F_SETFD, FD_CLOEXEC) = 0
guardian-ui> [pid   675] getdents64(17, 0x7ffff7b2f4b8 /* 22 entries */, 2048) = 528
guardian-ui> [pid   675] getdents64(17, 0x7ffff7b2f4b8 /* 0 entries */, 2048) = 0
guardian-ui> [pid   675] close(5)                    = 0
guardian-ui> [pid   675] close(6)                    = 0
guardian-ui> [pid   675] close(7)                    = 0
guardian-ui> [pid   675] close(8)                    = 0
guardian-ui> [pid   675] close(9)                    = 0
guardian-ui> [pid   675] close(10)                   = 0
guardian-ui> [pid   675] close(11)                   = 0
guardian-ui> [pid   675] close(12)                   = 0
guardian-ui> [pid   675] close(13)                   = 0
guardian-ui> [pid   675] close(14)                   = 0
guardian-ui> [pid   675] close(15)                   = 0
guardian-ui> [pid   675] close(16)                   = 0
guardian-ui> [pid   675] close(17)                   = -1 EBADF (Bad file descriptor)
guardian-ui> [pid   675] close(18)                   = 0
guardian-ui> [pid   601] <... recvfrom resumed>"", 8, 0, NULL, NULL) = 0
guardian-ui> [pid   675] close(24 <unfinished ...>
guardian-ui> [pid   601] close(17 <unfinished ...>
guardian-ui> [pid   675] <... close resumed>)        = 0
guardian-ui> [pid   675] execve("/build/yarn--1717471544555-0.05173506673993855/yarn", ["/build/yarn--1717471544555-0.051"..., "run", "build"], 0x7ffff7e5b1a0 /* 135 vars */ <unfinished ...>
guardian-ui> [pid   601] <... close resumed>)        = 0
guardian-ui> [pid   675] <... execve resumed>)       = -1 ETXTBSY (Text file busy)
guardian-ui> [pid   675] write(18, "\0\0\0\32NOEX", 8) = -1 EBADF (Bad file descriptor)
guardian-ui> [pid   675] write(2, "fatal runtime error: assertion f"..., 68) = 68
guardian-ui> [pid   675] rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1 RT_2], [], 8) = 0
guardian-ui> [pid   675] tkill(675, SIGABRT)         = 0
guardian-ui> [pid   675] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
guardian-ui> [pid   675] --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=675, si_uid=1000} ---
guardian-ui> [pid   601] close(14)                   = 0
guardian-ui> [pid   601] close(15)                   = 0
guardian-ui> [pid   601] close(16)                   = 0
guardian-ui> [pid   601] fcntl(11, F_DUPFD_CLOEXEC, 0) = 14
guardian-ui> [pid   601] fcntl(14, F_SETFD, FD_CLOEXEC) = 0
guardian-ui> [pid   601] fcntl(11, F_DUPFD_CLOEXEC, 0) = 15
guardian-ui> [pid   601] fcntl(15, F_SETFD, FD_CLOEXEC) = 0
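For reference, the 8-byte payload in the failing `write(18, "\0\0\0\32NOEX", 8)` call can be decoded by hand. The trace shows fd 18 being closed by the `close(18)` call before `execve` fails, so the later write to it returns EBADF. The sketch below assumes the layout visible in the trace: a 4-byte big-endian errno followed by the 4-byte footer `NOEX`. The helper names `encode_exec_error` and `decode_exec_error` are made up for illustration and are not std APIs.

```rust
use std::io;

/// Footer that follows the errno in the 8-byte message seen in the trace.
const FOOTER: [u8; 4] = *b"NOEX";

/// Hypothetical helper: build the 8-byte message (big-endian errno, then footer).
fn encode_exec_error(errno: i32) -> [u8; 8] {
    let mut msg = [0u8; 8];
    msg[..4].copy_from_slice(&(errno as u32).to_be_bytes());
    msg[4..].copy_from_slice(&FOOTER);
    msg
}

/// Hypothetical helper: decode the message, returning `None` if the footer does not match.
fn decode_exec_error(msg: &[u8; 8]) -> Option<i32> {
    let (errno, footer) = msg.split_at(4);
    if footer != &FOOTER[..] {
        return None;
    }
    Some(i32::from_be_bytes([errno[0], errno[1], errno[2], errno[3]]))
}

fn main() {
    // Bytes from the strace line; `\32` in strace output is octal, i.e. decimal 26.
    let from_trace: [u8; 8] = [0, 0, 0, 0o32, b'N', b'O', b'E', b'X'];
    let errno = decode_exec_error(&from_trace).expect("footer should match");
    // On Linux, errno 26 is ETXTBSY, matching the failed execve in the trace.
    println!("errno = {errno}: {}", io::Error::from_raw_os_error(errno));
    // Round trip: encoding the same errno reproduces the traced bytes.
    assert_eq!(encode_exec_error(errno), from_trace);
}
```

So the child was trying to report the ETXTBSY failure from `execve` back to its parent, and the report itself failed because the descriptor it wanted to write to was already closed.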

I'm not sure where to even report it, and I'm a bit too tired to dig deeper. Creating the issue just for reference.

The whole thing can be reproduced with:

nix build 'github:fedimint/ui?rev=1fc0cc6322f4ebb0f0854cd870b79c9971ff4b34#guardian-ui'

I'm going to try it on some machines and see when it fails and when it works.

dpc commented 4 months ago

On a machine I had lying around, running Ubuntu with Nix on Linux 5.15.0-101-generic, this works. On my two systems with new NixOS and Linux 6.9.2, it fails.

dpc commented 4 months ago

I have verified that downgrading to Linux kernel 6.8.11 makes the problem go away.

tbu- commented 4 months ago

> Nix builds are kind of reproducible, so that's very unexpected

Nix does not take kernel version into account in its reproducibility guarantees.

dpc commented 4 months ago

Yes, everything else is locked in place (kind of). That's why I immediately suspected the kernel might be a problem.

So to sum up: something in very recent Linux kernel versions is breaking some assumptions in the Rust standard library's fork/exec code, which leads to this internal panic. It's hard for me to tell whether it is a kernel regression, whether Rust's stdlib assumptions were incorrect, or whether I'm missing something else entirely.

I happen to be seeing it because I'm running the most recent kernel version NixOS can provide, trying to avoid some bcachefs bugs. With time the problem might become more widespread.

workingjubilee commented 4 months ago

Ah, a relatively small diff, then! Should be easy to find the offending commit. https://github.com/torvalds/linux/compare/f610c358956229b7e5180f8c1147725d989f6b0d...c8eef17

dpc commented 4 months ago


:thinking: Dozens of rebuild + reboot cycles... I'll see if I can find the time to do it. No promises. :D