Open jaraco opened 1 month ago
Wow. Something is really messed up here. Trying to investigate in lldb and I'm already lost:
(lldb) run default stable
Process 3528 launched: '/root/.cargo/bin/rustup' (x86_64)
Process 3528 stopped
* thread #1, name = 'rustup', stop reason = signal SIGSTOP
frame #0: 0xffffffffffffffff
(lldb) c
Process 3528 resuming
info: syncing channel updates for 'stable-x86_64-unknown-linux-gnu'
info: latest update on 2024-05-02, rust version 1.78.0 (9b00956e5 2024-04-29)
info: downloading component 'cargo'
info: downloading component 'clippy'
info: downloading component 'rust-docs'
info: downloading component 'rust-std'
info: downloading component 'rustc'
info: downloading component 'rustfmt'
info: installing component 'cargo'
info: installing component 'clippy'
info: installing component 'rust-docs'
Process 3528 stopped
* thread #1, name = 'rustup', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x20)
frame #0: 0xffffffffffffffff
The segfault occurs during installation of docs. You can get a working toolchain if you use the minimal profile instead of the default.
The segfault during rust-docs installation goes away if I run rustup under strace or valgrind.
gdb crashes:
Starting program: /root/.cargo/bin/rustup default stable
warning: linux_ptrace_test_ret_to_nx: Cannot PTRACE_GETREGS: Input/output error
warning: linux_ptrace_test_ret_to_nx: PC 0x900000000 is neither near return address 0x7ffff8f1a000 nor is the return instruction 0x555555965971!
Couldn't get CS register: Input/output error.
And rr also crashes:
root@533d5273fa2a:~# rr record rustup default stable
rr: Saving execution to trace directory `/root/.local/share/rr/rustup-0'.
[FATAL ./src/Task.cc:3207:ptrace_if_stopped() errno: EIO]
(task 4113 (rec:4113) at time 1)
-> Assertion `!errno' failed to hold. ptrace(PTRACE_GETREGS, 4113, addr=0, data=0x7ffffffc4700) failed with errno 5
Tail of trace dump:
=== Start rr backtrace:
rr(_ZN2rr13dump_rr_stackEv+0x5e)[0x555555715e2e]
rr(_ZN2rr9GdbServer15emergency_debugEPNS_4TaskE+0x161)[0x555555601b01]
rr(+0xbf416)[0x555555613416]
rr(+0xc00ec)[0x5555556140ec]
rr(_ZN2rr4Task17ptrace_if_stoppedEiNS_10remote_ptrIvEEPv+0x115)[0x5555556eb0c5]
rr(_ZN2rr4Task11did_waitpidENS_10WaitStatusE+0x58f)[0x5555556ef2cf]
rr(_ZN2rr4Task5spawnERNS_7SessionERNS_8ScopedFdEPS3_S5_PiRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorISC_SaISC_EESJ_i+0x830)[0x5555556f2d80]
rr(_ZN2rr13RecordSessionC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorIS6_SaIS6_EESD_RKNS_20DisableCPUIDFeaturesENS0_16SyscallBufferingEiNS_7BindCPUES8_PKNS_9TraceUuidEbb+0x2e1)[0x555555643751]
rr(_ZN2rr13RecordSession6createERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EESB_RKNS_20DisableCPUIDFeaturesENS0_16SyscallBufferingEhNS_7BindCPUERKS7_PKNS_9TraceUuidEbbbb+0xa91)[0x555555644921]
rr(_ZN2rr13RecordCommand3runERSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0x46f)[0x5555556372ff]
rr(main+0x166)[0x5555555a21f6]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x7fffff06f1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7fffff06f28b]
rr(_start+0x25)[0x5555555a3975]
=== End rr backtrace
I think this is a kernel bug. The installation of the rust-docs component uses a lot of threads banging away at the filesystem laying down a lot of small files. If I had to guess, there's a data race somewhere in the whole Docker emulation setup that is corrupting the rustup process.
I suggest you file a bug with Docker.
Thanks for the analysis. I've reported the issue upstream and also learned that downgrading to Docker 4.28 works around the issue as does disabling Rosetta or reducing the cores to 1. At least one user with an M3 Max chip is unable to replicate the issue.
The issue is rather elusive. It's intermittent and difficult to replicate. Based on the analysis above, do you have any suggestions on a more minimal reproducer that might trigger faster and possibly more reliably?
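One possible starting point for a smaller reproducer, assuming the hypothesis above is right (many threads writing many small files at once), is a stress script along these lines. This is only a sketch: the names `stress` and `write_small_files`, the thread count, and the file sizes are all illustrative assumptions and are not taken from rustup, and it has not been confirmed to trigger the crash.

```python
import os
import tempfile
import threading


def write_small_files(root, worker_id, files_per_worker, payload):
    """Write many small files into a per-worker subdirectory."""
    worker_dir = os.path.join(root, f"worker-{worker_id}")
    os.makedirs(worker_dir, exist_ok=True)
    for i in range(files_per_worker):
        path = os.path.join(worker_dir, f"file-{i}.html")
        with open(path, "wb") as fh:
            fh.write(payload)


def stress(root, workers=8, files_per_worker=500, payload=b"x" * 1024):
    """Run the writers concurrently and return the number of files created."""
    threads = [
        threading.Thread(
            target=write_small_files,
            args=(root, w, files_per_worker, payload),
        )
        for w in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Count what actually landed on disk as a basic sanity check.
    return sum(len(files) for _, _, files in os.walk(root))


if __name__ == "__main__":
    # Hypothetical defaults; raise workers/files_per_worker to increase pressure.
    with tempfile.TemporaryDirectory() as tmp:
        print(stress(tmp), "files written")
```

Running it in a loop inside a container started with --platform linux/amd64, and again with --platform linux/arm64, would show whether the filesystem load alone is enough to crash a process under emulation or whether something specific to rustup is involved.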
I have the same issue using Amazon Linux 2023. The build passes without the platform argument, but fails when I pass --platform linux/amd64.
I am building on macOS with an M1 Pro.
@imbolc I had a similar problem on a MacBook M1 Pro with Docker Desktop 4.28.0. I disabled Rosetta and it worked. Are there any plans to get it to work without disabling Rosetta?
Same issue here m1 max + docker 4.31.0
Same issue here, m3 max + docker desktop 4.30.0
I encountered this while rebuilding jaraco/multipy-tox on Linux AMD64.
More simply replicated using Docker Desktop on macOS 14.5 on Apple Silicon:
The command fails with this in the output:
If I run the same command without --platform linux/amd64 (or pass --platform linux/arm64), the command completes, so it seems something about building on Linux AMD64 under virtualization causes the install to crash. This same routine worked fine about three weeks ago.