rust 1.77.1 fails to build on aarch64-unknown-netbsd with stack exhaustion

he32 commented 7 months ago

I tried building rust 1.77.1 using the internal LLVM on an emulated aarch64 system, as part of an effort of putting a new rust version through its testing cycle to keep it working on our various NetBSD platforms.

I expected to see this happen: The build should complete

Instead, this happened: The build of 1.77.1 fails with stack exhaustion. As a contrast, 1.76.0 succeeds on this same host.

Meta

rustc --version --verbose:

<version>

As this is in the middle of the build, it's a little unclear which version of rustc is running on this point, though indications are that it's 1.77.1 in one of the bootstrap stages. Also, trying to get the bootstrap compiler to run from the CLI is also proving challenging:

arm64: {7} /usr/pkgsrc/wip/rust177/work/rustc-1.77.1-src/build/bootstrap/debug/rustc --version
rustc shim: FATAL: RUSTC_STAGE was not set
rustc shim: NOTE: use `x.py build -vvv` to see all environment variables set by bootstrap
arm64: {8} env LD_LIBRARY_PATH=work/rustc-1.77.1-src/build/aarch64-unknown-netbsd/stage1/lib/ tcsh
arm64: {1} /usr/pkgsrc/wip/rust177/work/rustc-1.77.1-src/build/bootstrap/debug/rustc --version
rustc shim: FATAL: RUSTC_STAGE was not set
rustc shim: NOTE: use `x.py build -vvv` to see all environment variables set by bootstrap
arm64: {2} env RUSTC_STAGE=1 /usr/pkgsrc/wip/rust177/work/rustc-1.77.1-src/build/bootstrap/debug/rustc --version
thread 'main' panicked at src/bin/rustc.rs:57:48:
RUSTC_SYSROOT was not set
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
arm64: {3} exit

I hestitate doing x.py build -vvv, both because it iself probably requies own settings / environment variables, and for fear that will turn into a multi-hour endevour.

The build log ends with:

   Compiling gsgdt v0.1.2
     Running `/usr/pkgsrc/wip/rust177/work/rustc-1.77.1-src/build/bootstrap/debug/rustc /usr/pkgsrc/wip/rust177/work/rustc-1.77.1-src/build/bootstrap/debug/rustc --crate-name gsgdt --edition=2018 /usr/pkgsrc/wip/rust177/work/rustc-1.77.1-src/vendor/gsgdt/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 -C embed-bitcode=no -Zunstable-options --check-cfg 'cfg()' -C metadata=4c45b577cc1da5ed -C extra-filename=-4c45b577cc1da5ed --out-dir /usr/pkgsrc/wip/rust177/work/rustc-1.77.1-src/build/aarch64-unknown-netbsd/stage1-rustc/aarch64-unknown-netbsd/release/deps --target aarch64-unknown-netbsd -L dependency=/usr/pkgsrc/wip/rust177/work/rustc-1.77.1-src/build/aarch64-unknown-netbsd/stage1-rustc/aarch64-unknown-netbsd/release/deps -L dependency=/usr/pkgsrc/wip/rust177/work/rustc-1.77.1-src/build/aarch64-unknown-netbsd/stage1-rustc/release/deps --extern serde=/usr/pkgsrc/wip/rust177/work/rustc-1.77.1-src/build/aarch64-unknown-netbsd/stage1-rustc/aarch64-unknown-netbsd/release/deps/libserde-a7ca7ee3064e8334.rmeta --cap-lints allow --cfg=windows_raw_dylib -Csymbol-mangling-version=v0 -Zunstable-options '--check-cfg=cfg(bootstrap)' '--check-cfg=cfg(parallel_compiler)' '--check-cfg=cfg(no_btreemap_remove_entry)' '--check-cfg=cfg(crossbeam_loom)' '--check-cfg=cfg(span_locations)' '--check-cfg=cfg(rustix_use_libc)' '--check-cfg=cfg(emulate_second_only_system)' '--check-cfg=cfg(windows_raw_dylib)' -Zmacro-backtrace -Clink-args=-Wl,-z,origin -Clink-args=-Wl,-rpath,/usr/pkg/lib -Zunstable-options -Csplit-debuginfo=off -Zunstable-options '-Wrustc::internal' -Cprefer-dynamic -Z binary-dep-depinfo`

thread 'rustc' has overflowed its stack
fatal runtime error: stack overflow
rustc exited with signal: 6 (SIGABRT) (core dumped)

Unfortunately, gdb decided not to play ball with this one:

arm64: {3} gdb /usr/pkgsrc/wip/rust177/work/rustc-1.77.1-src/build/bootstrap/debug/rustc work/rustc-1.77.1-src/rustc.core
GNU gdb (GDB) 8.3
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "aarch64--netbsd".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
BFD: warning: /usr/pkgsrc/wip/rust177/work/rustc-1.77.1-src/build/bootstrap/debug/rustc: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0000000
Reading symbols from /usr/pkgsrc/wip/rust177/work/rustc-1.77.1-src/build/bootstrap/debug/rustc...
[New process 2]
[New process 1]
Core was generated by `rustc'.
Program terminated with signal SIGABRT, Aborted.
#0  0x0000fb48f375a32c in ?? ()
[Current thread is 1 (process 2)]
warning: Unsupported auto-load script at offset 0 in section .debug_gdb_scripts
of file /usr/pkgsrc/wip/rust177/work/rustc-1.77.1-src/build/bootstrap/debug/rustc.
Use `info auto-load python-scripts [REGEXP]' to list them.
(gdb) i reg
x0             0x0                 0
x1             0x0                 0
x2             0x1                 1
x3             0x0                 0
x4             0x10                16
x5             0xffffffffffffffc0  -64
x6             0xfb48f21fff80      276290718400384
x7             0x7f7f7f7f7f7f7f7f  9187201950435737471
x8             0x0                 0
x9             0x1003              4099
x10            0xc                 12
x11            0x1                 1
x12            0x3ed23c8801c       4317042475036
x13            0x3ed23c88035       4317042475061
x14            0x3f                63
x15            0xfb48f2200d70      276290718403952
x16            0xfb48f37c5d50      276290741230928
x17            0xfb48f375a328      276290740790056
x18            0xfb48ec23e431      276290617992241
x19            0xfb48f31bab10      276290734893840
x20            0xfb48f31babd0      276290734894032
x21            0xfb48f22011f0      276290718405104
x22            0xfb48f319e010      276290734776336
x23            0xfb48f319e028      276290734776360
--Type <RET> for more, q to quit, c to continue without paging--
x24            0x1                 1
x25            0x0                 0
x26            0xfb48f2200648      276290718402120
x27            0x5                 5
x28            0xfb48f31bac50      276290734894160
x29            0xfb48f22005e0      276290718402016
x30            0xfb48f3759ff4      276290740789236
sp             0xfb48f31bab00      0xfb48f31bab00
pc             0xfb48f375a32c      0xfb48f375a32c
cpsr           0x80000000          [ EL=0 N ]
fpsr           0x0                 0
fpcr           0x10                16
(gdb) x/i 0xfb48f375a32c
=> 0xfb48f375a32c:        Cannot access memory at address 0xfb48f375a32c
(gdb)

workingjubilee commented 7 months ago

@he32 Odd. How much memory does your emulated aarch64 have access to?

he32 commented 6 months ago

My qemu-emulated arm64 system has 8GB allocated, and it emulates 4 CPU cores. This build was done with a concurrency of 3. However, that says nothing about what the default thread stack size is on this system. The default process stack limit is 8MB, but this build is run with "unlimited" stack size (and data and virtual size -- rust is a pig), so it is possible we're running into the hard limit for the stack size.

I've looked at https://github.com/rust-lang/rust/pull/122002 and applied it to 1.71.1 and I'm currently re-trying the build with that applied, though I'm not very hopeful it will make a difference.

It seems that on NetBSD/aarch64 the maximum process stack is 64MB, ref. vmparam.h's

#define MAXSSIZ         (1L << 26)      /* max stack size (64MB) */

Though an experimentation in the shell says something slightly different:

arm64: {14} tcsh
arm64: {1} limit | grep stack
stacksize    8192 kbytes
arm64: {2} unlimit stacksize
arm64: {3} limit | grep stack
stacksize    57344 kbytes
arm64: {4} uname -p
aarch64
arm64: {5}

Turns out the difference is due to address space layout randomization slop:

kern_pax.c:340:  maxsmap = MAXSSIZ - (MAXSSIZ / PAX_ASLR_MAX_STACK_WASTE);

If I read the code correctly (not a guarantee), the default thread stack size is inherited from the process resource limit.

apiraino commented 6 months ago

WG-prioritization assigning priority

@rustbot label -I-prioritize +P-low +regression-from-stable-to-stable

workingjubilee commented 6 months ago

Hmm. This is somewhat concerning and should not be happening on an aarch64 system, but I don't know if it's a problem on a native aarch64 system.

tnn2 commented 6 months ago

Happens on native 64-core aarch64 system as well, w/ rust 1.77.2.

he32 commented 6 months ago

In the mean time I have tried to use the cross-built (from amd64 targeting aarch64) rust compiler to build the dua-cli application, and that also fails, but differently, with

   Compiling libc v0.2.153
thread 'rustc' panicked at compiler/rustc_middle/src/ty/adt.rs:163:20:
already borrowed: BorrowMutError
stack backtrace:
   0:     0xfae527361630 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h1ec55b57570aa3f9
   1:     0xfae5273b145c - core::fmt::write::heca3334cc50eedc1
   2:     0xfae52736c514 - std::io::Write::write_fmt::hf71be477d18d5a93
   3:     0xfae52736147c - std::sys_common::backtrace::print::h8347cda59418217a
   4:     0xfae52736df9c - std::panicking::default_hook::{{closure}}::h2ec5654eb659425e
   5:     0xfae52736dcd4 - std::panicking::default_hook::h132218cc9a53bfdb
   6:     0xfae527f66e50 - <alloc[7fe6970a91364213]::boxed::Box<rustc_driver_impl[a73dbef15413f2fc]::install_ice_hook::{closure#0}> as core[cf788cbd67d9cb46]::ops::function::Fn<(&dyn for<'a, 'b> core[cf788cbd67d9cb46]::ops::function::Fn<(&'a core[cf788cbd67d9cb46]::panic::panic_info::PanicInfo<'b>,), Output = ()> + core[cf788cbd67d9cb46]::marker::Send + core[cf788cbd67d9cb46]::marker::Sync, &core[cf788cbd67d9cb46]::panic::panic_info::PanicInfo)>>::call
   7:     0xfae52736e6b0 - std::panicking::rust_panic_with_hook::hd1176e54b2c4a25f
   8:     0xfae527361a14 - std::panicking::begin_panic_handler::{{closure}}::hf3632232aca1d87c
   9:     0xfae527361880 - std::sys_common::backtrace::__rust_end_short_backtrace::haf65179ce4d51b9d
  10:     0xfae52736e30c - rust_begin_unwind
  11:     0xfae527331e28 - core::panicking::panic_fmt::h4d0ba04ac0fd863e
  12:     0xfae527332494 - core::cell::panic_already_borrowed::h4b45bfca87ef6de7
  13:     0xfae52cd4343c - <rustc_middle[a17088d6ed3e85e4]::ty::adt::AdtDefData as rustc_data_structures[a070f7366e5d4d7f]::stable_hasher::HashStable<rustc_query_system[8398c61eb1ea4197]::ich::hcx::StableHashingContext>>::hash_stable
...

and the stack backtrace has 48 entries total. So ... this may not stem from the same underlying issue, or it might. It should, though, provide a test that can be replicated on other aarch64 systems relatively easily. So what is the state of testing of rust on other aarch64 targets?

workingjubilee commented 6 months ago

rustc passes the test suite on every commit for all the tests that are not specifically ignored for aarch64-unknown-linux-gnu, and while there are certainly a few of those they are not terribly numerous.

riastradh commented 4 months ago

Can the stack overflow error be convinced to report exactly what parameters (stack pointer, stack bounds, guard page addresses, whatever) led it to conclude the stack overflowed? And is there some way to determine whether this rustc thread that crashed is the process's main thread or a non-main thread created with pthread_create?

I wouldn't be surprised if there were still something wrong with the stack guard detection logic after https://github.com/rust-lang/rust/pull/122002. The resolution the PR settled on sounded fine but I didn't do anything to test it myself. It might be worthwhile to verify with a Rust program that the full range of stack space, from base to soft rlimit, can be written to and read from without crashing.

riastradh commented 4 months ago

Unfortunately, gdb decided not to play ball with this one:

@he32 I wonder if trying a newer gdb, either with build.sh -V MKCROSSGDB=yes tools, or from pkgsrc devel/gdb, might help to examine the core dump?

workingjubilee commented 4 months ago

Happens on native 64-core aarch64 system as well, w/ rust 1.77.2.

thank you for the confirmed repro, by the way! that's weird.

workingjubilee commented 4 months ago

Can the stack overflow error be convinced to report exactly what parameters (stack pointer, stack bounds, guard page addresses, whatever) led it to conclude the stack overflowed? And is there some way to determine whether this rustc thread that crashed is the process's main thread or a non-main thread created with pthread_create?

Hmm. I feel like I would hesitate before adding quite so much code to libstd, though maybe I'm just not imagining how slim it could be made. While I've expressed my thoughts about good diagnostics requiring some effort, there is still a limit to what should be done for everyone implicitly.

However, rustc is its own program, and thus has no concerns like "accommodate smaller programs that don't want a lot of chaff in their binary". It is already not a small program. It is in fact several hundred megabytes of program. A few more bytes won't hurt much as long as they're actually useful from time to time (and not in the middle of a hot path, so don't affect icache too much). So on some platforms it has its own signal handler enabled, which tries to be much more informative.

However, that handler uses backtrace_symbols_fd, so it's only available on platforms with that function in the libc. But it would make sense, to me, for NetBSD to patch in the handler if you also link in libexecinfo. You can find the controlling cfg in the Rust source that you would have to patch here:

https://github.com/rust-lang/rust/blob/4bc39f028d14c24b04dd17dc425432c6ec354536/compiler/rustc_driver_impl/src/lib.rs#L90-L98

workingjubilee commented 4 months ago

As for the "which thread is it" question, it is a spawned thread via our threadpool builder: https://github.com/rust-lang/rust/blob/d371d17496f2ce3a56da76aa083f4ef157572c20/compiler/rustc_interface/src/util.rs#L84-L117

snowkat commented 3 months ago

From testing on a VM on a native aarch64 system with NetBSD 10.0, it seems the stack exhaustion issue started with #120188. After setting has_thread_local: false for the NetBSD target, I was able to build rust 1.77.1 without any issue.

Since NetBSD does support TLS (and I believe x86_64-unknown-netbsd is fine?), I'd assume reverting the change for NetBSD wouldn't constitute a solution. Hopefully it helps narrow down where the issue is occurring though.

tnn2 commented 3 months ago

From testing on a VM on a native aarch64 system with NetBSD 10.0, it seems the stack exhaustion issue started with #120188. After setting has_thread_local: false for the NetBSD target, I was able to build rust 1.77.1 without any issue.

This is an important clue. NetBSD on aarch64 has a known bug in it's TLS implementation. There is a bug report with a patch and with the patch applied the problem is no longer reproducable for me. @riastradh can we please just commit the patch without a test case since more and more things are breaking without it?

Can has_thread_local in rust be made conditional on OS version after we know which stable NetBSD version will have the fix?

riastradh commented 3 months ago

From testing on a VM on a native aarch64 system with NetBSD 10.0, it seems the stack exhaustion issue started with #120188. After setting has_thread_local: false for the NetBSD target, I was able to build rust 1.77.1 without any issue.

This is an important clue. NetBSD on aarch64 has a known bug in it's TLS implementation. There is a bug report with a patch and with the patch applied the problem is no longer reproducable for me. @riastradh can we please just commit the patch without a test case since more and more things are breaking without it?

I added a test case, verified it crashes in the releng testbed, and committed the fix, so if anyone wants to try with an ld.elf_so built with src/libexec/ld.elf_so/arch/aarch64/rtld_start.S rev. 1.6, that would be helpful to determine whether this bug was the culprit. (@he32?)

Can has_thread_local in rust be made conditional on OS version after we know which stable NetBSD version will have the fix?

(This will almost certainly be 11.0 and 10.1, and possibly 9.5 if there is one.)

workingjubilee commented 3 months ago

@snowkat Thank you for diagnosing this!

@tnn2

Can has_thread_local in rust be made conditional on OS version after we know which stable NetBSD version will have the fix?

No. Even if you can find some rude hack to allow it, it will be fairly deeply flawed. The Rust compiler lacks a way for people to conveniently compile for a specific OS version, so we define targets in terms of the minimum version we support. This has even led to the somewhat odd case of having a "windows" and a "win7" target.

There are two realistic options here: remove has_thread_local from our aarch64-unknown-netbsd target, or immediately upgrade the NetBSD minimum to the relevant version for that architecture. We do allow targets to specify varying OS versions, even minor versions, based on the full target tuple. Saying "NetBSD on aarch64 requires NetBSD 10.1, which came out... next month?!" is fine by me. Time travelers will have our full support.

As the on-file maintainer, I am inclined to defer to what @he32 prefers here.

@riastradh Thank you for committing the fix.

riastradh commented 3 months ago

(We don't actually know yet whether what I committed fixes the issue. It could be a red herring. All I know is that it fixed the issue that we saw in Firefox. To be confirmed by a new build.)

I suggest disabling has_thread_local on the aarch64-unknown-netbsd target (or aarch64--netbsd, which is what the GNU platform triple normally is, not sure whether that discrepancy will make a difference), if the setting can't reasonably be conditional on the OS version. A small performance penalty (not even a regression since Rust didn't use TLS before on aarch64--netbsd) for some niche use cases is better than breaking the build for everyone on all released versions of NetBSD.

tnn2 commented 3 months ago

I agree it's best to disable has_thread_local, with a note that it should be re-enabled with a suitable minimum OS version bump at a future date. If we can avoid disabling it for x86 that would be preferable but we can also conditionally enable it per platform in the package manager for NetBSD's vendor builds of rust for those users who would like to have the feature.

workingjubilee commented 3 months ago

I agree it's best to disable has_thread_local, with a note that it should be re-enabled with a suitable minimum OS version bump at a future date. If we can avoid disabling it for x86 that would be preferable but we can also conditionally enable it per platform in the package manager for NetBSD's vendor builds of rust for those users who would like to have the feature.

All of our targets are defined by the full tuple. There is no such thing as a target, according to the Rust compiler, that does not have its target definition depend on both the operating system and the architecture. In particular, the spec for this target is defined here:

https://github.com/rust-lang/rust/blob/8bfcae730a5db2438bbda72796175bba21427be1/compiler/rustc_target/src/spec/targets/aarch64_unknown_netbsd.rs#L15-L21

That line:

            ..base::netbsd::opts()

fills in the remaining fields with the NetBSD defaults, and will only fill in fields that were not explicitly passed as part of the constructor.

he32 commented 3 months ago

I agree it's best to disable has_thread_local, with a note that it should be re-enabled with a suitable minimum OS version bump at a future date. If we can avoid disabling it for x86 that would be preferable but we can also conditionally enable it per platform in the package manager for NetBSD's vendor builds of rust for those users who would like to have the feature.

All of our targets are defined by the full tuple. There is no such thing as a target, according to the Rust compiler, that does not have its target definition depend on both the operating system and the architecture. In particular, the spec for this target is defined here:

https://github.com/rust-lang/rust/blob/8bfcae730a5db2438bbda72796175bba21427be1/compiler/rustc_target/src/spec/targets/aarch64_unknown_netbsd.rs#L15-L21

That line:
            ..base::netbsd::opts()
fills in the remaining fields with the NetBSD defaults, and will only fill in fields that were not explicitly passed as part of the constructor.

I don't think we are prepared to require a not-yet-released NetBSD version 11.0, 10.1 or 9.5 or a pre-release of any of those for working rust on aarch64*, while we continue at least in name to support 9.0 and onwards in general for pkgsrc.

So in order to attempt to get a working new rust on aarch64* for NetBSD, I have committed

https://github.com/NetBSD/pkgsrc-wip/commit/a90cd31d1e829f098a3010eda6e5eed0bcc94a3e

I know, two functionally disparate changes in one commit is frowned on, but at least this is what I'm running with at the moment; verification of a native build will need to wait till after the weekend if noone else beats me to it.

Hm, I probably would need to re-build 1.78.0 for aarch64 with that change applied as well, and re-upload the corresponding aarch64 bits. Or version those...

workingjubilee commented 3 months ago

@he32 Yep, that looks correct!

Please upstream the patch once you have verified it works! Or by August 22 even if you don't.

he32 commented 3 months ago

OK, I have a first indication of success: the cross-compiled aarch64 rust compiler cross-built from an amd64 system with the above change applied, managed to build the dua-cli utility, and the utility works as expected as far as I tested it.

Next on the plan is to build 1.78.0 with this fix applied, and try to build rust 1.79.0 natively on aarch64, but I think the above is sufficient proof that this is the right fix, at least for now.

rust-lang / rust

rust 1.77.1 fails to build on aarch64-unknown-netbsd with stack exhaustion #123551

Meta