Open he32 opened 7 months ago
@he32 Odd. How much memory does your emulated aarch64 have access to?
My qemu-emulated arm64 system has 8GB allocated, and it emulates 4 CPU cores. This build was done with a concurrency of 3. However, that says nothing about what the default thread stack size is on this system. The default process stack limit is 8MB, but this build is run with "unlimited" stack size (and data and virtual size -- rust is a pig), so it is possible we're running into the hard limit for the stack size.
I've looked at https://github.com/rust-lang/rust/pull/122002 and applied it to 1.71.1 and I'm currently re-trying the build with that applied, though I'm not very hopeful it will make a difference.
It seems that on NetBSD/aarch64 the maximum process stack is 64MB, ref. vmparam.h's
#define MAXSSIZ (1L << 26) /* max stack size (64MB) */
Though an experiment in the shell says something slightly different:
arm64: {14} tcsh
arm64: {1} limit | grep stack
stacksize 8192 kbytes
arm64: {2} unlimit stacksize
arm64: {3} limit | grep stack
stacksize 57344 kbytes
arm64: {4} uname -p
aarch64
arm64: {5}
Turns out the difference is due to address space layout randomization slop:
kern_pax.c:340: maxsmap = MAXSSIZ - (MAXSSIZ / PAX_ASLR_MAX_STACK_WASTE);
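As a sanity check on that arithmetic, here is a minimal sketch that reproduces the 57344 kbytes figure from the shell session. The divisor value of 8 is an assumption inferred from the observed limit (64MB minus 8MB), not read out of kern_pax.c, and the function name is made up for illustration.

```rust
/// Maximum stack size on NetBSD/aarch64, from vmparam.h: 1 << 26 bytes (64MB).
const MAXSSIZ: u64 = 1 << 26;

/// ASSUMPTION: divisor inferred from the observed 57344 kbytes limit,
/// not copied from the kernel source.
const PAX_ASLR_MAX_STACK_WASTE: u64 = 8;

/// Mirrors the kern_pax.c expression: the stack cap that remains once the
/// ASLR slop has been reserved. (Hypothetical helper name.)
fn max_stack_after_aslr(maxssiz: u64, waste_divisor: u64) -> u64 {
    maxssiz - maxssiz / waste_divisor
}

fn main() {
    let maxsmap = max_stack_after_aslr(MAXSSIZ, PAX_ASLR_MAX_STACK_WASTE);
    // Prints "57344 kbytes", matching `limit | grep stack` after `unlimit stacksize`.
    println!("{} kbytes", maxsmap / 1024);
}
```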
If I read the code correctly (not a guarantee), the default thread stack size is inherited from the process resource limit.
WG-prioritization assigning priority
@rustbot label -I-prioritize +P-low +regression-from-stable-to-stable
Hmm. This is somewhat concerning and should not be happening on an aarch64 system, but I don't know if it's a problem on a native aarch64 system.
Happens on native 64-core aarch64 system as well, w/ rust 1.77.2.
In the meantime I have tried to use the cross-built (from amd64 targeting aarch64) rust compiler to build the dua-cli application, and that also fails, but differently, with
Compiling libc v0.2.153
thread 'rustc' panicked at compiler/rustc_middle/src/ty/adt.rs:163:20:
already borrowed: BorrowMutError
stack backtrace:
0: 0xfae527361630 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h1ec55b57570aa3f9
1: 0xfae5273b145c - core::fmt::write::heca3334cc50eedc1
2: 0xfae52736c514 - std::io::Write::write_fmt::hf71be477d18d5a93
3: 0xfae52736147c - std::sys_common::backtrace::print::h8347cda59418217a
4: 0xfae52736df9c - std::panicking::default_hook::{{closure}}::h2ec5654eb659425e
5: 0xfae52736dcd4 - std::panicking::default_hook::h132218cc9a53bfdb
6: 0xfae527f66e50 - <alloc[7fe6970a91364213]::boxed::Box<rustc_driver_impl[a73dbef15413f2fc]::install_ice_hook::{closure#0}> as core[cf788cbd67d9cb46]::ops::function::Fn<(&dyn for<'a, 'b> core[cf788cbd67d9cb46]::ops::function::Fn<(&'a core[cf788cbd67d9cb46]::panic::panic_info::PanicInfo<'b>,), Output = ()> + core[cf788cbd67d9cb46]::marker::Send + core[cf788cbd67d9cb46]::marker::Sync, &core[cf788cbd67d9cb46]::panic::panic_info::PanicInfo)>>::call
7: 0xfae52736e6b0 - std::panicking::rust_panic_with_hook::hd1176e54b2c4a25f
8: 0xfae527361a14 - std::panicking::begin_panic_handler::{{closure}}::hf3632232aca1d87c
9: 0xfae527361880 - std::sys_common::backtrace::__rust_end_short_backtrace::haf65179ce4d51b9d
10: 0xfae52736e30c - rust_begin_unwind
11: 0xfae527331e28 - core::panicking::panic_fmt::h4d0ba04ac0fd863e
12: 0xfae527332494 - core::cell::panic_already_borrowed::h4b45bfca87ef6de7
13: 0xfae52cd4343c - <rustc_middle[a17088d6ed3e85e4]::ty::adt::AdtDefData as rustc_data_structures[a070f7366e5d4d7f]::stable_hasher::HashStable<rustc_query_system[8398c61eb1ea4197]::ich::hcx::StableHashingContext>>::hash_stable
...
and the stack backtrace has 48 entries total. So ... this may not stem from the same underlying issue, or it might. It should, though, provide a test that can be replicated on other aarch64 systems relatively easily. So what is the state of testing of rust on other aarch64 targets?
rustc passes the test suite on every commit for all the tests that are not specifically ignored for aarch64-unknown-linux-gnu, and while there are certainly a few of those they are not terribly numerous.
Can the stack overflow error be convinced to report exactly what parameters (stack pointer, stack bounds, guard page addresses, whatever) led it to conclude the stack overflowed? And is there some way to determine whether this rustc thread that crashed is the process's main thread or a non-main thread created with pthread_create?
I wouldn't be surprised if there were still something wrong with the stack guard detection logic after https://github.com/rust-lang/rust/pull/122002. The resolution the PR settled on sounded fine but I didn't do anything to test it myself. It might be worthwhile to verify with a Rust program that the full range of stack space, from base to soft rlimit, can be written to and read from without crashing.
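A reduced sketch of that kind of check, using only the standard library, might look like the following. It is not the full test described above: it probes a spawned thread with an explicitly requested stack size rather than the main thread up to the soft rlimit (reading the rlimit would need `getrlimit`, e.g. via the libc crate), and it deliberately touches only about half of the requested stack to leave headroom for frame overhead. The 8 MiB size and the function names are placeholders.

```rust
use std::hint::black_box;
use std::thread;

/// Bytes reserved in each stack frame; the first and last byte are written and read back.
const CHUNK: usize = 4096;

/// Recurse `frames` deep, writing to and reading back a CHUNK-sized buffer
/// in every frame. Returns the number of frames that were verified.
fn touch_stack(frames: usize) -> usize {
    if frames == 0 {
        return 0;
    }
    let mut buf = [0u8; CHUNK];
    // black_box keeps the compiler from optimizing the buffer away.
    let buf = black_box(&mut buf);
    let tag = frames as u8;
    buf[0] = tag;
    buf[CHUNK - 1] = tag;
    assert_eq!((buf[0], buf[CHUNK - 1]), (tag, tag));
    1 + touch_stack(frames - 1)
}

/// Run the probe on a freshly spawned thread with an explicit stack size,
/// touching roughly half of it.
fn probe(stack_bytes: usize) -> usize {
    let frames = stack_bytes / 2 / CHUNK;
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(move || touch_stack(frames))
        .expect("failed to spawn probe thread")
        .join()
        .expect("probe thread panicked")
}

fn main() {
    // Placeholder size; match it to the limit actually under test.
    let stack_bytes = 8 * 1024 * 1024;
    println!("verified {} frames", probe(stack_bytes));
}
```

If the guard page detection is off, a probe like this would be expected to die with the stack overflow message well before the requested size is reached, which at least narrows down where the usable range ends.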
Unfortunately, gdb decided not to play ball with this one:
@he32 I wonder if trying a newer gdb, either with build.sh -V MKCROSSGDB=yes tools, or from pkgsrc devel/gdb, might help to examine the core dump?
Happens on native 64-core aarch64 system as well, w/ rust 1.77.2.
thank you for the confirmed repro, by the way! that's weird.
Can the stack overflow error be convinced to report exactly what parameters (stack pointer, stack bounds, guard page addresses, whatever) led it to conclude the stack overflowed? And is there some way to determine whether this rustc thread that crashed is the process's main thread or a non-main thread created with pthread_create?
Hmm. I feel like I would hesitate before adding quite so much code to libstd, though maybe I'm just not imagining how slim it could be made. While I've expressed my thoughts about good diagnostics requiring some effort, there is still a limit to what should be done for everyone implicitly.
However, rustc is its own program, and thus has no concerns like "accommodate smaller programs that don't want a lot of chaff in their binary". It is already not a small program. It is in fact several hundred megabytes of program. A few more bytes won't hurt much as long as they're actually useful from time to time (and not in the middle of a hot path, so don't affect icache too much). So on some platforms it has its own signal handler enabled, which tries to be much more informative.
However, that handler uses backtrace_symbols_fd, so it's only available on platforms with that function in the libc. But it would make sense, to me, for NetBSD to patch in the handler if you also link in libexecinfo. You can find the controlling cfg in the Rust source that you would have to patch here:
As for the "which thread is it" question, it is a spawned thread via our threadpool builder: https://github.com/rust-lang/rust/blob/d371d17496f2ce3a56da76aa083f4ef157572c20/compiler/rustc_interface/src/util.rs#L84-L117
From testing on a VM on a native aarch64 system with NetBSD 10.0, it seems the stack exhaustion issue started with #120188. After setting has_thread_local: false for the NetBSD target, I was able to build rust 1.77.1 without any issue.
Since NetBSD does support TLS (and I believe x86_64-unknown-netbsd is fine?), I'd assume reverting the change for NetBSD wouldn't constitute a solution. Hopefully it helps narrow down where the issue is occurring though.
This is an important clue. NetBSD on aarch64 has a known bug in its TLS implementation. There is a bug report with a patch, and with the patch applied the problem is no longer reproducible for me. @riastradh can we please just commit the patch without a test case since more and more things are breaking without it?
Can has_thread_local in rust be made conditional on OS version after we know which stable NetBSD version will have the fix?
I added a test case, verified it crashes in the releng testbed, and committed the fix, so if anyone wants to try with an ld.elf_so built with src/libexec/ld.elf_so/arch/aarch64/rtld_start.S rev. 1.6, that would be helpful to determine whether this bug was the culprit. (@he32?)
Can has_thread_local in rust be made conditional on OS version after we know which stable NetBSD version will have the fix?
(This will almost certainly be 11.0 and 10.1, and possibly 9.5 if there is one.)
@snowkat Thank you for diagnosing this!
@tnn2
Can has_thread_local in rust be made conditional on OS version after we know which stable NetBSD version will have the fix?
No. Even if you can find some rude hack to allow it, it will be fairly deeply flawed. The Rust compiler lacks a way for people to conveniently compile for a specific OS version, so we define targets in terms of the minimum version we support. This has even led to the somewhat odd case of having a "windows" and a "win7" target.
There are two realistic options here: remove has_thread_local from our aarch64-unknown-netbsd target, or immediately upgrade the NetBSD minimum to the relevant version for that architecture. We do allow targets to specify varying OS versions, even minor versions, based on the full target tuple. Saying "NetBSD on aarch64 requires NetBSD 10.1, which came out... next month?!" is fine by me. Time travelers will have our full support.
As the on-file maintainer, I am inclined to defer to what @he32 prefers here.
@riastradh Thank you for committing the fix.
(We don't actually know yet whether what I committed fixes the issue. It could be a red herring. All I know is that it fixed the issue that we saw in Firefox. To be confirmed by a new build.)
I suggest disabling has_thread_local on the aarch64-unknown-netbsd target (or aarch64--netbsd, which is what the GNU platform triple normally is, not sure whether that discrepancy will make a difference), if the setting can't reasonably be conditional on the OS version. A small performance penalty (not even a regression since Rust didn't use TLS before on aarch64--netbsd) for some niche use cases is better than breaking the build for everyone on all released versions of NetBSD.
I agree it's best to disable has_thread_local, with a note that it should be re-enabled with a suitable minimum OS version bump at a future date. If we can avoid disabling it for x86 that would be preferable but we can also conditionally enable it per platform in the package manager for NetBSD's vendor builds of rust for those users who would like to have the feature.
All of our targets are defined by the full tuple. There is no such thing as a target, according to the Rust compiler, that does not have its target definition depend on both the operating system and the architecture. In particular, the spec for this target is defined here:
That line:
..base::netbsd::opts()
fills in the remaining fields with the NetBSD defaults, and will only fill in fields that were not explicitly passed as part of the constructor.
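To illustrate how that struct update syntax behaves, here is a self-contained sketch. The `TargetOptions` struct below is a hypothetical, heavily reduced stand-in, not rustc's real type, and `netbsd_opts` / `aarch64_netbsd_opts` are made-up names standing in for the base defaults and the per-target definition. It also assumes the shared NetBSD defaults are where has_thread_local gets enabled.

```rust
/// Hypothetical, heavily reduced stand-in for rustc's target options.
#[derive(Debug, Clone, PartialEq)]
struct TargetOptions {
    os: String,
    has_thread_local: bool,
    max_atomic_width: Option<u64>,
}

/// Stand-in for the shared NetBSD defaults.
/// ASSUMPTION: has_thread_local is enabled here, for every NetBSD architecture.
fn netbsd_opts() -> TargetOptions {
    TargetOptions {
        os: "netbsd".to_string(),
        has_thread_local: true,
        max_atomic_width: None,
    }
}

/// Stand-in for a per-architecture target: fields named explicitly win,
/// and `..netbsd_opts()` only supplies the fields that were not named.
fn aarch64_netbsd_opts() -> TargetOptions {
    TargetOptions {
        max_atomic_width: Some(128),
        has_thread_local: false, // override for this architecture only
        ..netbsd_opts()
    }
}

fn main() {
    println!("base:    {:?}", netbsd_opts());
    println!("aarch64: {:?}", aarch64_netbsd_opts());
}
```

In this sketch the override stays local to the one target function, so other architectures that pull in the same defaults keep has_thread_local enabled.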
I don't think we are prepared to require a not-yet-released NetBSD version 11.0, 10.1 or 9.5 or a pre-release of any of those for working rust on aarch64*, while we continue at least in name to support 9.0 and onwards in general for pkgsrc.
So in order to attempt to get a working new rust on aarch64* for NetBSD, I have committed
https://github.com/NetBSD/pkgsrc-wip/commit/a90cd31d1e829f098a3010eda6e5eed0bcc94a3e
I know, two functionally disparate changes in one commit is frowned on, but at least this is what I'm running with at the moment; verification of a native build will need to wait till after the weekend if no one else beats me to it.
Hm, I probably would need to re-build 1.78.0 for aarch64 with that change applied as well, and re-upload the corresponding aarch64 bits. Or version those...
@he32 Yep, that looks correct!
Please upstream the patch once you have verified it works! Or by August 22 even if you don't.
OK, I have a first indication of success: the cross-compiled aarch64 rust compiler, cross-built from an amd64 system with the above change applied, managed to build the dua-cli utility, and the utility works as expected as far as I tested it.
Next on the plan is to build 1.78.0 with this fix applied, and try to build rust 1.79.0 natively on aarch64, but I think the above is sufficient proof that this is the right fix, at least for now.
I tried building rust 1.77.1 using the internal LLVM on an emulated aarch64 system, as part of an effort of putting a new rust version through its testing cycle to keep it working on our various NetBSD platforms.
I expected to see this happen: The build should complete
Instead, this happened: The build of 1.77.1 fails with stack exhaustion. As a contrast, 1.76.0 succeeds on this same host.
Meta
rustc --version --verbose:
As this is in the middle of the build, it's a little unclear which version of rustc is running at this point, though indications are that it's 1.77.1 in one of the bootstrap stages. Also, trying to get the bootstrap compiler to run from the CLI is proving challenging:
I hesitate to run x.py build -vvv, both because it itself probably requires its own settings / environment variables, and for fear that it will turn into a multi-hour endeavour.
The build log ends with:
Unfortunately, gdb decided not to play ball with this one: