Closed DeltaF1 closed 6 months ago
Alignment issues on thread local variables... where have I heard this before?
Still, this looks like a pretty complex situation, most likely relating back to the TLS code in libctru
.
@ian-h-chamberlain since you made those changes, can you run similar tests again? It looks like nothing changed in the mean time regarding TLS in libctru
, so it's very possible we've always used misaligned pointers (and I wonder if that has anything to do with corrupted TLS destructors)...
oh no, hopefully this isn't another case of similar shenanigans, that one took ages to track down...
I am able to reproduce using the hashmaps
example, once I updated my toolchain to something recent enough. I'll see if I can figure out why the pointer is misaligned and go from there.
Edit: ok, so far this is the smallest reproduction I've found. I'm trying to inline all the calls there and see if I can narrow it down further but haven't been able to get it so far:
fn main() {
ctru::use_panic_handler();
let _s = std::collections::hash_map::RandomState::new();
}
It does seem important that the ctru::use_panic_handler
is called, so I wonder if there is some additional thread-local arrangement happening in std::panic::set_hook
or something...
I just tested without use_panic_handler
and using the debugger I could see it call the same misaligned pointer panic, so it may not be related.
Yeah, most likely the panic happens regardless but the application exits normally. That’s one of the nice features of our toolchain.
I just tested without
use_panic_handler
and using the debugger I could see it call the same misaligned pointer panic, so it may not be related.
Hmm, that's interesting - in my case I was also checking with the debugger (with a breakpoint on rust_panic
) but the app exited instead of hitting my breakpoint... I wonder what the difference there is? I was mostly using Citra instead of real hardware but I wouldn't think that should make much difference, especially since I did reproduce it when using the custom panic hook.
The next thing I haven't looked at yet is actually examining the thread local section of the executable, which might provide some clues why the TLS is misaligned. I've been busy for the last few weeks but should have some time to dig in this week.
I recommend putting a breakpoint in core/src/panicking.rs
Line ~169 panic_misaligned_pointer_dereference
, that's where it's ending up for me (only tested on hardware just now but I was using Citra to debug when I first opened this issue, so I think they're both the same).
Ok, I figured out that in the case where I wasn't reproducing the issue, it was because my "custom" thread-local was aligned correctly, but the one in std
is misaligned:
Edit: I might be mistaken about this and it's just a coincidence, need to re-learn some stuff about TLS and ELF structure I think :sweat_smile:
broken case (using arm-none-eabi-objdump --demangle --syms -s --section=.tbss ../target/armv6k-nintendo-3ds/debug/examples/thread-local-minimal.elf
):
../target/armv6k-nintendo-3ds/debug/examples/thread-local-minimal.elf: file format elf32-littlearm
SYMBOL TABLE:
0023da30 l d .tbss 00000000 .tbss
00000c20 l .tbss 00000001 std::panicking::panic_count::LOCAL_PANIC_COUNT::__getit::STATE
00000c1c l .tbss 00000004 std::panicking::panic_count::LOCAL_PANIC_COUNT::__getit::VAL
00000c24 l .tbss 0000000c std::io::stdio::OUTPUT_CAPTURE::__getit::__KEY
00000c31 l .tbss 00000001 std::sync::remutex::current_thread_unique_ptr::X::__getit::STATE
00000c30 l .tbss 00000001 std::sync::remutex::current_thread_unique_ptr::X::__getit::VAL
00000c32 l .tbss 00000001 std::sys_common::thread_info::THREAD_INFO::__getit::STATE
0023da30 g .tbss 00000000 __preinit_array_start
00000014 g .tbss 00000802 __ctru_dev_utf16_buf
00000818 g .tbss 00000401 __ctru_dev_path_buf
00000c34 g .tbss 00000020 std::collections::hash::map::RandomState::new::KEYS::__getit::__KEY
0023da30 g .tbss 00000000 __preinit_array_end
Note that the location of RandomState::new::KEYS::__getit::__KEY
is not aligned to 8 (which I think is the expected alignment for this thread_local<Cell<(u64, u64)>>
).
When I try to inline all the code of RandomState
, things are ordered differently and the alignment works out:
../target/armv6k-nintendo-3ds/debug/examples/thread-local-minimal.elf: file format elf32-littlearm
SYMBOL TABLE:
0023da2c l d .tbss 00000000 .tbss
00000018 l .tbss 00000020 thread_local_minimal::main::KEYS::__getit::__KEY
00000c44 l .tbss 00000001 std::panicking::panic_count::LOCAL_PANIC_COUNT::__getit::STATE
00000c40 l .tbss 00000004 std::panicking::panic_count::LOCAL_PANIC_COUNT::__getit::VAL
00000c48 l .tbss 0000000c std::io::stdio::OUTPUT_CAPTURE::__getit::__KEY
00000c55 l .tbss 00000001 std::sync::remutex::current_thread_unique_ptr::X::__getit::STATE
00000c54 l .tbss 00000001 std::sync::remutex::current_thread_unique_ptr::X::__getit::VAL
00000c56 l .tbss 00000001 std::sys_common::thread_info::THREAD_INFO::__getit::STATE
0023da2c g .tbss 00000000 __preinit_array_start
00000038 g .tbss 00000802 __ctru_dev_utf16_buf
0000083c g .tbss 00000401 __ctru_dev_path_buf
Still trying to figure out why it's begin generated with the wrong offset like that and how to fix it, but at first glance I don't think it's totally related to the original TLS alignment issues we had, but I can't be sure yet.
Good news! I finally made some progress here and I think I've identified a solution, which will have to get delivered upstream to https://github.com/devkitPro/devkitarm-crtls
I'll try to have a write-up and post a PR tomorrow, but I will definitely need to get some input from the devkitPro folks since the change I have planned could affect other users too. Will post updates here as it develops.
Closed as result of the upstream changes in https://github.com/devkitPro/devkitarm-crtls/pull/7.
@ian-h-chamberlain you're incredible! The whole homebrew community has benefited from your investigations 🎉
sorry for reviving a dead issue but i don't know if this is worth opening a new one or if it's a weird edge-case of this issue
i have no idea how to describe this because i have no clue about how thread_local works but here's what i got
i can't figure out how to get backtrace
to properly work here (symbols weren't showing up for some reason) and i really don't want to manually look up symbols in nm
output, but if it's necessary i can dig, all i know is that directly before main
is std::rt::lang_start
, but if the backtrace symbols would help here i'll get em
project this is happening is https://github.com/patataofcourse/Barista (project is a big mess though and i assume stinks for reproducibility), and this error apparently only happens when compiling on certain machines, but that might have changed since last time i tried...? also, i know that nightly 2022-02-14 used to work but since updating ctru-rs, nightlies that old don't work
again if there's anything i can do lmk, and my discord is under the same username as here (patataofcourse) if anyone who can help prefers a more direct method of communication thanks in advance! ^^'
project this is happening is https://github.com/patataofcourse/Barista (project is a big mess though and i assume stinks for reproducibility), and this error apparently only happens when compiling on certain machines, but that might have changed since last time i tried...? also, i know that nightly 2022-02-14 used to work but since updating ctru-rs, nightlies that old don't work
This issue was resolved as a change in the upstream packages released by @devkitPro. If it is the same problem that was reported here (as it looks for how you describe it) then the fix should come from updating the dkA packages.
dkp-pacman -Syu
If this alone doesn’t cut it, try updating and refreshing the whole env:
rustup update
cargo clean
cargo update
refreshing the whole env didn't work either, and it's not the first time i've tried to refresh it either
the devkitarm-crtls
package is also up-to-date:
$ pacman -Q devkitarm-crtls
devkitarm-crtls 1.2.2-1
I've tried running your project on my New 3DS XL using the latest nightly, and I can't seem to find the error you are talking about. Please give me more information for me to help you, since reproducibility seems quite hard.
Also, I've found an ARM panic here: https://github.com/patataofcourse/Barista/blob/2b4c4f598aba6bd21f5bfe580298539cc5f151f9/src/format/saltwater_cfg.rs#L21
I don't know whether or not you have been able to write files to the SD card using ctru::fs
, but I HEAVILY suggest you to not use that module, as it is broken, unstable and incredibly outdated (almost all code dates back to before the project rewrite/reconstruction I started years ago). You should use std::fs
when working with the SD card and the ROMFS, that module is assured to work and offers full compatibility.
Regardless, with a bit of code cutting I managed to get to the UI stage just fine. Is there any way to reproduce the error I could use?
ah crap, i'm testing on Citra so the ARM panic is probably ignored by it. i'll change to std::fs but it'd be great if you guys could deprecate ctru::fs so that it being broken wouldn't have to be transmitted via word-of-mouth i'll fix the error you mentioned, and, well, i honestly have no clue what could be causing the misalignment panic could you try using my makefile to see if the issue's caused by that, as well as testing the project on citra?
thank you for all the help btw it means a lot
ah crap, i'm testing on Citra so the ARM panic is probably ignored by it. i'll change to std::fs but it'd be great if you guys could deprecate ctru::fs so that it being broken wouldn't have to be transmitted via word-of-mouth
Yeah, I’m sorry about that. Sadly other things took priority over it (especially since the std::fs
has always been working since the std
implementation), so I wasn’t able to change it. I should’ve at least put a warning, but I guess I just didn’t really expect people to use it 😅.
i'll fix the error you mentioned, and, well, i honestly have no clue what could be causing the misalignment panic could you try using my makefile to see if the issue's caused by that, as well as testing the project on citra?
The tests I ran were built using your own makefile (especially since it was needed to compile and use those other C libraries that I didn’t look into). As for Citra, well, we are working to use it as a quick way to test our code, but the general rule of thumb in the homebrew community is that it’s not reliable enough against to testing on real hardware. Regardless, I’ll try it if I have the chance, but I can already tell that it would be quite weird if that was actually the problem (in this case). Don’t you have any hardware to test with?
The tests I ran were built using your own makefile (especially since it was needed to compile and use those other C libraries that I didn’t look into).
oops i forgot that was a thing, makes sense you'd need to use the makefile then
As for Citra, well, we are working to use it as a quick way to test our code, but the general rule of thumb in the homebrew community is that it’s not reliable enough against to testing on real hardware. Regardless, I’ll try it if I have the chance, but I can already tell that it would be quite weird if that was actually the problem (in this case). Don’t you have any hardware to test with?
yeah, that is very fair, however i can't always test on hardware, especially when i'm not home. usual rule of thumb in the modding community i'm in is: do quick tests on Citra, and a final test on hardware to see if anything happens there, especially since usually you have to make a bunch of consecutive quick tests in case something is slightly off. usually there's not errors that are only on Citra, usually it's the opposite, but of course that's on modded commercial games, not homebrew.
either way, i'll remove every instance of ctru::fs and use std::fs instead, and I'll let you know if that fixes anything
I've tested with Citra, no errors in sight. 🤷♂️
removed any ctru::fs references, still the same error
I've tested with Citra, no errors in sight. 🤷♂️
okay that's very strange. mind if i send a pre-built .3dsx and .elf in case it's something weird in my rust setup? i'm on nightly 2023-09-27 btw, that should be latest
okay that's very strange. mind if i send a pre-built .3dsx and .elf in case it's something weird in my rust setup?
As I said earlier, the most likely cause is not an outdated Rust toolchain, as it may be an outdated dkARM (or an unproperly set one?). I know you've checked the crtls
versions earlier, but there isn't much room for possibilities at this point.
Send me the file however you want, you can also contact me on discord (username "meziu"), though I have to stop working on this for today.
It's true, even the hashmaps
example doesn't work properly anymore in debug
mode. (thanks for noticing @patataofcourse). It seems like this issue just won't stop coming back to haunt us, so we've got to play cat and mouse again to see what's the problem now.
My panic info:
misaligned pointer dereference: address must be a multiple of 0x8 but is 0x2afb2c
@ian-h-chamberlain I can't ask you to look into this again, since you've done more than enough and I'm sure you have other things to do, but could you please share your minimal-reproducibility tests (if you still have them) so I can try making sense of this myself?
what was the issue that was causing my and your builds to be different?
Ok, let's see... The first thing to test is probably the C program in https://github.com/rust3ds/ctru-rs/issues/60#issuecomment-1150529440 — that could point out if there was a regression in devkitARM. We might also be able to check some of the test cases listed here which were mentioned in https://github.com/KallistiOS/KallistiOS/pull/111
Those might at least eliminate the Rust toolchain as an option if they also reproduce an issue, and possibly point to a DKP regression. If not, then it sounds like the hashmap example is another good place to start. This tiny program was enough to reproduce before: https://github.com/rust3ds/ctru-rs/issues/124#issuecomment-1557427224
A lot of what I did to debug previously was dumping the addresses of various symbols to see if the alignment looked correct or not, e.g. something like this:
$ arm-none-eabi-readelf -W --demangle --syms target/armv6k-nintendo-3ds/debug/examples/thread-local-minimal.elf | awk 'NR == 3 || /BUF|tdata|tbss|tls|TLS/'
Symbol table '.symtab' contains 34409 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 00000000 0 NOTYPE LOCAL DEFAULT UND
10: 0029ebf4 0 SECTION LOCAL DEFAULT 15 .tdata
11: 0029ebf8 0 SECTION LOCAL DEFAULT 16 .tbss
1346: 00000c08 12 TLS LOCAL DEFAULT 16 pthread_3ds::thread_keys::LOCALS
3454: 00000c14 1 TLS LOCAL DEFAULT 16 std::sync::remutex::current_thread_unique_ptr::X::__getit::VAL
3455: 00000c15 1 TLS LOCAL DEFAULT 16 std::sync::remutex::current_thread_unique_ptr::X::__getit::STATE
6245: 00000c18 8 TLS LOCAL DEFAULT 16 std::panicking::panic_count::LOCAL_PANIC_COUNT::__getit::VAL
6246: 00000c20 1 TLS LOCAL DEFAULT 16 std::panicking::panic_count::LOCAL_PANIC_COUNT::__getit::STATE
9991: 00000c24 12 TLS LOCAL DEFAULT 16 std::io::stdio::OUTPUT_CAPTURE::__getit::__KEY
10259: 00000c30 16 TLS LOCAL DEFAULT 16 std::sys_common::thread_info::THREAD_INFO::__getit::VAL
10260: 00000c40 1 TLS LOCAL DEFAULT 16 std::sys_common::thread_info::THREAD_INFO::__getit::STATE
14997: 00000000 0 TLS LOCAL DEFAULT 16 $d
14998: 00000804 0 TLS LOCAL DEFAULT 16 $d
15001: 00000000 0 TLS LOCAL DEFAULT 16 _TLS_MODULE_BASE_
26733: 0029ebf4 0 NOTYPE GLOBAL DEFAULT 15 __tdata_lma
27665: 00000000 2050 TLS GLOBAL DEFAULT 16 __ctru_dev_utf16_buf
27813: 002a1b58 0 NOTYPE GLOBAL DEFAULT 21 __tls_end
29308: 00000804 1025 TLS GLOBAL DEFAULT 16 __ctru_dev_path_buf
29816: 00000c48 32 TLS GLOBAL DEFAULT 16 std::collections::hash::map::RandomState::new::KEYS::__getit::__KEY
30542: 00287940 0 NOTYPE GLOBAL DEFAULT 3 __tdata_align
32757: 002a0eec 0 NOTYPE GLOBAL DEFAULT 21 __tls_start
33555: 0029ebf4 0 NOTYPE GLOBAL DEFAULT 15 __tdata_lma_end
Usually std::collections::hash::map::RandomState::new::KEYS::__getit::__KEY
was the problematic one when trying to do stuff with RandomState
. It should always have alignment of 0x8
or higher, I think (as mentioned in the panic message) but sometimes I would see it with an address (Value
column) of 0x....4
which was a clue.
Man, it's crazy how this issue keeps coming back, we can't get a break! :sob: As you alluded to @Meziu I will probably take a back seat to debugging this time around as I'd like to focus on other areas of the project, but I can try to help if others get stuck.
I wonder... we should probably see if we can add some kind of test for this as part of #135 so we know if it happens again. I've been working on another push for that so I'm hoping to have something ready fairly soon! I think one of our tests builds a hashmap so maybe it would already catch this, but I think it's also kind of dependent on compiler optimizations so maybe an extra integration test or something wouldn't hurt.
Hi there,
I've recently run into this issue, and I think I might know why it's happening. I'm writing this at 7am not having slept, so it's possible I'm not quite aligned with reality, but:
The linker script 3dsx.ld
defines the symbol __tls_start
, which as far as I can tell is used as the starting point for thread-local storage. Previously, you fixed an issue with alignof(.tbss)
not being included for size calculations; however, I think this misses an important bug.
Again, I'm not sure I'm entirely lucid writing this, so definitely do check this - and I haven't read all the relevant code, either - but I'm fairly sure that, if .tdata
has size 0 (and perhaps in other situations, but .tdata
being 0 seems to be the trigger in my code), .tbss
ends up being placed exactly at __tls_start
. __tls_start
isn't necessarily aligned to alignof(.tbss)
. I'm sure you see the issue.
Luckily, it's quite a simple fix. Simply change line 162 of 3dsx.ld
to . = 8 + ABSOLUTE(ALIGN(ABSOLUTE(. - 8), MAX(ALIGNOF(.tdata), ALIGNOF(.tbss))));
. From my very basic testing (compiling my project and seeing if it crashes), this seems to work.
Perhaps there's a smarter solution involving aligning .tbss
manually on load, but this seems a bit easier and it's unlikely to waste more than a few bytes.
Hi there, I've recently run into this issue, and I think I might know why it's happening. I'm writing this at 7am not having slept, so it's possible I'm not quite aligned with reality, but: The linker script
3dsx.ld
defines the symbol__tls_start
, which as far as I can tell is used as the starting point for thread-local storage. Previously, you fixed an issue withalignof(.tbss)
not being included for size calculations; however, I think this misses an important bug. Again, I'm not sure I'm entirely lucid writing this, so definitely do check this - and I haven't read all the relevant code, either - but I'm fairly sure that, if.tdata
has size 0 (and perhaps in other situations, but.tdata
being 0 seems to be the trigger in my code),.tbss
ends up being placed exactly at__tls_start
.__tls_start
isn't necessarily aligned toalignof(.tbss)
. I'm sure you see the issue. Luckily, it's quite a simple fix. Simply change line 162 of3dsx.ld
to. = 8 + ABSOLUTE(ALIGN(ABSOLUTE(. - 8), MAX(ALIGNOF(.tdata), ALIGNOF(.tbss))));
. From my very basic testing (compiling my project and seeing if it crashes), this seems to work. Perhaps there's a smarter solution involving aligning.tbss
manually on load, but this seems a bit easier and it's unlikely to waste more than a few bytes.
Thank you for your research on the topic! I believe the ".tbss exactly at .__tls_start" was the original issue for some other quirks of this whole situation, so the conclusions don't seem that farfetched to me :sweat_smile:. However, the solution you proposed would also impact the usage of the whole C toolchain provided by devkitPro, so I'm going to need to analyze the situation deeper (to make sure it's the right call) before making the proposal for this change directly to them.
I would greatly appreciate if you were to delve into the problem more than you already did, but I don't want you to put your health at risk more than it seems you are doing already :laughing:. I will look into it when I have some time :+1:.
I couldn't help myself, and I looked a bit deeper into it.
I'm still trying to wrap my head around exactly what the thread initialisation code does. However, I think I might actually be massively overcomplicating it - I'm pretty sure we're interested in what happens in the main thread, i.e. the one that we're running in when main()
is called.
The libctru
initialisation process is spread over quite a few files. As far as I can tell, 3dsx_ctr0.o
comes from smealum and isn't actually in the libctru
GitHub repo. Just after clearing .bss
, it calls initSystem
, which is defined in libctru/source/system/stack_adjust.s
. That function then calls __libctru_init
, which finally is some C code, defined in libctru/source/system/ctru_init.c
.
That function initialises a few different things, but one of the functions it calls is __system_initSyscalls
. No idea why it's called that, since it doesn't seem to initialise any syscalls. Eh.
__system_initSyscalls
is defined in libctru/source/system/syscalls.c
. The first thing it does is call initThreadVars
with NULL
. When that function is called with NULL
, it sets the current thread (that's the main
thread at this point)'s thread-local storage directly to __tls_start
.
As far as I can tell without digging deep into LLVM internals, all accesses to thread-local storage, both .tdata
and .tbss
, are done by calling __aeabi_read_tp
(libctru/source/system/readtp.s
- it reads the pointer set in __system_initSyscalls
) and reading/writing relative to the returned pointer. I can't see any offsets being added to the pointer to deal with .tbss
alignment, so, again, __tls_start
is being used as the start of .tbss
incorrectly. I can't figure out how this is supposed to work, actually - is the compiler supposed to know where .tbss
is relative to .tdata
/__tls_start
and emit an add somewhere????
I have no idea how this is supposed to be working. Rust is little help, since the call to __aeabi_read_tp
is buried somewhere deep within LLVM.
Clearly, there's supposed to be some alignment happening, but I just haven't the foggiest where it's supposed to go. My workaround feels like it shouldn't work, now - what happens if .tdata
isn't size 0 and .tbss
needs to be aligned after it? - so please, please look into this if you can. I'm out of my depth at this point, I think.
I've officially decided to waste my sanity trying to solve this issue, since ctru-rs
is currently in a pretty nice state when it comes to supported features.
\~Deep breath\~... Ok, let's do this one more time.
I can't dedicate this problem much time right now, but I've started with some basic testing that was very much needed.
-Og
, not with -O0
, not with -O3
. Nothing reproduced it, so this likely isn't due to a dkP regression. Maybe Rust now makes even stricter checks when reading memory?hashmaps
example and my realtime-audio
code in debug
mode. Both yield a new non-unwinding
panic message: unsafe precondition(s) violated: ptr::read requires that the pointer argument is aligned and non-null
. I believe this error message is 1:1 to the one it yielded before, but it might've been recently added to the compiler. (the fact that it doesn't specify the actually correct alignment is... unsettling. We know it should be 0x8
though)hashmaps
example with a whole lot of different nightly
Rust toolchains too see whether the issue was present before/after it was reported and to see whether the current ctru-rs
and devkitARM
combination caused issues in "previously working" nightlies, but nothing I didn't already know popped up. The bug didn't reproduce for all of August, until in September some nightly started re-showing the message. Not too long ago the message was changed into what it is today. Still, nothing in std
or in the general inner workings for TLS variables seems to have changed since, so it's hard to pinpoint Rust as the culprit.So, from this first round of testing (and the research done by @Jhynjhiruu) I come to the conclusion that neither Rust nor devkitPro did anything that caused this behaviour. There is something fundamentally wrong (most likely in the whole setup handled by dkA
) that just needs to be tweaked. I'm still interested in understanding what was that made the issue re-appear, though.
Can you give an example of a nightly version that doesn't reproduce the crash? I have a suspicion about why that I'd like to confirm.
Can you give an example of a nightly version that doesn't reproduce the crash? I have a suspicion about why that I'd like to confirm.
I’m not home, but I definitely used “2023-09-25” that worked. I can’t really tell when they stop working, since the process was taking a lot of time and I couldn’t really “binary-search” my way to the exact date.
Agh. I hate threads. Look at this bit of assembly, generated by GCC from (a slightly modified version of) the C code from that other issue:
bl __aeabi_read_tp @ load_tp_soft
mov r3, r0
ldr r2, .L13
mov r1, #0
strb r1, [r3, r2]
...
.L13:
.word BUF_16(tpoff)
This corresponds to this line of C code:
BUF_16.inner[0] = 0;
Fairly simply, this loads the address of the thread-local storage using __aeabi_read_tp
, then reads the word at .L13
, adds it to the thread-local storage pointer, and stores a 0 there. Makes sense.
The word at .L13
is the offset of BUF_16
in a thread-local block. It's encoded in an ELF file as some offset and then an R_ARM_TLS_LE32
relocation. R_ARM_TLS_LE32
relocations are calculated as (S + A) - tp
, where S
is the symbol address, A
is the addend (0 in our case) and tp
is the thread-local storage pointer.
I've read through as many relevant specifications as I can, and as far as I can tell, the way that the symbol address relative to the thread-local storage pointer - i.e. the offset of the symbol in the thread-local storage block - is calculated isn't specified anywhere. I definitely read something saying that the .tdata
segment comes first, followed by the .tbss
section with appropriate alignment, but I can't find that any more.
Either way, the bug comes down to the way the symbol address in the thread-local storage block is calculated, i.e. the bit of code that deals with the tpoff
annotation in the assembly. I wasn't able to find the relevant code in either GNU binutils or LLVM source, but I suspect that's the place you'll want to look. I imagine that you'll want to look for where ARM relocations are handled.
I suspect that the implementation is making the assumption that the TLS block is aligned to max(.tbss, .tdata)
. I don't know whether that's a valid assumption to make - I can't find it in the spec anywhere.
If that assumption actually is backed up by the spec, then my linker script modification is actually the correct way to fix this issue. If it isn't, then, well, it works while the bug is fixed.
I've read through as many relevant specifications as I can, and as far as I can tell, the way that the symbol address relative to the thread-local storage pointer - i.e. the offset of the symbol in the thread-local storage block - is calculated isn't specified anywhere. I definitely read something saying that the
.tdata
segment comes first, followed by the.tbss
section with appropriate alignment, but I can't find that any more.
Yes, I believe this is up to the compiler (or maybe actually the linker?), to allow for optimizations with alignment, packing etc. I have anecdotally found that gcc
from the devkitPro toolchain tends to lay the variables out in declaration order (maybe this is required by the C standard or something), while rustc
is a bit more flexible (and, since the standard library includes a number of thread-locals, it can be a bit harder to isolate local changes unless you use #[no_std]
).
Either way, the bug comes down to the way the symbol address in the thread-local storage block is calculated, i.e. the bit of code that deals with the
tpoff
annotation in the assembly. I wasn't able to find the relevant code in either GNU binutils or LLVM source, but I suspect that's the place you'll want to look. I imagine that you'll want to look for where ARM relocations are handled. I suspect that the implementation is making the assumption that the TLS block is aligned tomax(.tbss, .tdata)
. I don't know whether that's a valid assumption to make - I can't find it in the spec anywhere.
Yeah, when I was working on the earlier bugfix it seemed like the best way to check on the output was to actually examine the numerical value of e.g. .L13
in your example above.
I haven't had a chance to look at this issue or your investigation too in-depth, but offhand I think your proposed solution to the linker script seems reasonable and not too likely to break anything. If you wanted to put it to a test, some of the Dreamcast homebrew folks have made a test for similar issues that might be useful to adapt, or you can also just keep trying with the same code from that other issue. https://github.com/KallistiOS/KallistiOS/pull/111/files#diff-bcbd57e8ac3c78995f3d91366794cff6a15d98784b172f07c14538878eefd776
It definitely seems to me like wasting a few bytes in favor of correctness would be worth it, in any case.
If I find some time soon I can try to look again in more detail, but it seems like you both have already done a pretty thorough investigation, and it also sounds like there is a proposed code change that might fix it, so that's good! Here's hoping this is the last bug of its kind because these have been brutal each time!
Hopefully (🤞) that devkitPro PR will fix this issue for good. Here's hoping this issue can be closed after less than a year...
This is only tangentially related to the issue at hand, but I wanted to ask in case of future problems: is the total size of tbss and tdata kept under 128 bytes (0x80) by the linker? I'm just wondering if there could be a case where so many thread local variables are declared in C or Rust that they overflow into the IPC area. Going even further, could devkitpro explicitly declare the IPC command areas as sections/symbols that can be exported to C/Rust code? Then we wouldn't have to manually get the TLS pointers anymore when writing IPC code. (I am a total noob when it comes to how linkers work so apologies if this is a dumb question)
This is only tangentially related to the issue at hand, but I wanted to ask in case of future problems: is the total size of tbss and tdata kept under 128 bytes (0x80) by the linker? I'm just wondering if there could be a case where so many thread local variables are declared in C or Rust that they overflow into the IPC area. Going even further, could devkitpro explicitly declare the IPC command areas as sections/symbols that can be exported to C/Rust code? Then we wouldn't have to manually get the TLS pointers anymore when writing IPC code. (I am a total noob when it comes to how linkers work so apologies if this is a dumb question)
Doesn't look like it's limited in size at all. It could be if needed, I think. devkitARM could probably declare the IPC command areas as something to be exported. It's possible they already do - do you know what file they're declared in? How would that avoid having to manually get TLS pointers?
I guess I'm assuming that the compiler must be able to figure out how to emit the mrc 15, 0, rX, cr13, cr0, 3
instruction as rarely as possible to get the current thread's TLS pointer and then calculate offsets from it, since it must be doing that already to access normal thread local variables.
do you know what file they're declared in?
I don't think they're declared anywhere, you just call getThreadLocalStorage
or getThreadCommandBuffer
and it uses inline asm to get a pointer to the current thread's local storage.
I guess I should start a separate issue for that feature request somewhere, but I wanted to bring up the possibility of .tbss and .tdata overflowing the IPC buffer while the linker script stuff is still fresh in everyone's mind :)
If anybody manages to overflow the TLS section I will not do anything to help them 😤. (jk, I’m just tired from all the thread related issues)
devkitarm-crtls
version 1.2.4
was released. I've tested the hashmaps
example and my realtime-audio
project and were happy to see the issue was fixed. I'll close this again, but feel free to comment again if other issues appear.
I guess I'm assuming that the compiler must be able to figure out how to emit the
mrc 15, 0, rX, cr13, cr0, 3
instruction as rarely as possible to get the current thread's TLS pointer and then calculate offsets from it, since it must be doing that already to access normal thread local variables.do you know what file they're declared in?
I don't think they're declared anywhere, you just call
getThreadLocalStorage
orgetThreadCommandBuffer
and it uses inline asm to get a pointer to the current thread's local storage.I guess I should start a separate issue for that feature request somewhere, but I wanted to bring up the possibility of .tbss and .tdata overflowing the IPC buffer while the linker script stuff is still fresh in everyone's mind :)
Getting the thread pointer is done with a function defined in libctru somewhere (it's called __aeabi_something_something
I think). How are you thinking that the thread-local variable area might get overflowed? Anything related to their position in memory should be handled by the linker script.
How are you thinking that the thread-local variable area might get overflowed?
According to https://www.3dbrew.org/wiki/Thread_Local_Storage, the TLS area is 0x200 bytes in size, but only the first 0x80 bytes are available for general purpose use. The IPC command area takes up the remaining space. I didn't see anything in the .ld script that @Meziu edited that references either the 0x80 or 0x200 size limits. As I said, I don't really know the details of how the linker works so maybe those limits are declared somewhere else but I can't find them. If they are not defined then I believe you could declare a large thread local object and end up overlapping with the IPC buffer.
Update: I did some testing by declaring a large [u8; 0x84]
array with the thread_local!
macro and filled it with a marker value. I then printed out the first 0x80 bytes of the thread local storage area. Most of them were 0, but there was a pointer in the first few bytes that pointed to my marker array. Adding more thread local variables did not add any more pointers to this area, so I guess that all the thread local variables are stored in a struct on the heap somewhere and then a pointer to that struct is stored in TLS. I guess it's possible that other code somewhere could put things into that first 0x80 bytes but I now see that this is not likely to ever be a problem in practice.
Yeah, exactly, TLS via the thread_local
macro uses the STD TLS default implementation, which actually just holds a single vector in the TLS. Anything specified with #[thread_local]
might give issues, but at least we have alternatives.
Wow, I feel so much better seeing that TLS has been as much of a PITA for you guys as it was for me in KallistiOS for the Sega Dreamcast... I've done a LOT of coding in our kernel and for our toolchains and SDK, but implementing this with verification for all of the corner-cases still sticks out as one of the roughest things I've ever done...
...but now the inner masochist in me wants to see if I can implement the fully dynamic TLS model for loading dynamic libraries at runtime that use thread_local
... Do you guys have support for that? Any plans to support it? Maybe we can hate ourselves together... lmfao.
Wow, I feel so much better seeing that TLS has been as much of a PITA for you guys as it was for me in KallistiOS for the Sega Dreamcast... I've done a LOT of coding in our kernel and for our toolchains and SDK, but implementing this with verification for all of the corner-cases still sticks out as one of the roughest things I've ever done...
A part of me feels like this isn't really a closed issue, but I digress...
...but now the inner masochist in me wants to see if I can implement the fully dynamic TLS model for loading dynamic libraries at runtime that use
thread_local
... Do you guys have support for that? Any plans to support it? Maybe we can hate ourselves together... lmfao.
Well, on the 3DS there is no "official" support for dynamic libraries (with the exception of a few OS-related cases, which aren't easily accessible through homebrew), and while there have been some efforts to dinamically load code (see 3ds_dl) the overall homebrew scene isn't interested in supporting anything that isn't statically linked. Also, SD card read speeds on the 3DS are atrociously slow, so there wouldn't be that much gain either way :smile:.
It does leave me to wonder how such cases might have to be handled though :thinking:.
Creating a hashmap causes the panic
misaligned pointer dereference: address must be a multiple of 0x8 but is ...
. Discovered this while trying to run Meizu's realtime-audio example https://github.com/Meziu/realtime-audio/issues/1This issue and this seem to be related. They're talking about static variables being linked in un-aligned ways. It's a bit over my head but it might be useful context.
I tracked it to the Symphonia library creating a HashMap.
HashMap::new
initializesRandomState
with a static variable insidethread_local!
. This is where the crash happens:/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/collections/hash/map.rs:3127
The minimal test case appears to be "creating a hashmap". Can someone else please run
examples/hashmaps.rs
in debug and confirm that it panics?