Misaligned pointers in thread_local static variables upon creation of a HashMap

DeltaF1 commented 1 year ago

Creating a hashmap causes the panic misaligned pointer dereference: address must be a multiple of 0x8 but is .... Discovered this while trying to run Meizu's realtime-audio example https://github.com/Meziu/realtime-audio/issues/1

This issue and this seem to be related. They're talking about static variables being linked in un-aligned ways. It's a bit over my head but it might be useful context.

I tracked it to the Symphonia library creating a HashMap. HashMap::new initializes RandomState with a static variable inside thread_local!. This is where the crash happens:

/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/collections/hash/map.rs:3127

        KEYS.with(|keys| {
            let (k0, k1) = keys.get();
            keys.set((k0.wrapping_add(1), k1));
            RandomState { k0, k1 }
        })

The minimal test case appears to be "creating a hashmap". Can someone else please run examples/hashmaps.rs in debug and confirm that it panics?

Meziu commented 1 year ago

Alignment issues on thread local variables... where have I heard this before?

Still, this looks like a pretty complex situation, most likely relating back to the TLS code in libctru.

@ian-h-chamberlain since you made those changes, can you run similar tests again? It looks like nothing changed in the mean time regarding TLS in libctru, so it's very possible we've always used misaligned pointers (and I wonder if that has anything to do with corrupted TLS destructors)...

ian-h-chamberlain commented 1 year ago

oh no, hopefully this isn't another case of similar shenanigans, that one took ages to track down...

I am able to reproduce using the hashmaps example, once I updated my toolchain to something recent enough. I'll see if I can figure out why the pointer is misaligned and go from there.

Edit: ok, so far this is the smallest reproduction I've found. I'm trying to inline all the calls there and see if I can narrow it down further but haven't been able to get it so far:

fn main() {
    ctru::use_panic_handler();
    let _s = std::collections::hash_map::RandomState::new();
}

It does seem important that the ctru::use_panic_handler is called, so I wonder if there is some additional thread-local arrangement happening in std::panic::set_hook or something...

DeltaF1 commented 1 year ago

I just tested without use_panic_handler and using the debugger I could see it call the same misaligned pointer panic, so it may not be related.

Meziu commented 1 year ago

Yeah, most likely the panic happens regardless but the application exits normally. That’s one of the nice features of our toolchain.

ian-h-chamberlain commented 1 year ago

I just tested without use_panic_handler and using the debugger I could see it call the same misaligned pointer panic, so it may not be related.

Hmm, that's interesting - in my case I was also checking with the debugger (with a breakpoint on rust_panic) but the app exited instead of hitting my breakpoint... I wonder what the difference there is? I was mostly using Citra instead of real hardware but I wouldn't think that should make much difference, especially since I did reproduce it when using the custom panic hook.

The next thing I haven't looked at yet is actually examining the thread local section of the executable, which might provide some clues why the TLS is misaligned. I've been busy for the last few weeks but should have some time to dig in this week.

DeltaF1 commented 1 year ago

I recommend putting a breakpoint in core/src/panicking.rs Line ~169 panic_misaligned_pointer_dereference, that's where it's ending up for me (only tested on hardware just now but I was using Citra to debug when I first opened this issue, so I think they're both the same).

ian-h-chamberlain commented 1 year ago

Ok, I figured out that in the case where I wasn't reproducing the issue, it was because my "custom" thread-local was aligned correctly, but the one in std is misaligned:

Edit: I might be mistaken about this and it's just a coincidence, need to re-learn some stuff about TLS and ELF structure I think :sweat_smile:

broken case (using arm-none-eabi-objdump --demangle --syms -s --section=.tbss ../target/armv6k-nintendo-3ds/debug/examples/thread-local-minimal.elf):

../target/armv6k-nintendo-3ds/debug/examples/thread-local-minimal.elf:     file format elf32-littlearm

SYMBOL TABLE:
0023da30 l    d  .tbss  00000000 .tbss
00000c20 l       .tbss  00000001 std::panicking::panic_count::LOCAL_PANIC_COUNT::__getit::STATE
00000c1c l       .tbss  00000004 std::panicking::panic_count::LOCAL_PANIC_COUNT::__getit::VAL
00000c24 l       .tbss  0000000c std::io::stdio::OUTPUT_CAPTURE::__getit::__KEY
00000c31 l       .tbss  00000001 std::sync::remutex::current_thread_unique_ptr::X::__getit::STATE
00000c30 l       .tbss  00000001 std::sync::remutex::current_thread_unique_ptr::X::__getit::VAL
00000c32 l       .tbss  00000001 std::sys_common::thread_info::THREAD_INFO::__getit::STATE
0023da30 g       .tbss  00000000 __preinit_array_start
00000014 g       .tbss  00000802 __ctru_dev_utf16_buf
00000818 g       .tbss  00000401 __ctru_dev_path_buf
00000c34 g       .tbss  00000020 std::collections::hash::map::RandomState::new::KEYS::__getit::__KEY
0023da30 g       .tbss  00000000 __preinit_array_end

Note that the location of RandomState::new::KEYS::__getit::__KEY is not aligned to 8 (which I think is the expected alignment for this thread_local<Cell<(u64, u64)>>).

When I try to inline all the code of RandomState, things are ordered differently and the alignment works out:

../target/armv6k-nintendo-3ds/debug/examples/thread-local-minimal.elf:     file format elf32-littlearm

SYMBOL TABLE:
0023da2c l    d  .tbss  00000000 .tbss
00000018 l       .tbss  00000020 thread_local_minimal::main::KEYS::__getit::__KEY
00000c44 l       .tbss  00000001 std::panicking::panic_count::LOCAL_PANIC_COUNT::__getit::STATE
00000c40 l       .tbss  00000004 std::panicking::panic_count::LOCAL_PANIC_COUNT::__getit::VAL
00000c48 l       .tbss  0000000c std::io::stdio::OUTPUT_CAPTURE::__getit::__KEY
00000c55 l       .tbss  00000001 std::sync::remutex::current_thread_unique_ptr::X::__getit::STATE
00000c54 l       .tbss  00000001 std::sync::remutex::current_thread_unique_ptr::X::__getit::VAL
00000c56 l       .tbss  00000001 std::sys_common::thread_info::THREAD_INFO::__getit::STATE
0023da2c g       .tbss  00000000 __preinit_array_start
00000038 g       .tbss  00000802 __ctru_dev_utf16_buf
0000083c g       .tbss  00000401 __ctru_dev_path_buf

Still trying to figure out why it's begin generated with the wrong offset like that and how to fix it, but at first glance I don't think it's totally related to the original TLS alignment issues we had, but I can't be sure yet.

ian-h-chamberlain commented 1 year ago

Good news! I finally made some progress here and I think I've identified a solution, which will have to get delivered upstream to https://github.com/devkitPro/devkitarm-crtls

I'll try to have a write-up and post a PR tomorrow, but I will definitely need to get some input from the devkitPro folks since the change I have planned could affect other users too. Will post updates here as it develops.

Meziu commented 1 year ago

Closed as result of the upstream changes in https://github.com/devkitPro/devkitarm-crtls/pull/7.

@ian-h-chamberlain you're incredible! The whole homebrew community has benefited from your investigations 🎉

patataofcourse commented 11 months ago

sorry for reviving a dead issue but i don't know if this is worth opening a new one or if it's a weird edge-case of this issue i have no idea how to describe this because i have no clue about how thread_local works but here's what i got i can't figure out how to get backtrace to properly work here (symbols weren't showing up for some reason) and i really don't want to manually look up symbols in nm output, but if it's necessary i can dig, all i know is that directly before main is std::rt::lang_start, but if the backtrace symbols would help here i'll get em

project this is happening is https://github.com/patataofcourse/Barista (project is a big mess though and i assume stinks for reproducibility), and this error apparently only happens when compiling on certain machines, but that might have changed since last time i tried...? also, i know that nightly 2022-02-14 used to work but since updating ctru-rs, nightlies that old don't work

again if there's anything i can do lmk, and my discord is under the same username as here (patataofcourse) if anyone who can help prefers a more direct method of communication thanks in advance! ^^'

Meziu commented 11 months ago

project this is happening is https://github.com/patataofcourse/Barista (project is a big mess though and i assume stinks for reproducibility), and this error apparently only happens when compiling on certain machines, but that might have changed since last time i tried...? also, i know that nightly 2022-02-14 used to work but since updating ctru-rs, nightlies that old don't work

This issue was resolved as a change in the upstream packages released by @devkitPro. If it is the same problem that was reported here (as it looks for how you describe it) then the fix should come from updating the dkA packages.

dkp-pacman -Syu

If this alone doesn’t cut it, try updating and refreshing the whole env:

rustup update
cargo clean
cargo update

patataofcourse commented 11 months ago

refreshing the whole env didn't work either, and it's not the first time i've tried to refresh it either

the devkitarm-crtls package is also up-to-date:

$ pacman -Q devkitarm-crtls
devkitarm-crtls 1.2.2-1

Meziu commented 11 months ago

I've tried running your project on my New 3DS XL using the latest nightly, and I can't seem to find the error you are talking about. Please give me more information for me to help you, since reproducibility seems quite hard.

Also, I've found an ARM panic here: https://github.com/patataofcourse/Barista/blob/2b4c4f598aba6bd21f5bfe580298539cc5f151f9/src/format/saltwater_cfg.rs#L21

I don't know whether or not you have been able to write files to the SD card using ctru::fs, but I HEAVILY suggest you to not use that module, as it is broken, unstable and incredibly outdated (almost all code dates back to before the project rewrite/reconstruction I started years ago). You should use std::fs when working with the SD card and the ROMFS, that module is assured to work and offers full compatibility.

Regardless, with a bit of code cutting I managed to get to the UI stage just fine. Is there any way to reproduce the error I could use?

patataofcourse commented 11 months ago

ah crap, i'm testing on Citra so the ARM panic is probably ignored by it. i'll change to std::fs but it'd be great if you guys could deprecate ctru::fs so that it being broken wouldn't have to be transmitted via word-of-mouth i'll fix the error you mentioned, and, well, i honestly have no clue what could be causing the misalignment panic could you try using my makefile to see if the issue's caused by that, as well as testing the project on citra?

thank you for all the help btw it means a lot

Meziu commented 11 months ago

ah crap, i'm testing on Citra so the ARM panic is probably ignored by it. i'll change to std::fs but it'd be great if you guys could deprecate ctru::fs so that it being broken wouldn't have to be transmitted via word-of-mouth

Yeah, I’m sorry about that. Sadly other things took priority over it (especially since the std::fs has always been working since the std implementation), so I wasn’t able to change it. I should’ve at least put a warning, but I guess I just didn’t really expect people to use it 😅.

i'll fix the error you mentioned, and, well, i honestly have no clue what could be causing the misalignment panic could you try using my makefile to see if the issue's caused by that, as well as testing the project on citra?

The tests I ran were built using your own makefile (especially since it was needed to compile and use those other C libraries that I didn’t look into). As for Citra, well, we are working to use it as a quick way to test our code, but the general rule of thumb in the homebrew community is that it’s not reliable enough against to testing on real hardware. Regardless, I’ll try it if I have the chance, but I can already tell that it would be quite weird if that was actually the problem (in this case). Don’t you have any hardware to test with?

patataofcourse commented 11 months ago

The tests I ran were built using your own makefile (especially since it was needed to compile and use those other C libraries that I didn’t look into).

oops i forgot that was a thing, makes sense you'd need to use the makefile then

As for Citra, well, we are working to use it as a quick way to test our code, but the general rule of thumb in the homebrew community is that it’s not reliable enough against to testing on real hardware. Regardless, I’ll try it if I have the chance, but I can already tell that it would be quite weird if that was actually the problem (in this case). Don’t you have any hardware to test with?

yeah, that is very fair, however i can't always test on hardware, especially when i'm not home. usual rule of thumb in the modding community i'm in is: do quick tests on Citra, and a final test on hardware to see if anything happens there, especially since usually you have to make a bunch of consecutive quick tests in case something is slightly off. usually there's not errors that are only on Citra, usually it's the opposite, but of course that's on modded commercial games, not homebrew.

either way, i'll remove every instance of ctru::fs and use std::fs instead, and I'll let you know if that fixes anything

Meziu commented 11 months ago

I've tested with Citra, no errors in sight. 🤷‍♂️

patataofcourse commented 11 months ago

removed any ctru::fs references, still the same error

I've tested with Citra, no errors in sight. 🤷‍♂️

okay that's very strange. mind if i send a pre-built .3dsx and .elf in case it's something weird in my rust setup? i'm on nightly 2023-09-27 btw, that should be latest

Meziu commented 11 months ago

okay that's very strange. mind if i send a pre-built .3dsx and .elf in case it's something weird in my rust setup?

As I said earlier, the most likely cause is not an outdated Rust toolchain, as it may be an outdated dkARM (or an unproperly set one?). I know you've checked the crtls versions earlier, but there isn't much room for possibilities at this point.

Send me the file however you want, you can also contact me on discord (username "meziu"), though I have to stop working on this for today.

Meziu commented 11 months ago

It's true, even the hashmaps example doesn't work properly anymore in debug mode. (thanks for noticing @patataofcourse). It seems like this issue just won't stop coming back to haunt us, so we've got to play cat and mouse again to see what's the problem now.

My panic info:

misaligned pointer dereference: address must be a multiple of 0x8 but is 0x2afb2c

@ian-h-chamberlain I can't ask you to look into this again, since you've done more than enough and I'm sure you have other things to do, but could you please share your minimal-reproducibility tests (if you still have them) so I can try making sense of this myself?

patataofcourse commented 11 months ago

what was the issue that was causing my and your builds to be different?

ian-h-chamberlain commented 11 months ago

Ok, let's see... The first thing to test is probably the C program in https://github.com/rust3ds/ctru-rs/issues/60#issuecomment-1150529440 — that could point out if there was a regression in devkitARM. We might also be able to check some of the test cases listed here which were mentioned in https://github.com/KallistiOS/KallistiOS/pull/111

Those might at least eliminate the Rust toolchain as an option if they also reproduce an issue, and possibly point to a DKP regression. If not, then it sounds like the hashmap example is another good place to start. This tiny program was enough to reproduce before: https://github.com/rust3ds/ctru-rs/issues/124#issuecomment-1557427224

A lot of what I did to debug previously was dumping the addresses of various symbols to see if the alignment looked correct or not, e.g. something like this:

$ arm-none-eabi-readelf -W --demangle --syms target/armv6k-nintendo-3ds/debug/examples/thread-local-minimal.elf | awk 'NR == 3 || /BUF|tdata|tbss|tls|TLS/'

Symbol table '.symtab' contains 34409 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
     0: 00000000     0 NOTYPE  LOCAL  DEFAULT  UND 
    10: 0029ebf4     0 SECTION LOCAL  DEFAULT   15 .tdata
    11: 0029ebf8     0 SECTION LOCAL  DEFAULT   16 .tbss
  1346: 00000c08    12 TLS     LOCAL  DEFAULT   16 pthread_3ds::thread_keys::LOCALS
  3454: 00000c14     1 TLS     LOCAL  DEFAULT   16 std::sync::remutex::current_thread_unique_ptr::X::__getit::VAL
  3455: 00000c15     1 TLS     LOCAL  DEFAULT   16 std::sync::remutex::current_thread_unique_ptr::X::__getit::STATE
  6245: 00000c18     8 TLS     LOCAL  DEFAULT   16 std::panicking::panic_count::LOCAL_PANIC_COUNT::__getit::VAL
  6246: 00000c20     1 TLS     LOCAL  DEFAULT   16 std::panicking::panic_count::LOCAL_PANIC_COUNT::__getit::STATE
  9991: 00000c24    12 TLS     LOCAL  DEFAULT   16 std::io::stdio::OUTPUT_CAPTURE::__getit::__KEY
 10259: 00000c30    16 TLS     LOCAL  DEFAULT   16 std::sys_common::thread_info::THREAD_INFO::__getit::VAL
 10260: 00000c40     1 TLS     LOCAL  DEFAULT   16 std::sys_common::thread_info::THREAD_INFO::__getit::STATE
 14997: 00000000     0 TLS     LOCAL  DEFAULT   16 $d
 14998: 00000804     0 TLS     LOCAL  DEFAULT   16 $d
 15001: 00000000     0 TLS     LOCAL  DEFAULT   16 _TLS_MODULE_BASE_
 26733: 0029ebf4     0 NOTYPE  GLOBAL DEFAULT   15 __tdata_lma
 27665: 00000000  2050 TLS     GLOBAL DEFAULT   16 __ctru_dev_utf16_buf
 27813: 002a1b58     0 NOTYPE  GLOBAL DEFAULT   21 __tls_end
 29308: 00000804  1025 TLS     GLOBAL DEFAULT   16 __ctru_dev_path_buf
 29816: 00000c48    32 TLS     GLOBAL DEFAULT   16 std::collections::hash::map::RandomState::new::KEYS::__getit::__KEY
 30542: 00287940     0 NOTYPE  GLOBAL DEFAULT    3 __tdata_align
 32757: 002a0eec     0 NOTYPE  GLOBAL DEFAULT   21 __tls_start
 33555: 0029ebf4     0 NOTYPE  GLOBAL DEFAULT   15 __tdata_lma_end

Usually std::collections::hash::map::RandomState::new::KEYS::__getit::__KEY was the problematic one when trying to do stuff with RandomState. It should always have alignment of 0x8 or higher, I think (as mentioned in the panic message) but sometimes I would see it with an address (Value column) of 0x....4 which was a clue.

Man, it's crazy how this issue keeps coming back, we can't get a break! :sob: As you alluded to @Meziu I will probably take a back seat to debugging this time around as I'd like to focus on other areas of the project, but I can try to help if others get stuck.

I wonder... we should probably see if we can add some kind of test for this as part of #135 so we know if it happens again. I've been working on another push for that so I'm hoping to have something ready fairly soon! I think one of our tests builds a hashmap so maybe it would already catch this, but I think it's also kind of dependent on compiler optimizations so maybe an extra integration test or something wouldn't hurt.

Jhynjhiruu commented 8 months ago

Hi there, I've recently run into this issue, and I think I might know why it's happening. I'm writing this at 7am not having slept, so it's possible I'm not quite aligned with reality, but: The linker script 3dsx.ld defines the symbol __tls_start, which as far as I can tell is used as the starting point for thread-local storage. Previously, you fixed an issue with alignof(.tbss) not being included for size calculations; however, I think this misses an important bug. Again, I'm not sure I'm entirely lucid writing this, so definitely do check this - and I haven't read all the relevant code, either - but I'm fairly sure that, if .tdata has size 0 (and perhaps in other situations, but .tdata being 0 seems to be the trigger in my code), .tbss ends up being placed exactly at __tls_start. __tls_start isn't necessarily aligned to alignof(.tbss). I'm sure you see the issue. Luckily, it's quite a simple fix. Simply change line 162 of 3dsx.ld to . = 8 + ABSOLUTE(ALIGN(ABSOLUTE(. - 8), MAX(ALIGNOF(.tdata), ALIGNOF(.tbss))));. From my very basic testing (compiling my project and seeing if it crashes), this seems to work. Perhaps there's a smarter solution involving aligning .tbss manually on load, but this seems a bit easier and it's unlikely to waste more than a few bytes.

Meziu commented 8 months ago

Hi there, I've recently run into this issue, and I think I might know why it's happening. I'm writing this at 7am not having slept, so it's possible I'm not quite aligned with reality, but: The linker script 3dsx.ld defines the symbol __tls_start, which as far as I can tell is used as the starting point for thread-local storage. Previously, you fixed an issue with alignof(.tbss) not being included for size calculations; however, I think this misses an important bug. Again, I'm not sure I'm entirely lucid writing this, so definitely do check this - and I haven't read all the relevant code, either - but I'm fairly sure that, if .tdata has size 0 (and perhaps in other situations, but .tdata being 0 seems to be the trigger in my code), .tbss ends up being placed exactly at __tls_start. __tls_start isn't necessarily aligned to alignof(.tbss). I'm sure you see the issue. Luckily, it's quite a simple fix. Simply change line 162 of 3dsx.ld to . = 8 + ABSOLUTE(ALIGN(ABSOLUTE(. - 8), MAX(ALIGNOF(.tdata), ALIGNOF(.tbss))));. From my very basic testing (compiling my project and seeing if it crashes), this seems to work. Perhaps there's a smarter solution involving aligning .tbss manually on load, but this seems a bit easier and it's unlikely to waste more than a few bytes.

Thank you for your research on the topic! I believe the ".tbss exactly at .__tls_start" was the original issue for some other quirks of this whole situation, so the conclusions don't seem that farfetched to me :sweat_smile:. However, the solution you proposed would also impact the usage of the whole C toolchain provided by devkitPro, so I'm going to need to analyze the situation deeper (to make sure it's the right call) before making the proposal for this change directly to them.

I would greatly appreciate if you were to delve into the problem more than you already did, but I don't want you to put your health at risk more than it seems you are doing already :laughing:. I will look into it when I have some time :+1:.

Jhynjhiruu commented 8 months ago

I couldn't help myself, and I looked a bit deeper into it. I'm still trying to wrap my head around exactly what the thread initialisation code does. However, I think I might actually be massively overcomplicating it - I'm pretty sure we're interested in what happens in the main thread, i.e. the one that we're running in when main() is called. The libctru initialisation process is spread over quite a few files. As far as I can tell, 3dsx_ctr0.o comes from smealum and isn't actually in the libctru GitHub repo. Just after clearing .bss, it calls initSystem, which is defined in libctru/source/system/stack_adjust.s. That function then calls __libctru_init, which finally is some C code, defined in libctru/source/system/ctru_init.c. That function initialises a few different things, but one of the functions it calls is __system_initSyscalls. No idea why it's called that, since it doesn't seem to initialise any syscalls. Eh. __system_initSyscalls is defined in libctru/source/system/syscalls.c. The first thing it does is call initThreadVars with NULL. When that function is called with NULL, it sets the current thread (that's the main thread at this point)'s thread-local storage directly to __tls_start. As far as I can tell without digging deep into LLVM internals, all accesses to thread-local storage, both .tdata and .tbss, are done by calling __aeabi_read_tp (libctru/source/system/readtp.s - it reads the pointer set in __system_initSyscalls) and reading/writing relative to the returned pointer. I can't see any offsets being added to the pointer to deal with .tbss alignment, so, again, __tls_start is being used as the start of .tbss incorrectly. I can't figure out how this is supposed to work, actually - is the compiler supposed to know where .tbss is relative to .tdata/__tls_start and emit an add somewhere???? I have no idea how this is supposed to be working. Rust is little help, since the call to __aeabi_read_tp is buried somewhere deep within LLVM. Clearly, there's supposed to be some alignment happening, but I just haven't the foggiest where it's supposed to go. My workaround feels like it shouldn't work, now - what happens if .tdata isn't size 0 and .tbss needs to be aligned after it? - so please, please look into this if you can. I'm out of my depth at this point, I think.

Meziu commented 8 months ago

I've officially decided to waste my sanity trying to solve this issue, since ctru-rs is currently in a pretty nice state when it comes to supported features.

\~Deep breath\~... Ok, let's do this one more time.

I can't dedicate this problem much time right now, but I've started with some basic testing that was very much needed.

I've tried re-running the original C code we used to track down the issue for #60, and it yielded no bug reproduction. Not with -Og, not with -O0, not with -O3. Nothing reproduced it, so this likely isn't due to a dkP regression. Maybe Rust now makes even stricter checks when reading memory?
I've tried re-running the hashmaps example and my realtime-audio code in debug mode. Both yield a new non-unwinding panic message: unsafe precondition(s) violated: ptr::read requires that the pointer argument is aligned and non-null. I believe this error message is 1:1 to the one it yielded before, but it might've been recently added to the compiler. (the fact that it doesn't specify the actually correct alignment is... unsettling. We know it should be 0x8 though)
Just for the sake of trying, I've started running the hashmaps example with a whole lot of different nightly Rust toolchains too see whether the issue was present before/after it was reported and to see whether the current ctru-rs and devkitARM combination caused issues in "previously working" nightlies, but nothing I didn't already know popped up. The bug didn't reproduce for all of August, until in September some nightly started re-showing the message. Not too long ago the message was changed into what it is today. Still, nothing in std or in the general inner workings for TLS variables seems to have changed since, so it's hard to pinpoint Rust as the culprit.

So, from this first round of testing (and the research done by @Jhynjhiruu) I come to the conclusion that neither Rust nor devkitPro did anything that caused this behaviour. There is something fundamentally wrong (most likely in the whole setup handled by dkA) that just needs to be tweaked. I'm still interested in understanding what was that made the issue re-appear, though.

Jhynjhiruu commented 8 months ago

Can you give an example of a nightly version that doesn't reproduce the crash? I have a suspicion about why that I'd like to confirm.

Meziu commented 8 months ago

Can you give an example of a nightly version that doesn't reproduce the crash? I have a suspicion about why that I'd like to confirm.

I’m not home, but I definitely used “2023-09-25” that worked. I can’t really tell when they stop working, since the process was taking a lot of time and I couldn’t really “binary-search” my way to the exact date.

Jhynjhiruu commented 8 months ago

Agh. I hate threads. Look at this bit of assembly, generated by GCC from (a slightly modified version of) the C code from that other issue:

    bl  __aeabi_read_tp @ load_tp_soft
    mov r3, r0
    ldr r2, .L13
    mov r1, #0
    strb    r1, [r3, r2]
    ...
.L13:
    .word   BUF_16(tpoff)

This corresponds to this line of C code:

BUF_16.inner[0] = 0;

Fairly simply, this loads the address of the thread-local storage using __aeabi_read_tp, then reads the word at .L13, adds it to the thread-local storage pointer, and stores a 0 there. Makes sense. The word at .L13 is the offset of BUF_16 in a thread-local block. It's encoded in an ELF file as some offset and then an R_ARM_TLS_LE32 relocation. R_ARM_TLS_LE32 relocations are calculated as (S + A) - tp, where S is the symbol address, A is the addend (0 in our case) and tp is the thread-local storage pointer. I've read through as many relevant specifications as I can, and as far as I can tell, the way that the symbol address relative to the thread-local storage pointer - i.e. the offset of the symbol in the thread-local storage block - is calculated isn't specified anywhere. I definitely read something saying that the .tdata segment comes first, followed by the .tbss section with appropriate alignment, but I can't find that any more. Either way, the bug comes down to the way the symbol address in the thread-local storage block is calculated, i.e. the bit of code that deals with the tpoff annotation in the assembly. I wasn't able to find the relevant code in either GNU binutils or LLVM source, but I suspect that's the place you'll want to look. I imagine that you'll want to look for where ARM relocations are handled. I suspect that the implementation is making the assumption that the TLS block is aligned to max(.tbss, .tdata). I don't know whether that's a valid assumption to make - I can't find it in the spec anywhere. If that assumption actually is backed up by the spec, then my linker script modification is actually the correct way to fix this issue. If it isn't, then, well, it works while the bug is fixed.

ian-h-chamberlain commented 8 months ago

I've read through as many relevant specifications as I can, and as far as I can tell, the way that the symbol address relative to the thread-local storage pointer - i.e. the offset of the symbol in the thread-local storage block - is calculated isn't specified anywhere. I definitely read something saying that the .tdata segment comes first, followed by the .tbss section with appropriate alignment, but I can't find that any more.

Yes, I believe this is up to the compiler (or maybe actually the linker?), to allow for optimizations with alignment, packing etc. I have anecdotally found that gcc from the devkitPro toolchain tends to lay the variables out in declaration order (maybe this is required by the C standard or something), while rustc is a bit more flexible (and, since the standard library includes a number of thread-locals, it can be a bit harder to isolate local changes unless you use #[no_std]).

Either way, the bug comes down to the way the symbol address in the thread-local storage block is calculated, i.e. the bit of code that deals with the tpoff annotation in the assembly. I wasn't able to find the relevant code in either GNU binutils or LLVM source, but I suspect that's the place you'll want to look. I imagine that you'll want to look for where ARM relocations are handled. I suspect that the implementation is making the assumption that the TLS block is aligned to max(.tbss, .tdata). I don't know whether that's a valid assumption to make - I can't find it in the spec anywhere.

Yeah, when I was working on the earlier bugfix it seemed like the best way to check on the output was to actually examine the numerical value of e.g. .L13 in your example above.

I haven't had a chance to look at this issue or your investigation too in-depth, but offhand I think your proposed solution to the linker script seems reasonable and not too likely to break anything. If you wanted to put it to a test, some of the Dreamcast homebrew folks have made a test for similar issues that might be useful to adapt, or you can also just keep trying with the same code from that other issue. https://github.com/KallistiOS/KallistiOS/pull/111/files#diff-bcbd57e8ac3c78995f3d91366794cff6a15d98784b172f07c14538878eefd776

It definitely seems to me like wasting a few bytes in favor of correctness would be worth it, in any case.

If I find some time soon I can try to look again in more detail, but it seems like you both have already done a pretty thorough investigation, and it also sounds like there is a proposed code change that might fix it, so that's good! Here's hoping this is the last bug of its kind because these have been brutal each time!

Jhynjhiruu commented 6 months ago

Hopefully (🤞) that devkitPro PR will fix this issue for good. Here's hoping this issue can be closed after less than a year...

DeltaF1 commented 6 months ago

This is only tangentially related to the issue at hand, but I wanted to ask in case of future problems: is the total size of tbss and tdata kept under 128 bytes (0x80) by the linker? I'm just wondering if there could be a case where so many thread local variables are declared in C or Rust that they overflow into the IPC area. Going even further, could devkitpro explicitly declare the IPC command areas as sections/symbols that can be exported to C/Rust code? Then we wouldn't have to manually get the TLS pointers anymore when writing IPC code. (I am a total noob when it comes to how linkers work so apologies if this is a dumb question)

Jhynjhiruu commented 6 months ago

This is only tangentially related to the issue at hand, but I wanted to ask in case of future problems: is the total size of tbss and tdata kept under 128 bytes (0x80) by the linker? I'm just wondering if there could be a case where so many thread local variables are declared in C or Rust that they overflow into the IPC area. Going even further, could devkitpro explicitly declare the IPC command areas as sections/symbols that can be exported to C/Rust code? Then we wouldn't have to manually get the TLS pointers anymore when writing IPC code. (I am a total noob when it comes to how linkers work so apologies if this is a dumb question)

Doesn't look like it's limited in size at all. It could be if needed, I think. devkitARM could probably declare the IPC command areas as something to be exported. It's possible they already do - do you know what file they're declared in? How would that avoid having to manually get TLS pointers?

DeltaF1 commented 6 months ago

I guess I'm assuming that the compiler must be able to figure out how to emit the mrc 15, 0, rX, cr13, cr0, 3 instruction as rarely as possible to get the current thread's TLS pointer and then calculate offsets from it, since it must be doing that already to access normal thread local variables.

do you know what file they're declared in?

I don't think they're declared anywhere, you just call getThreadLocalStorage or getThreadCommandBuffer and it uses inline asm to get a pointer to the current thread's local storage.

I guess I should start a separate issue for that feature request somewhere, but I wanted to bring up the possibility of .tbss and .tdata overflowing the IPC buffer while the linker script stuff is still fresh in everyone's mind :)

Meziu commented 6 months ago

If anybody manages to overflow the TLS section I will not do anything to help them 😤. (jk, I’m just tired from all the thread related issues)

Meziu commented 6 months ago

devkitarm-crtls version 1.2.4 was released. I've tested the hashmaps example and my realtime-audio project and were happy to see the issue was fixed. I'll close this again, but feel free to comment again if other issues appear.

Jhynjhiruu commented 6 months ago

I guess I'm assuming that the compiler must be able to figure out how to emit the mrc 15, 0, rX, cr13, cr0, 3 instruction as rarely as possible to get the current thread's TLS pointer and then calculate offsets from it, since it must be doing that already to access normal thread local variables.

do you know what file they're declared in?

I don't think they're declared anywhere, you just call getThreadLocalStorage or getThreadCommandBuffer and it uses inline asm to get a pointer to the current thread's local storage.

I guess I should start a separate issue for that feature request somewhere, but I wanted to bring up the possibility of .tbss and .tdata overflowing the IPC buffer while the linker script stuff is still fresh in everyone's mind :)

Getting the thread pointer is done with a function defined in libctru somewhere (it's called __aeabi_something_something I think). How are you thinking that the thread-local variable area might get overflowed? Anything related to their position in memory should be handled by the linker script.

DeltaF1 commented 6 months ago

How are you thinking that the thread-local variable area might get overflowed?

According to https://www.3dbrew.org/wiki/Thread_Local_Storage, the TLS area is 0x200 bytes in size, but only the first 0x80 bytes are available for general purpose use. The IPC command area takes up the remaining space. I didn't see anything in the .ld script that @Meziu edited that references either the 0x80 or 0x200 size limits. As I said, I don't really know the details of how the linker works so maybe those limits are declared somewhere else but I can't find them. If they are not defined then I believe you could declare a large thread local object and end up overlapping with the IPC buffer.

Update: I did some testing by declaring a large [u8; 0x84] array with the thread_local! macro and filled it with a marker value. I then printed out the first 0x80 bytes of the thread local storage area. Most of them were 0, but there was a pointer in the first few bytes that pointed to my marker array. Adding more thread local variables did not add any more pointers to this area, so I guess that all the thread local variables are stored in a struct on the heap somewhere and then a pointer to that struct is stored in TLS. I guess it's possible that other code somewhere could put things into that first 0x80 bytes but I now see that this is not likely to ever be a problem in practice.

Meziu commented 6 months ago

Yeah, exactly, TLS via the thread_local macro uses the STD TLS default implementation, which actually just holds a single vector in the TLS. Anything specified with #[thread_local] might give issues, but at least we have alternatives.

gyrovorbis commented 3 months ago

Wow, I feel so much better seeing that TLS has been as much of a PITA for you guys as it was for me in KallistiOS for the Sega Dreamcast... I've done a LOT of coding in our kernel and for our toolchains and SDK, but implementing this with verification for all of the corner-cases still sticks out as one of the roughest things I've ever done...

...but now the inner masochist in me wants to see if I can implement the fully dynamic TLS model for loading dynamic libraries at runtime that use thread_local... Do you guys have support for that? Any plans to support it? Maybe we can hate ourselves together... lmfao.

Meziu commented 3 months ago

Wow, I feel so much better seeing that TLS has been as much of a PITA for you guys as it was for me in KallistiOS for the Sega Dreamcast... I've done a LOT of coding in our kernel and for our toolchains and SDK, but implementing this with verification for all of the corner-cases still sticks out as one of the roughest things I've ever done...

A part of me feels like this isn't really a closed issue, but I digress...

...but now the inner masochist in me wants to see if I can implement the fully dynamic TLS model for loading dynamic libraries at runtime that use thread_local... Do you guys have support for that? Any plans to support it? Maybe we can hate ourselves together... lmfao.

Well, on the 3DS there is no "official" support for dynamic libraries (with the exception of a few OS-related cases, which aren't easily accessible through homebrew), and while there have been some efforts to dinamically load code (see 3ds_dl) the overall homebrew scene isn't interested in supporting anything that isn't statically linked. Also, SD card read speeds on the 3DS are atrociously slow, so there wouldn't be that much gain either way :smile:.

It does leave me to wonder how such cases might have to be handled though :thinking:.

rust3ds / ctru-rs

Misaligned pointers in thread_local static variables upon creation of a HashMap #124