rust3ds / ctru-rs

Rust wrapper for libctru
https://rust3ds.github.io/ctru-rs/

Rapier Physics example blocks runtime init #60

Closed Meziu closed 2 years ago

Meziu commented 2 years ago

Continuing off #56 discussions. @ian-h-chamberlain @AzureMarker

The new example in the test/rapier-physics branch doesn’t run correctly.

While the panic itself (which catches a potential ARM exception) is situated in the thread_info::set function in std, the problem only appears when a specific function call from the rapier2d crate is present, and only at opt-level = 0, in either release or debug mode.

We can thus speculate it’s either an issue with optimisations during the compilation process, a stack fault, or an overflow in the thread_local storage.
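For anyone trying to reproduce: since the crash only shows up at opt-level = 0, the failing configuration can be pinned per profile (a minimal Cargo.toml sketch, assuming the example is built as part of a normal cargo project):

```toml
# Reproduce the failing configuration: no optimizations, even in release mode.
[profile.release]
opt-level = 0

[profile.dev]
opt-level = 0   # the default for dev, stated explicitly
```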

Meziu commented 2 years ago

@ian-h-chamberlain about the last message you sent in #56, I want to point out that it actually would cause a segmentation fault if it weren’t for a check in the code (rtassert!) that panics the program when it sees the bad initialisation. I’ve tried removing the check, and it throws an ARM exception on the very next line, so the error itself doesn’t have a “different nature”.

AzureMarker commented 2 years ago

I don't think it's possible for it to be a stack issue, right? This is happening before execution even reaches main (and the stack at that point shouldn't be affected by anything main and beyond does).

Meziu commented 2 years ago

I don't think it's possible for it to be a stack issue, right? This is happening before execution even reaches main (and the stack at that point shouldn't be affected by anything main and beyond does).

Yeah, but I have no idea how a non-executed line could propagate an error all the way into std, so I was just making hypotheses.

ian-h-chamberlain commented 2 years ago

Just for everyone's reference, I made a repo for the method I used to debug the regex case: https://github.com/ian-h-chamberlain/3ds-codegen-crash-repro

It's kind of a brain dump of notes but maybe you'll find it useful. I'm going to see if I can reuse it on this example as well.

Edit: also, I'm running on macOS, so it's possible some parts of the bisect script aren't 100% portable; it also relies on a prebuilt std for the 3DS instead of -Zbuild-std.

ian-h-chamberlain commented 2 years ago

Okay, I was able to bisect down to a single optimization pass on my system, using this code:

use std::time::Duration;

use ctru::console::Console;
use ctru::gfx::Gfx;

use rapier2d::prelude::*;

fn main() {
    ctru::init();

    let gfx = Gfx::init().expect("Couldn't obtain GFX controller");
    let _console = Console::init(gfx.top_screen.borrow_mut());

    let mut collider_set = ColliderSet::new();

    /* Create the ground. */
    let collider = ColliderBuilder::cuboid(100.0, 0.1).build();
    collider_set.insert(collider);

    println!("Done!");
    gfx.flush_buffers();
    gfx.swap_buffers();

    // So we can see the program got past the crash
    std::thread::sleep(Duration::from_secs(2));
}
Attempting 37953 passes
+ RUSTFLAGS='-C opt-level=1 -C debuginfo=0 -C codegen-units=1 -C lto=off -C llvm-args=-opt-bisect-limit=37953'
+ cargo -v rustc --target armv6k-nintendo-3ds --target-dir=target-repro
Sending crash-repro.3dsx, 567352 bytes
266001 sent (46.88%), 24 blocks
Did it crash? [y/n] n
Attempting 37952 passes
+ RUSTFLAGS='-C opt-level=1 -C debuginfo=0 -C codegen-units=1 -C lto=off -C llvm-args=-opt-bisect-limit=37952'
+ cargo -v rustc --target armv6k-nintendo-3ds --target-dir=target-repro
Sending crash-repro.3dsx, 588584 bytes
275711 sent (46.84%), 24 blocks
Did it crash? [y/n] y
Final pass was: 37952

These are the candidate passes that actually had that number associated with them:

BISECT: running pass (37953) ARM Instruction Selection on function (<alloc::sync::Weak<T> as core::ops::drop::Drop>::drop)
BISECT: running pass (37953) SCCPPass on core::mem::manually_drop::ManuallyDrop<T>::into_inner
BISECT: running pass (37953) SROAPass on core::ptr::const_ptr::<impl *const T>::offset
BISECT: running pass (37953) Simplify the CFG on function (core::ptr::drop_in_place<std::ffi::os_str::OsString>)
BISECT: running pass (37953) Machine Copy Propagation Pass on function (<num_rational::Ratio<isize> as num_traits::cast::FromPrimitive>::from_u64)
BISECT: running pass (37953) SROAPass on <I as core::iter::traits::collect::IntoIterator>::into_iter
BISECT: running pass (37953) IPSCCPPass on [module]
BISECT: running pass (37953) SimplifyCFGPass on rapier2d::counters::Counters::set_nconstraints

Anything look suspicious at first glance? My first thought is maybe something to do with the drop impls, or perhaps the IPSCCPPass on [module], since that could theoretically affect the whole program. I don't see anything to do with threads, unfortunately, which would probably make it more obvious.

The next step, I suppose, would be to look at the LLVM IR before and after this pass for each of the affected functions, and see if any of them look like they might be related to the crash.

AzureMarker commented 2 years ago

Thanks for taking a look at this. I'm interested in exploring another thread, which is whether the static memory for the thread info gets set correctly. Last I looked at the LLVM IR, the static was initialized correctly, but it was no longer set correctly by the time of the crash (it was zeroed out, but it should have a non-zero byte to indicate the Option is None).

AzureMarker commented 2 years ago

After rebasing the std and threads PR, updating dependencies, and merging master into test/rapier-physics, I now see a full ARM exception when running the example in debug mode:

[screenshot]

This failure happens a little earlier than thread_info::set: [screenshot]

AzureMarker commented 2 years ago

Actually, this may just be an issue in the new std changes or pthread_3ds: [screenshot]

AzureMarker commented 2 years ago

Pushed a fix in https://github.com/rust-lang/rust/pull/95897/commits/836146ce997b43d8c046d7ee55c3d231950865e5. Now it's back to the original issue.

Meziu commented 2 years ago

@ian-h-chamberlain are you sure those are the correct opt passes to cause the crash? Using your same parameters, the app didn’t crash for me…

ian-h-chamberlain commented 2 years ago

@ian-h-chamberlain are you sure those are the correct opt passes to cause the crash? Using your same parameters, the app didn’t crash for me…

@Meziu the exact number will probably depend on your rustc + std (I had a prebuilt std in my case). You might be able to find the pass number on your setup by using the same RUSTFLAGS but with -opt-bisect-limit=-1, grepping the output for the pass descriptions I posted above, and then trying again with the number you end up with.

Also make sure you cargo clean before the build, otherwise only the "changed" crates will be optimized and the pass numbers will look different as well.

If none of that works, you could always try the bisect-repro.sh script from my repo https://github.com/ian-h-chamberlain/3ds-codegen-crash-repro – but be aware it takes a while, I think ~18 builds from scratch are needed to bisect the issue (and you have to manually reset the 3DS and provide input to the script 😔).

Meziu commented 2 years ago

I absolutely cannot check this issue the way @ian-h-chamberlain did. I appreciate how much effort you've put in, but sadly my CPU is just way too slow for the build process (building and rebuilding on an Intel i5 M520; yes, I live in 2010).

I'm honestly not even sure how to tackle the problem. Looking at the whole idea of an optimization pass and how passes are handled, it makes little sense to me that the presence of one function changes the machine code of another. Maybe we still aren't understanding this problem correctly: just as the regex case was linked to something different (stack size) that was masked only when a specific amount of optimization was done, maybe something similar is behind our issue.

Meziu commented 2 years ago

I've just done a simple test I couldn't do before (especially relevant for @AzureMarker): I've tried running main directly, without any of the rt::lang_start code. While I had to remove the ctru::panic_handler since ThreadLocals weren't accessible anymore, the rest of the code (especially rapier2d) works fine, so I can confirm that crate doesn't use ThreadLocals at any point. This also means I have no idea why this whole problem even exists.

Meziu commented 2 years ago

I’ve started looking into the rapier2d code that caused the problem, and the issue seems to come and go if I just cut out some function calls.

E.g. removing the call to SharedShape::cuboid (inside the ColliderBuilder code) actually solves the issue, yet removing the actual code inside SharedShape::cuboid doesn’t change anything.

I'm still confused and unsure. The behaviour is a bit irrational, so I don't know if I checked everything, but that's how it looks. @ian-h-chamberlain how did you check the stack in the regex example?

ian-h-chamberlain commented 2 years ago

I'm still confused and unsure. The behaviour is a bit irrational, so I don't know if I checked everything, but that's how it looks. @ian-h-chamberlain how did you check the stack in the regex example?

The biggest clue I noticed was using gdb and seeing that &self was set to 0x0, e.g. what I wrote in https://github.com/Meziu/ctru-rs/pull/56#issuecomment-1100761853. Also, when I got Luma3DS crash dumps, the Stack dump: section would always be empty, and the crash would be an invalid write to the stack.

The crash in that case occurred immediately afterwards, since I think it was attempting to dereference the null frame pointer. I don't fully understand how that would happen, but those were some of the signs of a corrupt / invalid stack.

Removing the call to SharedShape::cuboid (inside the ColliderBuilder code) actually solves the issue, yet it looks like removing the actual code inside SharedShape::cuboid doesn’t change anything.

This is pretty interesting; it would seem to suggest that simply calling the function (modifying the stack pointer, I guess) is the cause of the problem, rather than any specific code in the function. How deep is the call stack if you put a breakpoint there?

ian-h-chamberlain commented 2 years ago

Ok, I looked at this a bit more and found some interesting things. I'm able to pinpoint the crash as coming from a cast of Arc<Cuboid> to Arc<dyn Shape>, as the crash is prevented with this change in parry2d:

diff --git a/src/shape/shared_shape.rs b/src/shape/shared_shape.rs
index be4792b..1c1f89a 100644
--- a/src/shape/shared_shape.rs
+++ b/src/shape/shared_shape.rs
@@ -103,7 +103,11 @@ impl SharedShape {
     /// Initialize a cuboid shape defined by its half-extents.
     #[cfg(feature = "dim2")]
     pub fn cuboid(hx: Real, hy: Real) -> Self {
-        SharedShape(Arc::new(Cuboid::new(Vector::new(hx, hy))))
+        let vec = Vector::new(hx, hy);
+        let cuboid = Cuboid::new(vec);
+        let arc = Arc::new(cuboid);
+        panic!();
+        SharedShape(arc)
     }

     /// Initialize a round cuboid shape defined by its half-extents and border radius.

Removing the panic, or explicitly writing let arc = Arc::new(cuboid) as Arc<dyn Shape>; before the panic, seems to induce the crash.

I also captured the difference in MIR between the explicit cast before and the implicit cast after the panic (paths shortened for brevity):

--- ../parry2d-crashing.mir 2022-05-15 22:31:05.000000000 -0400
+++ ../parry2d-noncrashing.mir  2022-05-15 22:32:14.000000000 -0400
@@ -103078,15 +103078,13 @@
     let mut _4: f32;                     // in scope 0 at parry/crates/parry2d/../../src/shape/shared_shape.rs:106:31: 106:33
     let mut _5: f32;                     // in scope 0 at parry/crates/parry2d/../../src/shape/shared_shape.rs:106:35: 106:37
     let mut _7: na::Matrix<f32, na::Const<2_usize>, na::Const<1_usize>, na::ArrayStorage<f32, 2_usize, 1_usize>>; // in scope 0 at parry/crates/parry2d/../../src/shape/shared_shape.rs:107:34: 107:37
-    let mut _9: std::sync::Arc<dyn shape::shape::Shape>; // in scope 0 at parry/crates/parry2d/../../src/shape/shared_shape.rs:108:19: 108:53
-    let mut _10: std::sync::Arc<shape::cuboid::Cuboid>; // in scope 0 at parry/crates/parry2d/../../src/shape/shared_shape.rs:108:19: 108:35
-    let mut _11: shape::cuboid::Cuboid;  // in scope 0 at parry/crates/parry2d/../../src/shape/shared_shape.rs:108:28: 108:34
+    let mut _9: shape::cuboid::Cuboid;   // in scope 0 at parry/crates/parry2d/../../src/shape/shared_shape.rs:108:28: 108:34
     scope 1 {
         debug vec => _3;                 // in scope 1 at parry/crates/parry2d/../../src/shape/shared_shape.rs:106:13: 106:16
         let _6: shape::cuboid::Cuboid;   // in scope 1 at parry/crates/parry2d/../../src/shape/shared_shape.rs:107:13: 107:19
         scope 2 {
             debug cuboid => _6;          // in scope 2 at parry/crates/parry2d/../../src/shape/shared_shape.rs:107:13: 107:19
-            let _8: std::sync::Arc<dyn shape::shape::Shape>; // in scope 2 at parry/crates/parry2d/../../src/shape/shared_shape.rs:108:13: 108:16
+            let _8: std::sync::Arc<shape::cuboid::Cuboid>; // in scope 2 at parry/crates/parry2d/../../src/shape/shared_shape.rs:108:13: 108:16
             scope 3 {
                 debug arc => _8;         // in scope 3 at parry/crates/parry2d/../../src/shape/shared_shape.rs:108:13: 108:16
             }
@@ -103112,8 +103110,8 @@
     }

     bb2: {
-        _11 = _6;                        // scope 2 at parry/crates/parry2d/../../src/shape/shared_shape.rs:108:28: 108:34
-        _10 = Arc::<Cuboid>::new(move _11) -> bb3; // scope 2 at parry/crates/parry2d/../../src/shape/shared_shape.rs:108:19: 108:35
+        _9 = _6;                         // scope 2 at parry/crates/parry2d/../../src/shape/shared_shape.rs:108:28: 108:34
+        _8 = Arc::<Cuboid>::new(move _9) -> bb3; // scope 2 at parry/crates/parry2d/../../src/shape/shared_shape.rs:108:19: 108:35
                                          // mir::Constant
                                          // + span: parry/crates/parry2d/../../src/shape/shared_shape.rs:108:19: 108:27
                                          // + user_ty: UserType(1)
@@ -103121,8 +103119,6 @@
     }

     bb3: {
-        _9 = move _10 as std::sync::Arc<dyn shape::shape::Shape> (Pointer(Unsize)); // scope 2 at parry/crates/parry2d/../../src/shape/shared_shape.rs:108:19: 108:35
-        _8 = move _9;                    // scope 2 at parry/crates/parry2d/../../src/shape/shared_shape.rs:108:19: 108:53
         panic(const "explicit panic") -> bb4; // scope 3 at /Users/ianchamberlain/Documents/Development/3ds/rust-horizon/library/core/src/panic.rs:50:9: 50:51
                                          // mir::Constant
                                          // + span: /Users/ianchamberlain/Documents/Development/3ds/rust-horizon/library/core/src/panic.rs:50:9: 50:33

The changes are fairly small but it is surprising to me that a simple trait cast like this would actually change the behavior of the entire program. There's still some mystery to solve about why the crash happens in a different part of the code, but it's interesting that the trait cast itself seems to have an effect.

I don't know too much about how Arc is implemented, but perhaps there is something going on there which relates to the thread info issue @AzureMarker wanted to investigate? I'll see if I can find anything else as well.

AzureMarker commented 2 years ago

Looking at the initial thread-local storage in both debug and release, they both set the 02 byte to make the Option None: [screenshot]

@ian-h-chamberlain you were able to get it into main? It crashes before main for me due to the thread_local is_none issue.

ian-h-chamberlain commented 2 years ago

@AzureMarker this is the minimum reproduction I've got now:

use std::sync::Arc;

use ctru::services::soc::Soc;
use parry2d::na::Vector2;
use parry2d::shape::{Cuboid, Shape};

fn main() {
    ctru::init();

    let mut soc = Soc::init().expect("failed to init SOC");
    soc.redirect_to_3dslink(true, true)
        .expect("failed to redirect stdio");

    eprintln!("creating vec");
    let v = Vector2::new(1.0, 1.0);
    eprintln!("creating cuboid");
    let c = Cuboid::new(v);
    eprintln!("creating Arc");
    let a = Arc::new(c);
    // eprintln!("casting Arc");
    // let b = a as Arc<dyn Shape>; // uncomment this, and crash happens before main
    eprintln!("Done!");
}

Interestingly, if I use #[start] to define an entrypoint instead of using the builtin one, I encounter an ARM exception in core::sync::atomic::atomic_add. I'm guessing this might be related to a runtime assertion in core that is failing with regard to the thread info initialization?

Edit: stack trace of the exception below. I guess I might actually need to pull out some of the panic handler stuff too, let me see...

(gdb) bt 
#0  0x001656a0 in core::sync::atomic::atomic_add<usize> (dst=0x0, val=1, 
    order=core::sync::atomic::Ordering::Relaxed)
    at /Users/ianchamberlain/.rustup/toolchains/horizon/lib/rustlib/src/rust/library/core/src/sync/atomic.rs:2604
#1  0x00165d20 in core::sync::atomic::AtomicUsize::fetch_add (self=0x0, val=1, 
    order=core::sync::atomic::Ordering::Relaxed)
    at /Users/ianchamberlain/.rustup/toolchains/horizon/lib/rustlib/src/rust/library/core/src/sync/atomic.rs:1957
#2  0x0015d644 in alloc::sync::{impl#22}::clone<std::thread::Inner> (self=0x2b1c20)
    at /Users/ianchamberlain/.rustup/toolchains/horizon/lib/rustlib/src/rust/library/alloc/src/sync.rs:1343
#3  0x001b8f44 in std::thread::{impl#22}::clone (self=0x2b1c20)
    at /Users/ianchamberlain/.rustup/toolchains/horizon/lib/rustlib/src/rust/library/std/src/thread/mod.rs:1099
#4  0x00160038 in std::sys_common::thread_info::current_thread::{closure#0} (info=0x2b1c14)
    at /Users/ianchamberlain/.rustup/toolchains/horizon/lib/rustlib/src/rust/library/std/src/sys_common/thread_info.rs:34
#5  0x0015ff68 in std::sys_common::thread_info::{impl#0}::with::{closure#0}<std::thread::Thread, std::sys_common::thread_info::current_thread::{closure_env#0}> (thread_info=0x2b1c10)
    at /Users/ianchamberlain/.rustup/toolchains/horizon/lib/rustlib/src/rust/library/std/src/sys_common/thread_info.rs:27
#6  0x0015e648 in std::thread::local::LocalKey<core::cell::RefCell<core::option::Option<std::sys_common::thread_info::ThreadInfo>>>::try_with<core::cell::RefCell<core::option::Option<std::sys_common::thread_info::ThreadInfo>>, std::sys_common::thread_info::{impl#0}::with::{closure_env#0}<std::thread::Thread, std::sys_common::thread_info::current_thread::{closure_env#0}>, std::thread::Thread> (self=0x2830cc, f=...)
    at /Users/ianchamberlain/.rustup/toolchains/horizon/lib/rustlib/src/rust/library/std/src/thread/local.rs:442
#7  0x0015fe90 in std::sys_common::thread_info::ThreadInfo::with<std::thread::Thread, std::sys_common::thread_info::current_thread::{closure_env#0}> (f=...)
    at /Users/ianchamberlain/.rustup/toolchains/horizon/lib/rustlib/src/rust/library/std/src/sys_common/thread_info.rs:20
#8  0x00160010 in std::sys_common::thread_info::current_thread ()
    at /Users/ianchamberlain/.rustup/toolchains/horizon/lib/rustlib/src/rust/library/std/src/sys_common/thread_info.rs:34
#9  0x001b7d20 in std::thread::current ()
    at /Users/ianchamberlain/.rustup/toolchains/horizon/lib/rustlib/src/rust/library/std/src/thread/mod.rs:676
#10 0x0012a5c0 in ctru::panic_hook_setup () at ctru-rs/src/lib.rs:42
#11 0x0012a5a8 in ctru::init () at ctru-rs/src/lib.rs:34
#12 0x00101244 in rapier_physics::main (_argc=1, _argv=0x8000000) at ctru-rs/examples/rapier-physics.rs:13
(gdb) 
AzureMarker commented 2 years ago

Does it need the SOC setup? Is it possible to trim it down to just the Arc casting (possibly with a simpler to construct data type)?

AzureMarker commented 2 years ago

The destination pointer is null, which is causing the crash. Edit: I think it's null because std thinks the thread info is already initialized?

ian-h-chamberlain commented 2 years ago

Ok, I tried to remove more. So far I haven't been able to find a simpler set of types that reproduces it, but this is a fairly small program that shows it:

#![feature(start)]

use std::sync::Arc;

use parry2d::na::Vector2;
use parry2d::shape::{Cuboid, Shape};

#[start]
fn main(_argc: isize, _argv: *const *const u8) -> isize {
    linker_fix_3ds::init();
    pthread_3ds::init();

    let _ = std::thread::current(); // crashes here

    let v = Vector2::new(1.0, 1.0);
    let c = Cuboid::new(v);
    let a = Arc::new(c);

    // but only if this line is present:
    let b = a as Arc<dyn Shape>;

    0
}

I still can't quite track down where the value is getting initialized improperly, and it's getting late so I think I'll have to pause for tonight.

Something I wonder is whether we might be hitting this path prematurely: https://github.com/rust-lang/rust/blob/master/library/std/src/thread/local.rs#L249 Or perhaps crate::mem::needs_drop is somehow returning the wrong thing, and it's immediately returning Some()?

AzureMarker commented 2 years ago

That's a good point about the state value. I haven't checked that during my debugging, but I'll take a look at it soon.

AzureMarker commented 2 years ago

I did more debugging, and I saw that STATE is handled correctly (it starts off as 0) and needs_drop must be working right, since the match STATE code ran.

Unfortunately gdb says that VAL isn't a symbol, so I can't directly check what is being returned.

I might try disabling thread_local support for the target and seeing if that fixes anything.

Here are some screenshots of the debug and release executions. Notice that both runs start off with the thread info as Some, but the release build corrects it back to None before the assertion. This might just be some sort of weird initialization or a line-number mixup.

Debug: [screenshot: not working]

Release: [screenshot: working]

AzureMarker commented 2 years ago

Yeah, it works if I disable has_thread_local. I'm going to add a commit to the STD PR to disable this since it seems to be causing issues.

Meziu commented 2 years ago

I don’t like this solution. It means LocalKey instances aren’t being dropped, and that we fall back to our slower version with pthread_3ds. I understand the problem impacts that flag, but are we sure we can do nothing about it?

We should dig deeper into the tests @ian-h-chamberlain and I have done. That flag works flawlessly every time except with this one specific crate (and its specific functions).

AzureMarker commented 2 years ago

Yeah, we can continue investigating, but to fix it for now (especially when merging to std, where it should be stable), disabling it is the right step.

Edit: they're still dropped, right? They should use the pthread key destructors. Though I don't think we finished implementing those.

ian-h-chamberlain commented 2 years ago

I've spent a bit more time trying to come up with a minimal reproduction (using some methodology from https://blog.pnkfx.org/blog/2019/11/18/rust-bug-minimization-patterns), and I've pushed the results to https://github.com/ian-h-chamberlain/3ds-codegen-crash-repro

Unfortunately nalgebra is a gnarly crate full of some intense generics, and dropping it or trying to replace it with my own types has so far not reproduced the crash. There's probably a bit more that can still be done to minimize the reproduction, but at least what I've got there is somewhat smaller than what we started with.

I might try to look at the IR generated for different cases (like why (u32, u32) reproduces but (u32) doesn't?) or similar, but wanted to let you know what progress I've made in the meantime.

ian-h-chamberlain commented 2 years ago

Okay, wow! I found something super interesting relating to the enum discriminant / memory layout.

Correct behavior (nalgebra calls omitted):

addr_of!(THREAD_INFO) = 0x00174adc
x/20 0x00174adc

0x174adc:   0   2   0   0
0x174aec:   0   0   2   0
0x174afc:   0   0   0   0
0x174b0c:   0   0   0   0
0x174b1c:   0   0   0   0

Bad behavior (nalgebra calls present):

addr_of!(THREAD_INFO) = 0x0018baf4
x/20 (0x0018baf4 - 24)

0x18badc:   0   2   0   0
0x18baec:   0   0   2   0
0x18bafc:   0   0   0   0
0x18bb0c:   0   0   0   0
0x18bb1c:   0   0   0   0

So somehow, it seems adding the extra calls to nalgebra stuff (just the presence of the code itself) is moving the TLS address by 24 bytes, but the #[thread_local] static still refers to the "correct" address somehow? Or vice versa (the TLS is unmoved but the address is offset by 24).

Another interesting point:

[src/main.rs:43] std::mem::size_of::<Matrix2<f32>>() = 16
[src/main.rs:44] std::mem::size_of::<Vector2<f32>>() = 8

Maybe this means that the address of the thread local is being offset by the presence of those two stack variables, which can be optimized out (or moved, or whatever) if they're not used in the Mul operation let _ = v * mul; ?

The good news seems to be that the thread-local variable is still being initialized properly; the bad news is that it's generated with the wrong address! I'm not really sure how to debug this kind of problem further. Maybe we'd need to look at the TLS implementation in LLVM? Or is it possible this is a 3DS kernel bug, or something up with the libctru TLS implementation? I started reading https://www.3dbrew.org/wiki/Thread_Local_Storage but I am way out of my depth here...


Edit: I also tried to reproduce this with other 24-byte stack variables, but so far haven't succeeded. Some stuff deep in the bowels of nalgebra must be doing more than I have tried manually...

Meziu commented 2 years ago

Hmm, I don't think it's an OS/kernel issue (and even if it was, the conditions under which it occurs are very very specific).

Another interesting point:

[src/main.rs:43] std::mem::size_of::<Matrix2<f32>>() = 16
[src/main.rs:44] std::mem::size_of::<Vector2<f32>>() = 8

At what opt-level did you get this info? Does it change with release or debug mode?

Edit: I want to note something small: the binary size with the matrix multiplication is about 1717108 bytes, but without it, it becomes 1549312 bytes. That is a considerable amount of pulled-in code for such a small program, so maybe something else (pulled in by the multiplication itself) is the cause?

ian-h-chamberlain commented 2 years ago

At what opt-level did you get this info? Does it change with release or debug mode?

Both debug and release show the same sizes (as far as I know struct size will not generally change depending on build mode, except maybe for #[cfg]'d out fields?).


The multiplication definitely is a lot of code... I tried duplicating the logic myself with custom types (copy-pasted from nalgebra), but I got a bit overwhelmed by the heavy generics and the algorithm, which seems to come from this crate: https://github.com/bluss/matrixmultiply

Maybe the call to that function is related to the extra code (and the origin of the weird thread-local offset)?

Meziu commented 2 years ago

The issue is in matrixmultiply::sgemm. Interesting thing to note: this crate is actually optional and only used on targets with the std feature. (Maybe an issue in our std or libc implementation?)

ian-h-chamberlain commented 2 years ago

Hmm, that is interesting, especially since those calls are behind an if that I think is always false in this program: https://github.com/dimforge/nalgebra/blob/dd801567f274fb855d3bd0f0ddcd4cbb0cebb6d3/src/base/blas_uninit.rs#L216-L222 (meaning they should be compiled out), but maybe that code is only culled when the optimizer runs a bit more (hence the program working in --release)...

Meanwhile I've been trying to track down the cause of the weird offset, and I found this: in the code that dereferences the thread-local, there is something like this:

  1084c0:   e59f10dc    ldr r1, [pc, #220]  ; 1085a4 <crash_repro::repro+0x164>
  1084c4:   eb0000bc    bl  1087bc <__aeabi_read_tp>
  1084c8:   e0800001    add r0, r0, r1
  1084cc:   ebffe7ab    bl  102380 <core::cell::RefCell<T>::try_borrow_mut>

The __aeabi_read_tp implementation is provided by libctru, and looks to be correct as far as I can tell. The interesting thing is that r1 load, which is supposed to be the offset of the thread-local variable from the start of TLS. Its definition just comes from a constant pool nearby:

  10859c:   001794b8    .word   0x001794b8
  1085a0:   0017995c    .word   0x0017995c
  1085a4:   00000020    .word   0x00000020 ; <-- this one
  1085a8:   00179994    .word   0x00179994
  1085ac:   00179140    .word   0x00179140
  1085b0:   00179978    .word   0x00179978

In a correct program, that offset is 0x08 instead of 0x20, which effectively points to the beginning of thread-local storage (8 bytes for an ARM ABI header, I think). What's not clear to me is how that value actually gets generated (I think just by LLVM?) or why it might be off like this. We may be looking at an LLVM bug, but the fact that the presence of some other code changes it means perhaps there is undefined behavior so the compiler is generating nonsense...

cargo 3ds rustc -- --emit=asm shows one other interesting bit of data around this, which is the "name" of that constant:

.LCPI234_33:
    .long   .L__unnamed_45
    .p2align    2
.LCPI234_34:
    .long   crash_repro::LOCAL_STATIC(TPOFF) ; <-- this one
    .p2align    2
.LCPI234_35:
    .long   __tls_start
    .p2align    2
.LCPI234_36:
    .long   .L__unnamed_46
    .p2align    2

This output looks the same for both correct and incorrect programs, which is why I suspect an LLVM bug, but I'm not too sure what else to look at. I did find this issue which looks to be sort of similar, but only applies for a certain relocation model (not static like we are using) and has a workaround by setting codegen-units=1: https://github.com/rust-lang/rust/issues/95871
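For completeness, the codegen-units=1 workaround mentioned in that issue would look something like this in Cargo.toml (a sketch; it reportedly helps only for the non-static relocation model, so it may not apply here):

```toml
# Single codegen unit, as suggested in rust-lang/rust#95871.
[profile.dev]
codegen-units = 1

[profile.release]
codegen-units = 1
```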

You can get the program to print the value of the offset with something like this:

#![feature(asm_sym)]
#![feature(thread_local)]

use std::arch::asm;

// Placeholder; any #[thread_local] static works here.
#[thread_local]
static LOCAL_STATIC: u32 = 0;

fn main() {
    let offset: u32;
    unsafe {
        asm!(
            "ldr {offset}, ={local_static}(TPOFF)",
            local_static = sym LOCAL_STATIC,
            offset = out(reg) offset,
        );
    }
    eprintln!("offset = {offset:#X}");
}

With that in mind, I'm going to see if I can reduce the program further, but I imagine that it would still need the matrixmultiply code somewhere to reproduce, so it's probably still worth digging down into that as well...

Meziu commented 2 years ago

@ian-h-chamberlain I can confirm the issue happens on the loop at https://github.com/bluss/matrixmultiply/blob/4f841faf50fd61367123b89289dc361f34b1f33e/src/gemm.rs#L299.

Yes, those are thread locals that are used...

I'll look into the exact reason behind the issue, but it's already starting to look like a crate-specific issue.

ian-h-chamberlain commented 2 years ago

@Meziu nice find! With that, the reproduction gets way smaller and actually reproduces without any dependencies (other than ctru for printing/socket output): https://github.com/ian-h-chamberlain/3ds-codegen-crash-repro/blob/main/src/main.rs

I have a feeling it may be related to this issue and that LLVM is generating some weird alignment or something: https://github.com/rust-lang/rust/pull/51828 Perhaps a similar kind of workaround would work for us but I think it'll take a little more investigation.

Meziu commented 2 years ago

I'd like to note it still makes a difference between release and debug, even if the buffer is accessed either way. Maybe we could check the optimisation passes again since the program just got a lot smaller.

ian-h-chamberlain commented 2 years ago

Good point, I was able to bisect fairly quickly since the program doesn't force a crash + reboot of the device any longer. These are the passes I found:

$ egrep '\(99\)' cargo-bisect-output.txt | rustfilt
BISECT: running pass (99) EarlyCSEPass on std::ffi::os_str::<impl core::convert::AsRef<std::ffi::os_str::OsStr> for alloc::string::String>::as_ref
BISECT: running pass (99) EarlyCSEPass on core::ptr::drop_in_place<once_cell::imp::Guard>
BISECT: running pass (99) EarlyCSEPass on <libc::unix::newlib::in_addr as core::clone::Clone>::clone
BISECT: running pass (99) EarlyCSEPass on <&T as core::fmt::Debug>::fmt
BISECT: running pass (99) PromotePass on core::mem::maybe_uninit::MaybeUninit<T>::as_mut_ptr
BISECT: running pass (99) EarlyCSEPass on <core::ops::range::RangeTo<usize> as core::slice::index::SliceIndex<[T]>>::get_unchecked_mut
BISECT: running pass (99) EarlyCSEPass on <core::ops::range::RangeFrom<usize> as core::slice::index::SliceIndex<[T]>>::index
BISECT: running pass (99) IPSCCPPass on [module]

I suspect the last one (it's some kind of constant propagation that runs on the whole program), but after looking a bit I fear that this may just be masking the real issue, and the code has moved around enough that the offsets "happen to work" or something like that.

When I examined the object file before linking, it had all zeros for the values of the thread-local offsets, so I think it's actually the linker that fills in this information from the relocation table when it links the final executable. So I think there are two most likely possibilities:

I might play around with some different relocation flags to the linker and see what changes, if anything.

Edit: something else I found that's worth mentioning: 3dsxtool itself performs some linker-like relocation (and specifically supports some ARM_TLS relocs): https://github.com/devkitPro/3dstools/blob/master/src/3dsxtool.cpp#L177-L180

I don't think this is the root cause, because I was able to reproduce the issue in Citra using the plain .elf as well as the 3dsx, but it is something to keep in mind.

ian-h-chamberlain commented 2 years ago

Ok, I've gotten some evidence that this is an issue with the linker and/or the linker script, rather than LLVM. We may still end up wanting to work around it with a rustc workaround, but maybe we can just fix the upstream devkitARM toolchain instead.

This C program also reproduces the issue:

```c
#include <3ds.h>
#include <stdio.h>
#include <stdbool.h>

typedef ALIGN(4) struct {
    u8 inner[3];
} Align4;

typedef ALIGN(16) struct {
    u8 inner[3];
} Align16;

static __thread Align4 BUF_4 = {.inner = {2, 2, 2}};
static __thread Align16 BUF_16 = {.inner = {1, 1, 1}};

int main(int argc, char** argv)
{
    gfxInitDefault();
    consoleInit(GFX_TOP, NULL);

    BUF_16.inner[0] = 0;

    bool reproduced = false;
    printf("[");
    for (int i = 0; i < 3; i++) {
        if (BUF_4.inner[i] != 2) {
            reproduced = true;
        }
        printf("%d, ", BUF_4.inner[i]);
    }
    printf("]\n");

    if (reproduced) {
        printf("reproduced!\n");
    } else {
        printf("nope");
    }

    // Main loop
    while (aptMainLoop()) {
        gspWaitForVBlank();
        hidScanInput();

        u32 kDown = hidKeysDown();
        if (kDown & KEY_START)
            break; // break in order to return to hbmenu

        // Flush and swap framebuffers
        gfxFlushBuffers();
        gfxSwapBuffers();
    }

    gfxExit();
    return 0;
}
```

The symptoms look roughly the same, although with a different compiler things get reordered a bit. If I remove BUF_16 from the program, the issue no longer reproduces. I'll keep looking into it, but I think we can safely say this is a toolchain issue and not a rustc one.

AzureMarker commented 2 years ago

Thanks both of you for the deep investigation!

Meziu commented 2 years ago

@AzureMarker we should probably try to close this issue before pushing the thread-related code to Rust’s tree. Having support for has_thread_local would be quite nice.

Meziu commented 2 years ago

Btw, since it hasn't been mentioned yet, I'd like to dump this link here about libctru's TLS support: https://github.com/devkitPro/libctru/commit/3f9513e62cc6a0cff487da73c202a8c43646d1ad

ian-h-chamberlain commented 2 years ago

On a related note – I've tried to ask about this on the devkitpro.org forums, but it seems my post is still hidden since I am a newly registered user there. Hopefully a mod will approve it soon so the question actually gets posted... I will link it here if and when it does.

Meziu commented 2 years ago

I've opened an issue on libctru for good measure, but I don't know if they usually respond to those.

AzureMarker commented 2 years ago

> @AzureMarker we should probably try to close this issue before pushing the thread-related code to Rust’s tree. Having support for has_thread_local would be quite nice.

Can I open the thread PR in parallel? We can always open another small PR to re-enable the native thread local support.

Meziu commented 2 years ago

> > @AzureMarker we should probably try to close this issue before pushing the thread-related code to Rust’s tree. Having support for has_thread_local would be quite nice.
>
> Can I open the thread PR in parallel? We can always open another small PR to re-enable the native thread local support.

Well, neither the post on the forum nor the issue has had any immediate reaction. Opening the PR now will speed up the merge (especially the review). You can open it.

AzureMarker commented 2 years ago

FYI: I opened the thread PR: https://github.com/rust-lang/rust/pull/98514

Meziu commented 2 years ago

Until https://github.com/devkitPro/libctru/issues/497 is somehow fixed (which, realistically, it can't be), we can't do anything about this. The fix we already have in place (a heap-based TLS system) is fine for now, but it hurts to see so much performance wasted.

AzureMarker commented 2 years ago

We could still leave this issue open, since it is an "open issue".

Meziu commented 2 years ago

> We could still leave this issue open, since it is an "open issue".

Well, yeah, but this is neither a ctru-rs issue nor an issue related to our toolchain at all. We may open a related issue in pthread-3ds just as a reminder in case anything changes though.

On a side note, can you confirm TLS variables are dropped correctly in our current pthread implementation? I haven’t been able to test it since the thread-supporting std branch is still out of tree.

AzureMarker commented 2 years ago

I don't think we call the destructors. Or, at least I didn't find any code that does that. I opened https://github.com/rust3ds/pthread-3ds/issues/19

ian-h-chamberlain commented 2 years ago

Hooray!!

When I use all the following:

The example now works, both emulated and on hardware! Screenshot from Citra (since Luma screenshots still show up blank sometimes):

*(screenshot: Screen Shot 2022-08-20 at 3 43 34 PM)*