Sporadic panic in async-io thread with esp-sys-svc async TLS example

jothan commented 5 months ago

Code: https://github.com/esp-rs/esp-idf-svc/blob/4fff46bba1be66ae3c6945d3b8bda30a589f6f6b/examples/tls_async.rs

Crate versions: Cargo.lock

I am running the example on a ESPcam board, with an esp32 (revision v3.0) Xtensa module (Tinker AI ESP32-S, to be exact).

The example usually crashes once or twice after reset before working.

I've only managed to reproduce this with an optimized debug build with codegen-units=1.

[profile.de[v]
debug = true
opt-level = "s"
codegen-units = 1

Vollbrecht commented 5 months ago

I've only managed to reproduce this with an optimized debug build with codegen-units=1.

To be more concrete you never saw it before applying this optimization?

jothan commented 5 months ago

I've only managed to reproduce this with an optimized debug build with codegen-units=1.

To be more concrete you never saw it before applying this optimization?

To be honest, it seems to depend on code layout. It might trigger in other conditions, I have not managed to narrow it down. It smells like a race condition of some sort.

notgull commented 5 months ago

cc @ivmarkov Any idea why this would be happening?

notgull commented 5 months ago

The issue actually appears to be in our locking implementation, specifically OnceCell.

ivmarkov commented 5 months ago

cc @ivmarkov Any idea why this would be happening?

No idea. Will try to reproduce once I'm back.

jothan commented 4 months ago

@ivmarkov have you had any luck reproducing this ? Let me know if you need further input and testing from my part.

ivmarkov commented 4 months ago

Couldn't get to there yet. Sorry about that! Very likely will try to do something related to async-io in the next couple of days.

npmenard commented 4 months ago

I am running in the same issue as described by @jothan

I have done some investigation and I believe the issue boils down to the pinning of event_listener In this case the address generated is not 32 bit aligned (ex : 0x3ffe860f not sure if this matters) but more importantly point back to an area of the stack where the first byte is not 0x0. Using GDB i can see that the value it points to is 0x3f (which is the MSB of an address used during the creation of the thread) since the remain bytes are 0 maybe 0xffe8610 would be the correct pinned address? Interestingly if I add a dummy read before pinning like :

let event_listener = EventListener::new();
let _ = event_listener.is_listening();
pin!(event_listener);

The pinned address will be 32bit aligned pointing to a stack area where all bytes are 0 and the bug doesn't happen.

For reference here are the two assembly dump of the function not working working

Hope that helps @ivmarkov

ivmarkov commented 4 months ago

@npmenard Thank you for the detailed analysis!

I can reproduce the issue (though I haven't connected with GDB to examine the address), and I confirm that the dummy listener.is_listening(); is fixing the issue for me too - but only in a more trivial case I have. In a more complex setup, I have another crash - something related to thread locals - but let's put that aside for a moment.

(Also, not sure if that matters, but I'm consistently reproducing the issue with opt-levels "s" and "z" (haven't tried with other opt levels yet, will do)).

As for:

I have done some investigation and I believe the issue boils down to the pinning of event_listener In this case the address generated is not 32 bit aligned (ex : 0x3ffe860f not sure if this matters)

I assume by address 0x3ffe860f you mean the &mut EventListener which is actually stored in the Pin struct? Yes, not being 32-bit aligned sounds suspicious, as long as the Event:Listener contains (recursively) at least one integer type which is 4 bytes. As in i32 or u32. If that's the case, the whole structure needs to be aligned to 4 bytes of course, or else the xtensa cpu would generate a trap.

but more importantly point back to an area of the stack where the first byte is not 0x0. Using GDB i can see that the value it points to is 0x3f (which is the MSB of an address used during the creation of the thread) since the remain bytes are 0 maybe 0xffe8610 would be the correct pinned address?

Even more suspicious. This sounds a bit like a bug in the LLVM xtensa backend.

@MabezDev - sorry for pulling you here, but do you think you and/or Andrei can help? There are assembly dumps attached to the previous comment. I can also try to simplify and provide a small(er) test case, though with async-io in the picture, it would be an ESP IDF IO example in the end...

npmenard commented 4 months ago

I assume by address 0x3ffe860f you mean the &mut EventListener which is actually stored in the Pin struct? Yes, not being 32-bit aligned sounds suspicious, as long as the Event:Listener contains (recursively) at least one integer type which is 4 bytes. As in i32 or u32. If that's the case, the whole structure needs to be aligned to 4 bytes of course, or else the xtensa cpu would generate a trap.

It's the address that would be printed when formatting event_listener with {:p} after it was pinned for example : log::info!( "pin_ev_ptr {:p}", event_listener); In my case the cpu doesn't generate a trap when loading at an unaligned address. To illustrate see is_listening assembly dump and the registers status when we check self.listener.listener.is_some() (l32i.n a2, a2, 0). Register a2 contains the unaligned address and once the load is executed it will contain 3f when it should be 0.

I have noticed something odd but I am not sure if it will shade some light, whenever the bug occurs the value printed by log::info!( "pin_ev_ptr {:p} {:x}", event_listener, *(&(&**(&event_listener.pointer))) as *const _ as u32); are different.

Lastly I am attaching a backtrace showing the code path that writes at the location in stack where ultimatley the Pin will point to. It happens during WindowOverflow8 exception is triggered when setting a thread local variable.

Hopefully this helps a bit. Let me know if there is anything else I can do to help.

jothan commented 4 months ago

With the new Rust 1.78 for ESP32 (in pre-release in the rust-build project), debug assertions for std are now enabled and seem to catch a similar alignment problem with an assertion in core::ptr::read::precondition_check.

Full backtrace: backtrace.txt

Partial backtrace:

Guru Meditation Error: Core  0 panic'ed (Unhandled debug exception).                                                                                          
Debug exception reason: BREAK instr                                                                                                                           
Core  0 register dump:                                                                                                                                        
PC      : 0x400f0d06  PS      : 0x00060c36  A0      : 0x800f0ab0  A1      : 0x3ffcdfc0                                                                        
0x400f0d06 - core::panicking::panic_nounwind_fmt::runtime                                                                                                     
    at /home/user/.rustup/toolchains/esp/lib/rustlib/src/rust/library/core/src/panicking.rs:91                                                                
A2      : 0x3ffce060  A3      : 0x00000040  A4      : 0x000001c0  A5      : 0x3ffce220                                                                        
A6      : 0x00060b23  A7      : 0x00060b20  A8      : 0x800f0d01  A9      : 0x3ffcdfa0                                                                        
A10     : 0x00000000  A11     : 0x00000040  A12     : 0x000000f8  A13     : 0x3ffce4e0                                                                        
A14     : 0x3fff3fc8  A15     : 0x400f0cf4  SAR     : 0x00000018  EXCCAUSE: 0x00000001                                                                        
0x400f0cf4 - core::ptr::read::precondition_check                                                                                                              
    at /home/user/.rustup/toolchains/esp/lib/rustlib/src/rust/library/core/src/intrinsics.rs:2797                                                             
EXCVADDR: 0x00000000  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0x00000000                                                                        

Backtrace: 0x400f0d03:0x3ffcdfc0 0x400f0aad:0x3ffcdfe0 0x400f0beb:0x3ffce5c0 0x400ee62d:0x3ffce860 0x400f14bb:0x3ffce880 0x400f148d:0x3ffce8b0 0x400d6e7e:0x3\
ffce8e0 0x400e3afe:0x3ffce930 0x400decc1:0x3ffced20 0x401a1419:0x3ffcf150 0x400d6a52:0x3ffcf170 0x400d479b:0x3ffcf230 0x400d6450:0x3ffcf350 0x400e82f1:0x3ffc\
f5a0 0x400da0c2:0x3ffcf880 0x400da994:0x3ffcf8a0 0x4011b067:0x3ffcf8d0 0x4011b04e:0x3ffcf8f0 0x40120ff3:0x3ffcf910 0x4012cf40:0x3ffcf930                      
0x400f0d03 - core::ptr::read::precondition_check                                                                                                              
    at /home/user/.rustup/toolchains/esp/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:1282                                                                
0x400f0aad - core::ptr::read                                                                                                                                  
    at /home/user/.rustup/toolchains/esp/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:1276                                                                
0x400f0beb - async_lock::once_cell::OnceCell<T>::get_or_try_init_blocking                                                                                     
    at /home/user/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-lock-3.3.0/src/once_cell.rs:441                                                  
0x400ee62d - async_lock::once_cell::OnceCell<T>::get_or_init_blocking

jothan commented 4 months ago

Looking at the backtraces, it would seem that OnceCell does something weird that causes EventListener trouble down the line.

ivmarkov commented 4 months ago

Looking at the backtraces, it would seem that OnceCell does something weird that causes EventListener trouble down the line.

Why would OnceCell be doing something weird and moreover, this something only happens with esp idf + xtensa and not on x86 and (likely) arm? I still find it more likely that the root cause is a miscompilation bug in the llvm xtensa backend.

Will open a bug at the xtensa llvm fork shortly btw.

And will test (again) with riscv32imc on the esp idf, though I'm relatively sure the issue did not happen there.

jothan commented 4 months ago

@ivmarkov Doing something weird in this case is triggering the miscompilation. I don't doubt your analysis.

ivmarkov commented 4 months ago

Other than dereferencing a raw pointer, I don't see anything else weird, unfortunately...

ivmarkov commented 3 months ago

I know there is little to no progress here, for which I apologize. In the meantime, what I found to work for me is setting the version of async-io back to "2.0.0" (as in async-io = "=2.0.0").

We'll resume the work on finding the root cause shortly.

notgull commented 3 months ago

Does this issue still occur with async-lock v3.4.0?

jothan commented 3 months ago

@notgull It still panics in the same spot with debug assertions on (Rust 1.78), however, it does not crash when they are off.

jothan commented 1 month ago

@ivmarkov Any news on the LLVM front ? I can reproduce unaligned pointers on the latest 1.80 toolchain.

ivmarkov commented 1 month ago

@ivmarkov Any news on the LLVM front ? I can reproduce unaligned pointers on the latest 1.80 toolchain.

@mabezdev sorry for the ping ^^^

@jothan I know this is not really a proper fix, but you can give async-io-mini a try until a proper fix is in-place. async-io-mini is API-compatible with async-io as long as you don't use the async-io timers - ~~which you anyway probably shouldn't as on the ESP IDF they have a minimum res of 10ms~~ (UPDATE: on a second thought, this is likely not true).

jothan commented 1 month ago

@ivmarkov Any news on the LLVM front ? I can reproduce unaligned pointers on the latest 1.80 toolchain.

@MabezDev sorry for the ping ^^^

@jothan I know this is not really a proper fix, but you can give async-io-mini a try until a proper fix is in-place. async-io-mini is API-compatible with async-io as long as you don't use the async-io timers - ~which you anyway probably shouldn't as on the ESP IDF they have a minimum res of 10ms~ (UPDATE: on a second thought, this is likely not true).

I might give async-io-mini a try, my blocker is mostly my use of smol-hyper (that depends on regular async-io). I am also using some async-io timers, but I can always use some other timers.

bugadani commented 3 days ago

Did any of you end up opening an issue in https://github.com/espressif/llvm-project ? I can't find it.

ivmarkov commented 3 days ago

Did any of you end up opening an issue in https://github.com/espressif/llvm-project ? I can't find it.

I didn't. I really feel uncomfortable opening it and providing details myself, as my knowledge in xtensa is exactly 0%. I keep pushing though.

Would you like to take it upon yourself? Would really appreciate that and you might be the right person, given your xtensa mis-compilations' history, so to say.

bugadani commented 3 days ago

I can reliably reproduce a panic with this:

use core::pin::pin;

use esp_idf_svc::io;

// Don't forget to raise the CONFIG_PTHREAD_TASK_STACK_SIZE_DEFAULT in `sdkconfig.defaults` to > 4K so that the
// `async-io` background thread can work fine
pub fn main() {
    esp_idf_svc::sys::link_patches();
    esp_idf_svc::log::EspLogger::initialize_default();
    io::vfs::initialize_eventfd(5).unwrap();

    // You can use `esp_idf_svc::hal::task::block_on` as well
    async_io::block_on(pin!(async move { Result::<_, anyhow::Error>::Ok(()) }));
}

thread 'async-io' panicked at C:\Users\bugad\.rustup\toolchains\esp\lib\rustlib\src\rust\library\core\src\panicking.rs:219:5:
unsafe precondition(s) violated: ptr::read requires that the pointer argument is aligned and non-null

But removing pin! just makes this work.

This sounds a bit different from the report here, so can someone please verify if this is the actual issue?

ivmarkov commented 3 days ago

I can reliably reproduce a panic with this:

use core::pin::pin;

use esp_idf_svc::io;

// Don't forget to raise the CONFIG_PTHREAD_TASK_STACK_SIZE_DEFAULT in `sdkconfig.defaults` to > 4K so that the
// `async-io` background thread can work fine
pub fn main() {
    esp_idf_svc::sys::link_patches();
    esp_idf_svc::log::EspLogger::initialize_default();
    io::vfs::initialize_eventfd(5).unwrap();

    // You can use `esp_idf_svc::hal::task::block_on` as well
    async_io::block_on(pin!(async move { Result::<_, anyhow::Error>::Ok(()) }));
}

thread 'async-io' panicked at C:\Users\bugad\.rustup\toolchains\esp\lib\rustlib\src\rust\library\core\src\panicking.rs:219:5:
unsafe precondition(s) violated: ptr::read requires that the pointer argument is aligned and non-null

But removing pin! just makes this work.

This sounds a bit different from the report here, so can someone please verify if this is the actual issue?

Is it? To me, it sounds suspiciously close to this and the following comments.

It is as if the code generated by the pin! macro results in a mis-aligned pointer address, due to mis-compilation?

pin! generates simply:

$crate::pin::Pin::<&mut _> { __pointer: &mut { $value } }

According to the report I quoted, the above code results in: a) ~~__pointer containing misaligned address~~ UPDATE: or more likely calling Deref on the Pin type b) This address seems to be wrong (off by 1 byte)

ivmarkov commented 3 days ago

Actually @npmenard says something even more interesting:

I have noticed something odd but I am not sure if it will shade some light, whenever the bug occurs the value printed by log::info!( "pin_ev_ptr {:p} {:x}", event_listener, *(&(&*(&event_listener.pointer))) as const _ as u32); are different.

Sounds like Deref on Pin produces bogus code.

bugadani commented 3 days ago

Well I guess we can blame pin, but that might also be independent of our issue. What's weird is that pin seems to carbon-copy whatever you give it, which may be slightly problematic for a self-referencing generator.

Whether this is related to an architecture-specific crash... I don't know.

ivmarkov commented 3 days ago

Well I guess we can blame pin, but that might also be independent of our issue.

I'm not blaming pin. But so far, it seems we can only reproduce the mis-compilation when pin is used. I'm sure the mis-compilation has a bigger blast radius, but so far we have seen it happening when pin is somehow involved.

What's weird is that pin seems to carbon-copy whatever you give it, which may be slightly problematic for a self-referencing generator.

I would not expect anything else given that Rust arrays are Copy? self-referencing generators that you get from async are certainly not, but I think this point is (hopefully) irrelevant for the issue at hand.

bugadani commented 3 days ago

Why do you expect &mut array to duplicate the array on stack?

Regardless, I can reproduce the issue (using different opt levels, I get very different crashes and occasionally something working), I'm trying to minimize it somehow.

ivmarkov commented 3 days ago

Why do you expect &mut array to duplicate the array on stack?

Because the pin! macro does not seem to simply expand to a &mut array? I find this comment relevant (line 1959). It talks about moving, but since arrays are Copy, what happens is copying instead of moving, I guess. Would you agree?

Regardless, I can reproduce the issue (using different opt levels, I get very different crashes and occasionally something working), I'm trying to minimize it somehow.

Good news - if you can minimize it, you would be our hero! :)

bugadani commented 2 days ago

So, in my case, the root cause of the crash is:

Reactor is a 448 byte struct, with alignment 64
OnceCell initializes this struct in an async closure, and taking the value out of a Ready is the pointer read where my app crashes
The cause of the crash is that Ready and its contents aren't placed on a 64-byte allocated address. (in my case, 0xsomethingsomethingA0, or 0x...30)

Now let's see if I can break a non-idf application with this knowledge :)

ivmarkov commented 2 days ago

So, in my case, the root cause of the crash is:

Reactor is a 448 byte struct, with alignment 64

Alignment of 64 bytes? :o How is that even possible? What might require such an alignment? I think on xtensa, the maximum alignment is 64 bits which is only 8 bytes?

OnceCell initializes this struct in an async closure, and taking the value out of a Ready is the pointer read where my app crashes

The cause of the crash is that Ready and its contents aren't placed on a 64-byte allocated address. (in my case, 0xsomethingsomethingA0, or 0x...30)

Yes this is aligned to 64bits, but obviously not to 64 bYtes.

bugadani commented 2 days ago

https://doc.rust-lang.org/std/mem/fn.align_of_val.html

Returns the ABI-required minimum alignment of the type of the value that val points to in bytes.

The culprit is ConcurrentQueue, it raises the alignment requirement (with a contained CachePadded type, which is repr(align(64))).

ivmarkov commented 2 days ago

I think you are getting there! It might well be, that the LLVM xtensa backend simply does not support such an alignment. Or would that be the linker? Not sure...

smol-rs / async-lock

Sporadic panic in async-io thread with esp-sys-svc async TLS example #84