nbdd0121 / unwinding

Stack unwinding library in Rust
Apache License 2.0
97 stars 17 forks source link

MIPS Support #23

Closed SK83RJOSH closed 6 months ago

SK83RJOSH commented 6 months ago

Hello, I'm targeting bare metal on mipsel-sony-psp and would love to be able to use unwinding instead of libunwind. I managed to get it mostly working by manually adding an Arch + registers for MIPS and following the bare metal guide + no std sample.

Unfortunately I think there's likely some MIPS specific magic required though, as raise exception fails with end of stack (5).

I figured you might be able to point me in the right direction? :)

nbdd0121 commented 6 months ago

End of stack likely just means it's not caught Do you have a catch_unwind somewhere on the call stack?

SK83RJOSH commented 6 months ago

Do you have a catch_unwind somewhere on the call stack?

I'm running under emulation, with somewhat subpar tools (old school assembly level debugger); but tracing the program's execution and I can see that my call to winding::catch_unwind is executed, and that it invokes my main function (PSP has a custom executable entry point, so I trampoline to my main from there with a call to catch_unwind(my_main)).

In the actual code raising the exception however (first statement in main is panic!("test panic!")), it seems like the callstack is cut off at do_panic, but this may be an issue with the debugger more than anything else given the above. 🙂

The net result is my custom panic.rs ends up printing to screen, but the panic doesn't get raised: image

Some things I'm left wondering is whether or not my linker script may be incorrect, and the one or all of the symbols used for static unwinding are mis-located or similar? Or perhaps there are some changes that need to be made to one or both of gimli/unwinding to correctly handle things on MIPS?

Here's my patches so far (somewhat hacky, I need to upstream MIPS arch to gimli and bypassed it for now) and my linkerscript.

nbdd0121 commented 6 months ago

I think your register saving code is incorrect. You need to ensure that all registers are saved to the correct offset of the Context struct.

SK83RJOSH commented 6 months ago

I think your register saving code is incorrect.

Are they used for anything beyond debug printing? I noticed this as well this morning, but figured it innocuous for now as the restore will work regardless. ^^

I'll give it a try after work though 😄

nbdd0121 commented 6 months ago

It's necessary for stack unwinding.

SK83RJOSH commented 6 months ago

Quick update, I correctly calculated all the offsets, as well as the size of the context; and I'm unfortunately still encountering the same issue. Really scratching my head on this one and think I may just have to throw in the towel.

The only lead I have at this point is that Frame::from_context seems to return None on the first iteration. The linker script looks correct to me, and dumpdwarf shows that .eh_frame exists and has entries... I guess this could end up being a gimli issue, but I don't know if we've entirely eliminated PEBKAC just yet... 🙂

Any ideas what may lead to this? Are there any decent ways to verify that __etext and __eh_frame are correctly located? After becoming more acquainted with how unwinding works it really does seem like things should just work assuming __eh_frame is located correctly and readable + context[SP] returns the correct value. 😕

nbdd0121 commented 6 months ago

It's very hard to tell from distance what's wrong. It could be with the assembly, gimli or the usage.

I'd suggest you to submit your change as a draft PR and add mipsel-unknown-linux-gnu and mips-unknown-linux-gnu to the target matrix in this file. This way we can isolate the usage part.

SK83RJOSH commented 6 months ago

So to follow up on this, I'm still having trouble with both:

It seems like it's failing even on x86_64-unknown-linux-gnu based on testing. I hate to ask, but would you mind looking over the the test and seeing if there's anything obviously wrong with the setup? Otherwise I think I've isolated the issue to specifically these two features/the combination of them with a custom panic_handler.rs (albeit, I couldn't tell you the root cause, and that test should probably use the panic-handler feature rather than reimplementing it.).

nbdd0121 commented 6 months ago

I didn't see where fde-static is used

SK83RJOSH commented 6 months ago

Not using it in that test specifically, so I don't need to manually link the symbol for all configurations, but it exhibits the same unwind failure as manually linking it in. I would expect registry to behave identically to a well formed test with static? Or is that a bad assumption? :)

My apologies though, the test name is a misnomer, and it should be _registry – still need to think about how I can implement a proper "bare metal"-like test without a ton of build.rs and linker script shenanigans.

nbdd0121 commented 6 months ago

Registry does nothing if there's no explicit registration

SK83RJOSH commented 6 months ago

Ah, okay, misinterpreted for dynamic registration ie/ it doesn't provide dynamic registration... sorry... time to do a proper static test then :)

SK83RJOSH commented 6 months ago

Yeah, I think I've exhausted all of my options trying to triage this one. Between your changes, and the many dozens of permutations of linker scripts and build changes I've tried to make; I simply can't get any sort of unwinding to work on mipsel-sony-psp. Going to close this out as not to clutter your issues, and will try to get the PR merged at least.

nbdd0121 commented 6 months ago

What issue are you still having after #24? Try verifying the actual value of __executable_start, etext and eh_frame to see if they point to the correct location.

nbdd0121 commented 6 months ago

According to the linker script, the __eh_frame_hdr_start symbol is already exposed.

Perhaps you can enable the fde-custom feature and try this:

use unwinding::custom_eh_frame_finder::*;

struct MyFrameFinder;

unsafe impl EhFrameFinder for MyFrameFinder {
    fn find(&self, pc: usize) -> Option<FrameInfo> {
        extern "C" {
            static __eh_frame_hdr_start: u8;
        }
        Some(FrameInfo {
            text_base: 0,
            kind: FrameInfoKind::EhFrameHdr(&__eh_frame_hdr_start as *const u8 as usize),
        })
    }
}

// Usage
set_custom_eh_frame_finder(&MyFrameFinder);
SK83RJOSH commented 6 months ago

Unfortunately still hits the same error as my custom linker script; error code 5 and aborts (though that's rather handy indeed) - my suspicion is that there's something going wrong at link time (perhaps a linker bug, or an issue with the linker script/target configuration) but I'm unfortunately not really equipped to confirm that.

Inspecting the elf with objdump at least confirmed that my linker script seemingly assigns things correctly (grepped eh_frame to keep it to the point): section_headers.txt symbols.txt

Here's the elf, if you'd like to analyze it (though don't feel you have to; it'd be lovely to get to the bottom of this but I have definitely asked far too much of you already): psp_tests.zip

One thing of note, is I did notice using the snippet from the README was causing the linker to emit .bss __eh_frame which definitely didn't seem correct. I ended up doing PROVIDE(__eh_frame = ADDR(.eh_frame)) which changed behavior somewhat, the execution time of _Unwind increased overall (which is the case with the custom finder as well - so it definitely had an effect, but I haven't debugged it just yet)

nbdd0121 commented 6 months ago

Briefly looked into the binary I didn't notice anything obviously wrong. Might be related to PSP/normal MIPS difference. AFAIK libunwind used by rust-psp is also patched...

SK83RJOSH commented 6 months ago

Perhaps it would wise of me to investigate what those patches were, and how they were building it. The thing that kicked this off was libunwind becoming unstable recently, so I was hoping to find a nice unwinding solution that we would have far more control over.

In any case, appreciate that you looked into it, I'll do some digging and see if there's anything interesting. :)

SK83RJOSH commented 6 months ago

So I haven't really been able to dig up any information on what/if any patches were made to libunwind; so for now I'm assuming they built it from trunk 🙂

The only info I've really managed to dig up is this. I assume this corresponds to Frame::from_context, and if so I do note some differences (particularly with how they detect end of stack + they manipulate ra in an odd way).

Anything you're able to glean from the differences there?

nbdd0121 commented 6 months ago

The fallback unwinding mechanism shouldn't be relevant here, it's only supposed to be used when there's no unwind table for a function.

If -Cpanic=unwind is used, then Rust should emit unwind table for all functions, so the fallback unwinding mechanism shouldn't be necessary at all. You can try to use -C force-unwind-tables=yes to force emission, although I doubt result will change.

SK83RJOSH commented 6 months ago

I tried that early on, but not post custom-fde + context fixes, may give it a try as a hail Mary tomorrow - yeah, as you suspected, the result was the same.

At this point I'm thinking I may just need mock things up in a test executable on a platform I can actually debug, and feed Gimli the eh_frame data + RAs from the mipsel executable. Then I can potentially get to the bottom of why it's failing to get the initial frame info.

Unfortunately very time consuming, but I think I've exhausted all my options, and I suppose it would at least confirm if the problem is with the DWARF data or not.

SK83RJOSH commented 6 months ago

And... that actually ended up giving me the most promising lead, I noticed all the of the FDE entries seems to be in a range of [0x0..0xDEADBEEF]; and any attempt to find an entry using RA always fails.

Just on a hunch, I decided to see where .text was mapped in memory (0x8804000) and decided to subtract that from RA in ::from_context + do the same for the address of _Unwind_Backtrace (frame.initial_address() == (_Unwind_Backtrace - 0x8804000)).

Which allows me to get a successful backtrace: image

Unfortunately, this doesn't fix _Raise or give me an idea of what the underlying issue is, but perhaps you have some ideas?

I would have assumed that BaseAddresses::set_text would be related to this in some way; but despite that being set correctly it doesn't change much without the manual offset of addresses.

nbdd0121 commented 6 months ago

That's very useful information! I am not aware that there's a fixed offset being applied at load time. This demystified everything.

It's important for the linker to know where the program will be loaded to (or, alternative, the program needs to compiled as PIE, in this case there'll be some runtime relocation fixups necessary for the program to do at init time). The unwinder really expects to see the correct address in this particular case.

The psp_test program that you attached doesn't do either -- the objdump --dwarf shows the augmentation data is 0b. The upper nibble of this data indicates the relocation mode, and 0 means absolute (it'll say 1 i.e. pcrel if you dump on a x64 Rust program, as majority of those are PIE now, 2 means textrel and only in this case the BaseAddress would be used). This explains why the text base have no effect.

I saw the linker script forced .text to locate at 0 and I thought it's actually loaded into that place. ELF does support different virtual address and physical address, you might be able to use that feature to convey the correct virtual address to the linker but still make PRX (I don't know what's that though) happy? Alternatively try use PIE/PIC flags to see if it can work.

Link that might be useful: https://sourceware.org/binutils/docs/ld/Output-Section-LMA.html

SK83RJOSH commented 6 months ago

Interestingly, it seems like the linker totally ignores any and all attempts to change LMA/VMA; the resulting .text section always has a VMA/LMA of 00000000. I came across this thread where someone else seemingly ran into a similar issue but I'm stumped as of right now. 🙂

And PRX is the Sony flavor of ELF - the ELF I sent you actually ends up going through a process that rewrites the executable sections and so on to match the expected format on the PSP. Though that should be mostly irrelevant to this issue, since this seems to be some sort of linker issue happening long before that step.

--

Managed to get around that with OVERWRITE_SECTIONS but it appears that actually, the PRX does expect .text to have a VMA of zero... and PIE/PIC seem to cause their own troubles as well. This is a tough nut to crack. 😞

I'll probably need to do a bit of digging and see what exactly is going on here; because it clearly does want things to be loaded at zero initially (crashes otherwise), but then it seems like things are relocated/remapped in some way to that address after initialization.

SK83RJOSH commented 6 months ago

Yeah unfortunately, I think the vaddr being set to zero is going to be a hard requirement for the PRX to load correctly. So I don't think there's much I can do at link time, and there's definitely some Sony specific relocation happening.

Would you happen to have any suggestions? Since we're post-processing the ELF anyway, I have to wonder if patching up the .eh_frame/_hdr data might be an option? Though I'm not sure just how complex that would end up being. Outside of that, I don't think PIC/PIE will work, but maybe there's something that can be done there in one way or another.

nbdd0121 commented 6 months ago

What's the issue with PIC/PIE?

SK83RJOSH commented 6 months ago

Mostly just unfamiliar with it, and slightly concerned it will degrade performance (though, of course, I suppose in release relocation-model=static + panic=abort could be heavily recommended).

From my testing, compilation works just fine, but I get an invalid memory access immediately after starting the executable. I presume this is due to me needing to do some sort of setup in executable start?

nbdd0121 commented 6 months ago

I did some further research and I think PIC/PIE is indeed probably not needed. I think the tool flow is that an ELF linker first produce an executable with relocation enabled, and then the PRX tool will use these relocation information. In this case, the eh_frame in memory should be pointing to the relocated PC address IMO?

Does libunwind work for the same binary?

nbdd0121 commented 6 months ago

Also, have you tried to change linker from LLVM ld to GNU ld? This could also make a difference.

SK83RJOSH commented 6 months ago

In this case, the eh_frame in memory should be pointing to the relocated PC address IMO?

Yeah, you would think so? I was looking at the PRX tool yesterday to try to spot some issue there, but found nothing immediately obvious. I was going to put some time into rewriting it today, just to eliminate the tool as a possible concern. It's based on a much older tool written in C, so I was going to look there to see if I can spot any differences.

Does libunwind work for the same binary?

It used to, but with a recent nightly bump it started to exhibit consistent crashes. Before that it worked reasonably well, with occasional failures – which is why I've been leaning towards an issue with rust-lld but on the surface things seem fine. Not really sure where to look upstream in rustc/LLVM for possible regressions either – I might end up bisecting my way to find it which will be very fun... 😅

I think PIC/PIE is indeed probably not needed

Aye probably not, but it would be interesting to see if it unblocks things for now. I've been reading up on it a bit today, and trying to find some information on what kind of preinit needs to happen in executables that use it. Unfortunately struggling to find much of anything, which is unfortunate.

SK83RJOSH commented 6 months ago

Also, have you tried to change linker from LLVM ld to GNU ld? This could also make a difference.

Not yet, but that's easy enough to try. :)

Unfortunately not a simple thing to do in this case, fails to link, and I'll probably need to install the custom gcc toolchain to get it working. More effort than I'd want to ask of a user, so I'll continue with the other mitigation investigations for now.