oracle / graal

GraalVM compiles Java applications into native executables that start instantly, scale fast, and use fewer compute resources 🚀
https://www.graalvm.org

[native-image] AArch64: Using GDB errors with unknown CFA value #2378

Open a74nh opened 4 years ago

a74nh commented 4 years ago

Take a simple Java program. On AArch64, create a native image using the debug info flag, then load it into GDB. Break at main, run, then backtrace. GDB fails with the internal error "Unknown CFA rule".

$ native-image -H:GenerateDebugInfo=1 FPAdder

$ gdb fpadder
GNU gdb (GDB) 10.0.50.20200416-git
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "aarch64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./fpadder...
(gdb) b main
Breakpoint 1 at 0x2300c: main. (2 locations)
(gdb) r
Starting program: /home/alahay01/javatut/fpadder
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".

Breakpoint 1, 0x0000aaaaaaade528 in com.oracle.svm.core.code.IsolateEnterStub::JavaMainWrapper_run_5087f5482cc9a6abc971913ece43acb471d2631b(int, org.graalvm.nativeimage.c.type.CCharPointerPointer)(void) ()
    at com/oracle/svm/core/code/IsolateEnterStub.java:5
5   com/oracle/svm/core/code/IsolateEnterStub.java: No such file or directory.
(gdb) bt
#0  0x0000aaaaaaade528 in com.oracle.svm.core.code.IsolateEnterStub::JavaMainWrapper_run_5087f5482cc9a6abc971913ece43acb471d2631b(int, org.graalvm.nativeimage.c.type.CCharPointerPointer)(void) ()
    at com/oracle/svm/core/code/IsolateEnterStub.java:5
../../src/binutils-gdb/gdb/dwarf2/frame.c:1077: internal-error: Unknown CFA rule.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n)

Tested on Ubuntu 18.04 with gdb 8.1.0.20180409-git (the Ubuntu default) and 10.0.50.20200416-git, built from gdb head last week.

Graal built from git head b47aaf4814 (2020-04-14) using labsjdk-ce-11.0.6-jvmci-20.0-b02.

olpaw commented 4 years ago

@adinn looks like com.oracle.objectfile.elf.dwarf.DwarfFrameSectionImplAArch64 needs further refinement for AArch64.

adinn commented 4 years ago

The CFA rule in the frame CIE is definitely wrong, but that's not the only thing going on here. Comparing with C frame records, the CIE needs to be empty and the PC register needs to be marked as lr (r30). Unfortunately those fixes still do not stop the gdb crash. I need to debug gdb to see what is going on here (doing that now).

Just for the record, this seems to be gdb trying to be smart about the generated code. The break address used when a method breakpoint is set on AArch64 is some (~5/6) instructions into the code segment. This does not appear to correspond to any meaningful point in the Graal native generated code. If you print the code by address, gdb definitely knows the correct start address for the method. The first instruction labelled as belonging to the method corresponds to the address specified in the info section.

It looks like gdb is trying to insert the break point after what it assumes is a method prolog. It does the same skip forward from the first instruction when given a compiled C program. In the example I used it placed the break at the point where the frame has been constructed and locals have been written to the frame.

I tried inserting an address break at the first instruction of the method, but gdb still crashes when it reaches that point if you try to obtain a backtrace. I suspect it may not like the FDE records which are supposed to tell it that the frame has been built, because it is expecting to recognise the frame build sequence itself by looking at the instructions.

I'll follow up when I get more info.

a74nh commented 4 years ago

> It looks like gdb is trying to insert the break point after what it assumes is a method prolog. It does the same skip forward from the first instruction when given a compiled C program. In the example I used it placed the break at the point where the frame has been constructed and locals have been written to the frame.

Yes. aarch64_analyze_prologue() in aarch64-tdep.c does that.

Other architectures do the same in GDB too, e.g. amd64.

I did fix a bug in the AArch64 function on 2019-08-14. If you get an assert in that function, then try a newer gdb.

adinn commented 4 years ago

Well, I fixed up the CFA issue. I had bollixed up the CIE. I have a patch which stops it crashing, but the FDEs are still not correct. I can't really fix that until I sort out the prologue issue.

It seems the break is being inserted at the first control-flow transfer which is not really good enough as that can be an arbitrary way into the body of the method code.

I am sure there is an 'equivalent' check on x86_64 but it doesn't manifest the same behaviour. On that architecture the method breakpoint always seems to be planted at the start of the method.

If I can work out what type of frame setup code gdb recognizes then we might perhaps tweak the AArch64 frame setup code to conform to what gdb expects. Alternatively, we might need to patch gdb/aarch64. I'll post details as I find them.

a74nh commented 4 years ago

> If I can work out what type of frame setup code gdb recognizes then we might perhaps tweak the AArch64 frame setup code to conform to what gdb expects. Alternatively, we might need to patch gdb/aarch64. I'll post details as I find them.

Ideally sticking to what gdb expects helps with compatibility across the board with tools in general. GDB guys are usually happy with small tweaks for non standard stuff as long as it's not going to break anything else. And of course, changing gdb then requires graal users to update their gdb.

adinn commented 4 years ago

Ok, so using gdb 8.3 it turns out there is a difference between the routines amd64_skip_prolog and aarch64_skip_prologue, which are supposed to find the offset to the method body after the method prologue. It is this, rather than any difference in the debug info, that seems to be causing a problem setting the breakpoint.

The AArch64 code checks for two successive debug line records at or following the start pc. If found, it assumes the first identifies the prologue and treats the second as the address where the method code starts for real. If there is no line info for the method, or only one line entry, then it calls aarch64_analyze_prologue to look for known prologue patterns. When it finds a pattern it recognises, it sets the pc to the instruction following that pattern. If there is no match it uses the start pc as the method start.

The Java JIT does not generate a line record for the prologue and then for the post prologue code. So, in any method with more than one line it will never call aarch64_analyze_prologue. It just assumes the address in the second line record is the start of the method code. As a result, the break point normally gets placed at the second line in the source for the method.

The x86_64 code differs in that it only assumes two successive line records indicate prologue then first line of code when the compile unit was generated by clang. Since the producer here is not clang, the pattern approach is tried, fails, and reverts to the method start pc.
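To make the difference concrete, here is a rough Python model of the two strategies as described in this comment. It is only a sketch of the description above, not GDB's actual code: the function names, the `analyze` callback (standing in for the pattern matchers such as aarch64_analyze_prologue) and the producer string in the demo are all made up for illustration.

```python
from typing import Callable, List, Optional

# Hypothetical stand-in for the instruction pattern matchers: returns the pc
# just after a recognised prologue, or None when no pattern matches.
PrologueAnalyzer = Callable[[int], Optional[int]]


def _records_from(start_pc: int, line_addrs: List[int]) -> List[int]:
    """Line record addresses at or following start_pc, in address order."""
    return sorted(a for a in line_addrs if a >= start_pc)


def skip_prologue_aarch64(start_pc: int, line_addrs: List[int],
                          analyze: PrologueAnalyzer) -> int:
    """AArch64 as described: trust the line table whenever it has two records."""
    records = _records_from(start_pc, line_addrs)
    if len(records) >= 2:
        # First record is assumed to be the prologue, second the real body.
        return records[1]
    matched = analyze(start_pc)
    return matched if matched is not None else start_pc


def skip_prologue_amd64(start_pc: int, line_addrs: List[int], producer: str,
                        analyze: PrologueAnalyzer) -> int:
    """x86_64 as described: trust the line table only for clang output."""
    records = _records_from(start_pc, line_addrs)
    if "clang" in producer.lower() and len(records) >= 2:
        return records[1]
    matched = analyze(start_pc)
    return matched if matched is not None else start_pc


def no_match(pc: int) -> Optional[int]:
    """Models generated code whose prologue no known pattern recognises."""
    return None


# A method starting at 0x1000 with one line record per source line.
lines = [0x1000, 0x1018, 0x1030]
print(hex(skip_prologue_aarch64(0x1000, lines, no_match)))  # 0x1018
print(hex(skip_prologue_amd64(0x1000, lines, "hypothetical-native-image",
                              no_match)))                   # 0x1000
```

With the same line table, the AArch64 path lands on the second source line while the x86_64 path falls back to the method start, which matches the behaviour seen on the two architectures.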

I think this probably needs fixing in the gdb code. One thing I might be able to do as a workaround is generate an extra special line record at the point where the first increment to the stack frame size is notified.
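A minimal Python sketch of that workaround idea, under the simplified "second line record" model described above. The helper names and the `(address, line)` tuple layout are hypothetical, and whether real gdb would accept a second record carrying the same source line has not been checked here.

```python
from typing import List, Optional, Tuple

LineEntry = Tuple[int, int]  # (code address, source line number)


def add_frame_built_record(entries: List[LineEntry],
                           frame_built_pc: int) -> List[LineEntry]:
    """Return a sorted copy of entries with an extra record at frame_built_pc.

    The extra record reuses the source line of the record that already
    covers frame_built_pc, so the reported source position is unchanged.
    """
    ordered = sorted(entries)
    if any(addr == frame_built_pc for addr, _ in ordered):
        return ordered  # a record already starts at that address
    covering = [line for addr, line in ordered if addr <= frame_built_pc]
    if not covering:
        raise ValueError("frame_built_pc precedes every line record")
    return sorted(ordered + [(frame_built_pc, covering[-1])])


def second_record_address(entries: List[LineEntry],
                          start_pc: int) -> Optional[int]:
    """Address of the second line record at or after start_pc, if any."""
    addrs = sorted(addr for addr, _ in entries if addr >= start_pc)
    return addrs[1] if len(addrs) >= 2 else None


# One record per source line; the frame is fully built at 0x1008.
before = [(0x1000, 10), (0x1030, 11)]
after = add_frame_built_record(before, 0x1008)

print(hex(second_record_address(before, 0x1000)))  # 0x1030
print(hex(second_record_address(after, 0x1000)))   # 0x1008
```

In this model the breakpoint address moves from the second source line back to the point where the frame has just been built.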

I think I also need to add some extra instructions to the FDE records to deal with the fact that lr and fp are being saved to the stack. At present lr is flagged as holding the return pc when the call occurs, and a break at the true start successfully unwinds the stack. However, I think that needs to be corrected at the point the stack grows so that fp and lr are marked as saved to cfa - framesize and cfa - (framesize-8), respectively. That also means correcting at the point where the stack shrinks back to empty to note that they have been restored. Currently, without the correction the stack unwind just keeps finding the same stack frame.
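As an illustration of what those extra FDE instructions could look like, here is a hedged Python sketch that encodes the two corrections as DWARF call frame instructions. The opcode values and the AArch64 DWARF register numbers (x29 = 29, x30 = 30) come from the DWARF and AArch64 ABI specifications; the data alignment factor of -8, the assumption that the CFA is defined as sp plus an offset, the use of DW_CFA_restore for the shrink, and the function names are all assumptions for this example, not what Graal's DwarfFrameSectionImplAArch64 actually emits. The DW_CFA_advance_loc instructions that would place these at the right pc are omitted.

```python
DW_CFA_def_cfa_offset = 0x0E  # operand: ULEB128 offset
DW_CFA_offset = 0x80          # high two bits 0x2, register in low six bits
DW_CFA_restore = 0xC0         # high two bits 0x3, register in low six bits

FP, LR = 29, 30               # AArch64 DWARF numbers for x29 and x30
DATA_ALIGN = -8               # assumed CIE data alignment factor


def uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("ULEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def frame_grow_cfi(framesize: int) -> bytes:
    """CFI for the point where the stack grows by framesize bytes.

    Marks fp as saved at cfa - framesize and lr at cfa - (framesize - 8).
    """
    if framesize < 16 or framesize % 8:
        raise ValueError("framesize must be a multiple of 8 and at least 16")
    fp_factored = -framesize // DATA_ALIGN
    lr_factored = -(framesize - 8) // DATA_ALIGN
    return (bytes([DW_CFA_def_cfa_offset]) + uleb128(framesize)
            + bytes([DW_CFA_offset | FP]) + uleb128(fp_factored)
            + bytes([DW_CFA_offset | LR]) + uleb128(lr_factored))


def frame_shrink_cfi() -> bytes:
    """CFI for the point where the stack shrinks back to empty."""
    return (bytes([DW_CFA_def_cfa_offset]) + uleb128(0)
            + bytes([DW_CFA_restore | FP, DW_CFA_restore | LR]))


print(frame_grow_cfi(32).hex())  # 0e209d049e03
print(frame_shrink_cfi().hex())  # 0e00ddde
```

For a 32-byte frame this records fp at cfa-32 and lr at cfa-24, so an unwinder reading the FDE can recover the caller's fp and return address instead of finding the same frame again.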

a74nh commented 4 years ago

The complete "skip prologue code if clang" block was added back in Dec 2012 (https://sourceware.org/pipermail/gdb-patches/2012-December/098215.html), due to the prologue being different on clang. Instead of adding additional code to amd64_analyze_prologue to pattern match the clang code, it was easier to trust that clang was producing usable line notes. At the time GCC was not trusted ("compilers can not always be trusted to emit the right information for this to work. In the past GCC has been particularly flaky in this respect").

Fast forward a few months and the AArch64 backend was added in a single patch. It has the skip prologue part, but doesn't check for Clang. I'm going to hazard a guess that because the gcc AArch64 backend was brand new, the writers decided that they did trust all versions of AArch64 GCC.

I thought all this code sounded familiar - I posted a patch in this area back in 2018. That patch fell by the wayside - but this comment is pertinent: "but I would much rather have a blacklist of bad compilers than the current approach of a whitelist of good ones". (https://sourceware.org/pipermail/gdb-patches/2018-November/154187.html)

What to do now? I'd be concerned about accidentally affecting other compilers. So maybe a negative check for native-image in aarch64 and amd64 is the correct route.

However, "The Java JIT does not generate a line record for the prologue and then for the post prologue code". With my GDB hat on, I'd ask why not, and can it be added? (I'm also unclear what you mean by Java JIT: the Graal compiler, SVM, OpenJDK, etc.) If it could be done, it makes things a lot easier down the road (for example, no need to then fix it for lldb if people went down that route).