jart / blink

tiniest x86-64-linux emulator
ISC License

ppc64le JIT #17

Open classilla opened 1 year ago

classilla commented 1 year ago

This isn't a request to write it; I can write this. The question is whether it would be accepted (the comments in jit.c seem to imply the desire is to keep it x86_64 and aarch64 only, perhaps I've read them wrong).

jart commented 1 year ago

I would love to see Blink be able to JIT ppc64le. Especially if being able to do so doesn't increase the binary footprint of our x86-64 and aarch64 builds. If you can help us do that, then please join our Discord and have fun hacking with us! https://discord.gg/vFdkMdQN

gorsing commented 1 year ago

I hope ppc64le support will be

jart commented 1 year ago

This issue hasn't been updated in a while, so I intend to make an announcement.

I will implement JIT support for the IBM OpenPOWER architecture if someone donates to me either the Talos™ II 2U Rack Mount Server or Talos™ II Desktop Development System. The rack mounted one might be better, since it has 36 cores and would therefore let me compile code faster for all our users. It would cost $10,669.99 and I could deliver top-notch x86_64 JITing for POWER users in less than a month, made freely available under an ISC license.

classilla commented 1 year ago

I've wrestled with this off and on for a while and I'm blocked on a crash I can't resolve. If I may gently prod, the code that apparently needs to be updated to bring up a new JIT is spread over multiple places that aren't always obviously marked, so I've probably missed something I don't know to fix. This is the current patch, with several things commented out that don't work yet; if I read it right, they should only affect the quality of the generated code, not its functionality.

With gdb --args o//blink/blink -es build/bootstrap/mkdeps.com it ends up bombing out in OpStos with evidence of stack corruption (can't unwind past ExecuteInstruction) after executing for a while, so this is tough to debug. I tried to pattern it after aarch64, but some places generated ARM64 code without clearly saying so, ahem.

If you wouldn't mind having a look at the patch, what have I missed? It codegens fine and starts execution, so the rudiments work. It is limited to ppc64le but it may work fine on big-endian ppc64 when this is done.

argh.txt

classilla commented 1 year ago

I did notice that I got IsRet wrong and fixed that to be more like ARM, but that isn't the problem here. The current crash looks like this:

% gdb --args o//blink/blink -es build/bootstrap/mkdeps.com
GNU gdb (GDB) Fedora Linux 13.1-4.fc38
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "ppc64le-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from o//blink/blink...
(gdb) run
Starting program: /home/spectre/src/blink/o/blink/blink -es build/bootstrap/mkdeps.com
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
I2023-06-01T10:40:25.265590:blink/loader.c:708:816742 (sys) LoadProgram build/bootstrap/mkdeps.com
I2023-06-01T10:40:25.265771:blink/loader.c:100:816742 (sys) PT_LOAD R.X [400000,42e000) build/bootstrap/mkdeps.com
I2023-06-01T10:40:25.265854:blink/loader.c:100:816742 (sys) PT_LOAD RW. [42e000,456000) build/bootstrap/mkdeps.com
FuseBranchCmp
FuseBranchTest
FuseBranchTest
FuseBranchTest
FuseBranchCmp
FuseBranchTest
FuseBranchTest
FuseBranchTest
FuseBranchCmp
FuseBranchCmp
FuseBranchTest
FuseBranchTest
FuseBranchTest

Program received signal SIGSEGV, Segmentation fault.
0x0000000100048fc0 in StringOp (m=0x101ff7230, rde=297425592599457792, disp=0, 
    uimm0=0, op=op@entry=2) at blink/string.c:145
145         switch (op) {
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.37-4.fc38.ppc64le zlib-1.2.13-3.fc38.ppc64le
(gdb) bt
#0  0x0000000100048fc0 in StringOp (m=0x101ff7230, rde=297425592599457792, 
    disp=0, uimm0=0, op=op@entry=2) at blink/string.c:145
#1  0x000000010004987c in OpStos (m=<optimized out>, rde=<optimized out>, 
    disp=<optimized out>, uimm0=<optimized out>) at blink/string.c:301
#2  0x00000001000de008 in g_code ()
#3  0x0000000100029b00 in ExecuteInstruction (m=0x101ff7230)
    at blink/machine.c:2205
#4  ExecuteInstruction (m=0x101ff7230) at blink/machine.c:2194
#5  0x00004fffffffeb18 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

classilla commented 1 year ago

And the disassembly going up to OpStos. This looks pretty normal, so the foul probably occurred earlier.

(gdb) disas 0x00000001000de008-0x40, 0x00000001000de008+0x40
Dump of assembler code from 0x1000ddfc8 to 0x1000de048:
   0x00000001000ddfc8 <g_code+171544>:  lis     r6,0
   0x00000001000ddfcc <g_code+171548>:  ori     r6,r6,0
   0x00000001000ddfd0 <g_code+171552>:  sldi    r6,r6,32
   0x00000001000ddfd4 <g_code+171556>:  oris    r6,r6,0
   0x00000001000ddfd8 <g_code+171560>:  ori     r6,r6,0
   0x00000001000ddfdc <g_code+171564>:  lis     r5,0
   0x00000001000ddfe0 <g_code+171568>:  ori     r5,r5,0
   0x00000001000ddfe4 <g_code+171572>:  sldi    r5,r5,32
   0x00000001000ddfe8 <g_code+171576>:  oris    r5,r5,0
   0x00000001000ddfec <g_code+171580>:  ori     r5,r5,0
   0x00000001000ddff0 <g_code+171584>:  lis     r4,1056
   0x00000001000ddff4 <g_code+171588>:  ori     r4,r4,43776
   0x00000001000ddff8 <g_code+171592>:  sldi    r4,r4,32
   0x00000001000ddffc <g_code+171596>:  oris    r4,r4,10752
   0x00000001000de000 <g_code+171600>:  ori     r4,r4,12288
   0x00000001000de004 <g_code+171604>:  bl      0x100049858 <OpStos>
   0x00000001000de008 <g_code+171608>:  lis     r5,0
   0x00000001000de00c <g_code+171612>:  ori     r5,r5,0
   0x00000001000de010 <g_code+171616>:  sldi    r5,r5,32
   0x00000001000de014 <g_code+171620>:  oris    r5,r5,0
   0x00000001000de018 <g_code+171624>:  ori     r5,r5,1638
   0x00000001000de01c <g_code+171628>:  lis     r6,0
   0x00000001000de020 <g_code+171632>:  ori     r6,r6,0
   0x00000001000de024 <g_code+171636>:  sldi    r6,r6,32
   0x00000001000de028 <g_code+171640>:  oris    r6,r6,0
   0x00000001000de02c <g_code+171644>:  ori     r6,r6,1638
   0x00000001000de030 <g_code+171648>:  lis     r7,0
   0x00000001000de034 <g_code+171652>:  ori     r7,r7,0
   0x00000001000de038 <g_code+171656>:  sldi    r7,r7,32
   0x00000001000de03c <g_code+171660>:  oris    r7,r7,0
   0x00000001000de040 <g_code+171664>:  ori     r7,r7,1638
   0x00000001000de044 <g_code+171668>:  lis     r8,0
End of assembler dump.
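
For readers less used to Power assembly, the repeated lis / ori / sldi / oris / ori groups in the dump each build one 64-bit constant from four 16-bit pieces. The following is a minimal sketch of what that sequence computes; `Load64`, `Split64`, and `CheckDumpValue` are hypothetical names for illustration and are not taken from the patch or from Blink.

```c
#include <assert.h>
#include <stdint.h>

/* Models the five-instruction load-immediate sequence from the dump.
   hh, hl, lh, ll are the four 16-bit pieces, most significant first. */
static uint64_t Load64(uint16_t hh, uint16_t hl, uint16_t lh, uint16_t ll) {
  /* lis rD,hh: rD = (hh << 16), sign-extended from 32 to 64 bits */
  uint64_t r = (uint64_t)(int64_t)(int32_t)((uint32_t)hh << 16);
  r |= hl;                 /* ori  rD,rD,hl */
  r <<= 32;                /* sldi rD,rD,32 (sign-extension bits fall off) */
  r |= (uint64_t)lh << 16; /* oris rD,rD,lh */
  r |= ll;                 /* ori  rD,rD,ll */
  return r;
}

/* Splits a constant into the four immediates the sequence needs. */
static void Split64(uint64_t v, uint16_t out[4]) {
  out[0] = (uint16_t)(v >> 48);
  out[1] = (uint16_t)(v >> 32);
  out[2] = (uint16_t)(v >> 16);
  out[3] = (uint16_t)v;
}

/* The r4 immediates just before the bl to OpStos reproduce the rde
   argument shown in frame #0 of the backtrace. */
static void CheckDumpValue(void) {
  assert(Load64(1056, 43776, 10752, 12288) == 297425592599457792ULL);
}
```

Read this way, the r4 load before the call matches the rde value in the backtrace, which is consistent with the remark that the call site itself looks normal.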

jart commented 1 year ago

This is very exciting news! I'll have time to respond in the next few days. We're also talking about your contribution on our Discord. https://discord.gg/HQNA9faw We'd love if you joined us!

tkchia commented 1 year ago

Hello @classilla,

I am not familiar with the PowerPC assembly or debugging — but are you able to dump the state of the registers at the crash site?

Incidentally, I noticed that, when I tried running o/test/asm/add.com under o/powerpc64le/blink/blink (with QEMU emulation), I would get a

blink/jit.c:1919:14593 assertion failed: !(disp & 0x03) (0)
     PC 12b8c9012f0c mov %rax,0x30(%rsp) 48 89 44 24 30 48 8d 05

Some further exploration suggests that this was caused by Blink trying to insert a jump from an OomJit() address to some other place.

Thank you!

classilla commented 1 year ago

Yeah, I can reproduce that. I'm trying to find where that's set off (again, is there some other section of the JIT that I've missed?).

Looks like the crash was caused by TOC getting stomped on certain calls. I was hoping to avoid setting r12 to the destination address on every call but this seems unavoidable. It gets further now.
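
As background for the r12 remark: under the 64-bit ELFv2 ABI, a function's global entry point derives its TOC pointer (r2) from r12, so a caller that cannot guarantee a matching TOC is expected to put the callee's address in r12 before branching. The sketch below encodes one such call sequence (load the address into r12, then mtctr and bctrl). It is an illustration under that assumption, not the code from the patch; `EmitCallViaR12` and the small encoder helpers are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { kR12 = 12 };

/* Instruction encoders for the handful of opcodes used below. */
static uint32_t Lis(int rd, uint16_t imm) {
  return 0x3C000000u | (uint32_t)rd << 21 | imm;
}
static uint32_t Ori(int ra, int rs, uint16_t imm) {
  return 0x60000000u | (uint32_t)rs << 21 | (uint32_t)ra << 16 | imm;
}
static uint32_t Oris(int ra, int rs, uint16_t imm) {
  return 0x64000000u | (uint32_t)rs << 21 | (uint32_t)ra << 16 | imm;
}
static uint32_t Sldi32(int ra, int rs) { /* sldi ra,rs,32 */
  return 0x780007C6u | (uint32_t)rs << 21 | (uint32_t)ra << 16;
}
static uint32_t Mtctr(int rs) {
  return 0x7C0903A6u | (uint32_t)rs << 21;
}
static uint32_t Bctrl(void) {
  return 0x4E800421u;
}

/* Writes seven instruction words to out and returns the count.
   After the sequence runs, r12 holds target at the callee's entry. */
static size_t EmitCallViaR12(uint32_t *out, uint64_t target) {
  size_t n = 0;
  assert(out != NULL);
  out[n++] = Lis(kR12, (uint16_t)(target >> 48));
  out[n++] = Ori(kR12, kR12, (uint16_t)(target >> 32));
  out[n++] = Sldi32(kR12, kR12);
  out[n++] = Oris(kR12, kR12, (uint16_t)(target >> 16));
  out[n++] = Ori(kR12, kR12, (uint16_t)target);
  out[n++] = Mtctr(kR12); /* CTR = target */
  out[n++] = Bctrl();     /* call through CTR, setting LR */
  return n;
}
```

This costs six more instructions than a plain bl, which is why skipping the r12 load for helpers known not to touch the TOC is attractive as a later optimization.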

classilla commented 1 year ago

What should OomJit() point to? How did that ever work for aarch64?

tkchia commented 1 year ago

Hello @classilla,

> What should OomJit() point to? How did that ever work for aarch64?

See OomJit() in blink/jit.c.

OK, I think I know what is going on. When the AArch64 JITter (e.g.) finds that there is no more space in the JIT buffer,

In such cases it is OK for the JIT location counter to be unaligned, since the code will be discarded anyway. So perhaps instead of

  unassert(!(disp & 0x03));

you could just say

  unassert(!(disp & 0x03) || jb->index > kJitBlockSize);

Thank you!

classilla commented 1 year ago

This is the current checkpoint. It is enough to execute o//blink/blink build/bootstrap/mkdeps.com and many of the tests (in particular cosmo/2/test_suite_md.com and cosmo/2/test_suite_mpi.com are indeed 5-6x faster), but other tests that should pass quickly seem to hang indefinitely. This includes o//blink/blink third_party/cosmo/2/palandprintf_test.com and o//blink/blink third_party/cosmo/2/cos_test.com. Is there something weird about floating point I haven't accounted for?

The code it generates now is pretty good, but for many of the micro ops it seems unnecessary to load r12 since they don't reference the TOC, and it would be nice to eliminate it for those quick load/store/gimme-register functions which get called a lot. Maybe I can come up with a white list. Some of those functions are single instructions and seem ideal for inlining if that ever becomes a thing. About the only thing missing is the ability to fuse overflow checks, because we have to go to XER for that, not the regular condition register fields.

Is there a way to debug calls?

checkpoint-20230602.txt

tkchia commented 1 year ago

Hello @classilla,

Do you mean you want to step into a function that is being called? You can probably use GDB or LLDB's step and/or stepi commands for that (unless I am missing something).

Thank you!

classilla commented 1 year ago

No (I'm well aware of what those do, probably wouldn't have been able to write anything without them ;-). What I want is to instrument which x86 instructions map to which blocks of generated code so I can understand where the infinite loop is coming from. If this isn't easily possible, I may put this aside for a while again, since I don't have any further way to understand the tests that fail.

tkchia commented 1 year ago

Hello @classilla,

> What I want is to instrument what x86 instructions map to what blocks of generated code so I can understand where the infinite loop is coming from.

It might be helpful to dump m->ip (m normally goes into the register kJitSav0) to get an idea of which basic block in the guest code is being run.

(In non-JIT mode, m->ip - m->oplen should give precise guest %rip values, but as the README explains, JITted code may try not to update m->ip unless really necessary.)
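
One way to act on this, sketched under assumptions: record the guest address each time a translated block is entered, keeping only the most recent few, so that a hang shows up as the same handful of addresses repeating. Nothing below is Blink code. `BlockTrace`, `TraceBlock`, and `RecentBlock` are hypothetical names, and where such a hook would be called from in the JIT is left open.

```c
#include <assert.h>
#include <stdint.h>

enum { kTraceLen = 8 }; /* hypothetical ring size */

struct BlockTrace {
  uint64_t ip[kTraceLen]; /* guest addresses of recently entered blocks */
  unsigned long count;    /* total number of entries recorded so far */
};

/* Records one block entry, overwriting the oldest slot when full. */
static void TraceBlock(struct BlockTrace *t, uint64_t guest_ip) {
  t->ip[t->count++ % kTraceLen] = guest_ip;
}

/* Returns the k-th most recent entry (k = 0 is the newest). */
static uint64_t RecentBlock(const struct BlockTrace *t, unsigned long k) {
  assert(k < t->count && k < kTraceLen);
  return t->ip[(t->count - 1 - k) % kTraceLen];
}
```

A fixed-size ring keeps the overhead per block entry to one store and one increment, and the buffer can be inspected from gdb after interrupting a hung run.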

> Some of those functions are single instructions and seem ideal for inlining if that ever becomes a thing.

The x86-64 JITter does know how to inline the more "trivial" micro-ops into the JIT stream. See the implementation of CallMicroOp() in blink/uop.c. The AArch64 JITter does not do this yet, but I am working on implementing it (https://github.com/jart/blink/pull/145). You could probably do something similar.

Thank you!

classilla commented 1 year ago

I eventually started stepping through the code with blinkenlights -j third_party/cosmo/2/cos_test.com to see where it diverges from a non-JIT run. It ends up making three normal calls to dtoa but the fourth call is where it goes haywire.

The code gets to 00414d9c * mov %rax,%r14. On the non-JIT run, single stepping goes to 00414d9f mov %r13d,%esi (the next instruction), as expected, but on the JIT run a single step immediately jumps to 00415400 movl $1,-0x8c(%rbp).

That doesn't make any sense. Did I forget to convert a section of code in my patch?

tkchia commented 1 year ago

Hello @classilla,

This is probably expected and OK. In JIT mode, "single stepping" will not really step through just the next instruction, but instead it will run through an entire translated basic block. If the guest state is correct by the time the guest reaches %rip = 0x415400 then there should be no problem.

Thank you!

classilla commented 1 year ago

That's going to be a problem, because the divergence occurs in that entire segment it flies through (the guest state is not correct at the end of the basic block).
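
Since a JIT "step" covers a whole basic block, the two runs can only be compared at block boundaries. A rough sketch of that comparison follows, assuming guest state has been captured at the same boundaries in both a JIT and a non-JIT run. `Snapshot` and `FirstDivergence` are hypothetical names, and how the snapshots get collected is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Guest state captured at a basic block boundary. */
struct Snapshot {
  uint64_t ip;       /* guest %rip at the boundary */
  uint64_t regs[16]; /* guest general purpose registers */
};

/* Returns the index of the first boundary where the two runs differ,
   or -1 if all n snapshots match. */
static long FirstDivergence(const struct Snapshot *jit,
                            const struct Snapshot *interp, size_t n) {
  assert(jit != NULL && interp != NULL);
  for (size_t i = 0; i < n; i++) {
    /* All fields are uint64_t, so the struct has no padding bytes. */
    if (memcmp(&jit[i], &interp[i], sizeof(struct Snapshot)) != 0) {
      return (long)i;
    }
  }
  return -1;
}
```

If boundary i is the first mismatch, the faulty translation lies in the block that ends there, and the register that differs narrows down which guest instruction in that block to look at.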