seanjensengrey / unladen-swallow

Automatically exported from code.google.com/p/unladen-swallow

Make LLVM-compiled functions faster than interpreted functions #47

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
This is a summary bug to track everything I'm trying while optimizing IR
generation and codegen. This doesn't cover more aggressive optimizations
like type feedback and const-prop-based devirtualization. It does cover
issues 16, 17, and 18.

Original issue reported on code.google.com by jyass...@gmail.com on 4 Jun 2009 at 3:52

GoogleCodeExporter commented 9 years ago
One possible problem with 64-bit builds is that the heap may be far away from the code
segment. x86-64 can't represent a direct call more than 32 bits away, so instead the JIT
inserts stubs like:

(gdb) disassemble 0x7ffff5d40fc0 0x7ffff5d40fcd
Dump of assembler code from 0x7ffff5d40fc0 to 0x7ffff5d40fcd:
0x00007ffff5d40fc0:     mov    $0x8cbc42,%r10
0x00007ffff5d40fca:     jmpq   *%r10
End of assembler dump.
(gdb) disassemble 0x7ffff5d40fd0 0x7ffff5d40fdd
Dump of assembler code from 0x7ffff5d40fd0 to 0x7ffff5d40fdd:
0x00007ffff5d40fd0:     mov    $0x8c9e9e,%r10
0x00007ffff5d40fda:     jmpq   *%r10
End of assembler dump.
...

These appear every 16 bytes. The obvious slowness is that these are indirect jumps, which
make the branch predictor much less happy than direct calls. If I were a processor
designer, I might special-case this mov+jmp pair to avoid needing prediction, but I'll
need to measure to see if real processors do.

The second possible problem is that the indirect jump predictor may not distinguish
between indirect jumps that are too close together. I've heard that the conditional
branch predictor groups jumps in 32-byte lines. I need to check whether the indirect
jump predictor does anything similar.

We could fix this whole problem by allocating JITted code closer to the code page,
rather than from the heap. Unfortunately, there may be no way to do so portably.
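
A minimal sketch of that idea, assuming Linux/x86-64; the helper name and the 64 MB
offset are invented for illustration, and the kernel is free to ignore the hint, so a
real allocator would check the resulting distance and fall back to stubs if it is out of
+/-2GB range:

/* Illustrative only: ask for JIT pages near the statically compiled code. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

static void *alloc_code_near(void *anchor, size_t size)
{
    /* Hint a little above the binary's own text segment (page-aligned).
     * Without MAP_FIXED this is only a hint; the kernel may place the
     * mapping elsewhere. */
    uintptr_t hint = ((uintptr_t)anchor + (64u << 20)) & ~(uintptr_t)0xfff;
    void *mem = mmap((void *)hint, size,
                     PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return mem == MAP_FAILED ? NULL : mem;
}

int main(void)
{
    unsigned char *buf = alloc_code_near((void *)&main, 4096);
    if (buf != NULL)
        printf("JIT buffer at %p, main at %p\n", (void *)buf, (void *)&main);
    return 0;
}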

Original comment by jyass...@gmail.com on 4 Jun 2009 at 4:11

GoogleCodeExporter commented 9 years ago
I've now read through the assembly produced for a very simple function by our JIT and
gcc, and I can't see anything obviously different in a way that would produce a
performance change. One possibility is that the code layout is enough worse to produce
more branch or icache misses. oprofile/shark should be able to tell us that. We can fix
it by recording branch frequencies in the interpreter and feeding those to the
optimizers when we translate bytecode to IR.
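
A minimal sketch of what recording branch frequencies in the interpreter could look
like; the names and the fixed-size table are made up for illustration, and the
translator could later turn the counts into branch weights:

/* Illustrative only: per-bytecode-offset branch counters. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t taken;
    uint32_t not_taken;
} BranchCounts;

/* One entry per conditional-jump offset; sized arbitrarily for the sketch. */
static BranchCounts counts[256];

/* Called from the eval loop each time a conditional jump executes. */
static void record_branch(int bytecode_offset, int taken)
{
    if (taken)
        counts[bytecode_offset].taken++;
    else
        counts[bytecode_offset].not_taken++;
}

/* The bytecode-to-IR translator could consume this as a branch weight. */
static double taken_probability(int bytecode_offset)
{
    BranchCounts c = counts[bytecode_offset];
    uint32_t total = c.taken + c.not_taken;
    return total ? (double)c.taken / total : 0.5;
}

int main(void)
{
    /* Simulate the interpreter seeing a mostly-taken branch at offset 10. */
    for (int i = 0; i < 90; i++) record_branch(10, 1);
    for (int i = 0; i < 10; i++) record_branch(10, 0);
    printf("P(taken at offset 10) = %.2f\n", taken_probability(10));
    return 0;
}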

Another possibility is that the overhead of calling _PyLlvmFunction_Eval() and
getPointerToFunction() on each call is high enough to make a difference. I'll try
caching that pointer in the code and frame objects next.
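
A minimal sketch of the caching idea, with made-up type and function names standing in
for the real code/frame objects and JIT lookup; the point is that only the first call
pays for the lookup:

/* Illustrative only: cache the JITted entry point on the code object. */
#include <stddef.h>
#include <stdio.h>

typedef long FakeFrame;                       /* stand-in for the frame object */
typedef long (*NativeEntry)(FakeFrame *);

typedef struct {
    void *llvm_function;      /* handle to the compiled IR */
    NativeEntry native_entry; /* cached machine-code pointer, NULL until first call */
} FakeCode;

/* Stand-ins for the compiled function and the expensive JIT lookup. */
static long jitted_body(FakeFrame *f) { return *f * 2; }
static NativeEntry expensive_jit_lookup(void *llvm_function)
{
    (void)llvm_function;
    puts("JIT lookup performed");
    return jitted_body;
}

static long call_jitted(FakeCode *co, FakeFrame *frame)
{
    if (co->native_entry == NULL)             /* only the first call pays */
        co->native_entry = expensive_jit_lookup(co->llvm_function);
    return co->native_entry(frame);
}

int main(void)
{
    FakeCode co = { NULL, NULL };
    FakeFrame f = 21;
    printf("%ld\n", call_jitted(&co, &f));    /* lookup + call */
    printf("%ld\n", call_jitted(&co, &f));    /* cached call only */
    return 0;
}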

Original comment by jyass...@gmail.com on 4 Jun 2009 at 5:00

GoogleCodeExporter commented 9 years ago
The 64-bit indirect jumps look like a likely suspect. On my Mac, in a 32-bit build, -L2
makes the benchmarks between 0% and 18% faster. On my Linux box, in a 64-bit build, -L2
makes the benchmarks between 2% faster and 15% slower. So, how do we allocate memory for
JITted code so it lands near the statically-compiled code?

Original comment by jyass...@gmail.com on 4 Jun 2009 at 5:22

GoogleCodeExporter commented 9 years ago
r610 fixed the overhead of calling _PyLlvmFunction_Eval() and getPointerToFunction() on
each call. That was significant.

The indirect jumps are looking less likely as a culprit. I wrote a test program which
showed that indirect jumps more than 4 bytes apart predict independently, and when
they're predicted well they run as fast as direct jumps.
http://codereview.appspot.com/63221, which avoids the indirect jumps, actually shows a
slowdown on the benchmarks. The only thing left to try around this is to avoid the stubs
entirely, as 32-bit builds do.
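
A minimal sketch of the kind of test program described above (not the original one),
assuming Linux/x86-64 and an OS that permits a writable+executable mapping: it emits the
stub pattern from the earlier disassembly at a configurable stride and times calls
through the stubs; comparing timings across strides would show whether nearby stubs
alias in the indirect jump predictor:

/* Illustrative only: time calls through mov/jmp stubs spaced STRIDE bytes apart. */
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define NSTUBS 64
#define STRIDE 16          /* each stub is 13 bytes, so try 16, 32, 64, ... */

static void target_a(void) {}
static void target_b(void) {}

/* Write the stub pattern from the disassembly above:
 *   mov $imm64, %r10   (49 ba <8-byte address>)
 *   jmpq *%r10         (41 ff e2)                                          */
static void emit_stub(uint8_t *p, void *target)
{
    uint64_t addr = (uint64_t)target;
    p[0] = 0x49; p[1] = 0xba;
    memcpy(p + 2, &addr, 8);
    p[10] = 0x41; p[11] = 0xff; p[12] = 0xe2;
}

int main(void)
{
    uint8_t *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    /* Alternate the stub targets: if the predictor aliases nearby stubs, the
     * alternating targets cause mispredictions; if each stub is tracked
     * separately, every jump is perfectly predictable. */
    for (int i = 0; i < NSTUBS; i++)
        emit_stub(buf + i * STRIDE, (i & 1) ? (void *)target_b : (void *)target_a);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int rep = 0; rep < 1000000; rep++)
        for (int i = 0; i < NSTUBS; i++)
            ((void (*)(void))(buf + i * STRIDE))();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("stride %d: %.3f s\n", STRIDE,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}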

Original comment by jyass...@gmail.com on 9 Jun 2009 at 11:11

GoogleCodeExporter commented 9 years ago
Based on the latest round of benchmarking, our benchmarks are now faster with LLVM than
with the eval loop. There's more work that can be done, but I'm closing this now that
the initial baseline has been exceeded. We should file individual issues for subsequent
improvements.

Original comment by collinw on 1 Jul 2009 at 6:40