llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
27.82k stars 11.45k forks source link

A function with computed goto is compiled slowly and produces a large object file #7734

Open llvmbot opened 14 years ago

llvmbot commented 14 years ago
Bugzilla Link 7362
Version trunk
OS Linux
Attachments a self-contained testcase
Reporter LLVM Bugzilla Contributor
CC @efriedma-quic,@stoklund

Extended Description

$ time gcc -O2 -c -o vm_gcc.o vm.i

real 0m13.378s user 0m12.248s sys 0m0.264s $ time clang -O2 -c -o vm_clang.o vm.i

real 1m34.942s user 1m3.066s sys 0m25.986s $ size vm_gcc.o vm_clang.o text data bss dec hex filename 74704 640 164 75508 126f4 vm_gcc.o 525331 2688 145 528164 80f24 vm_clang.o

This file is from a development version of Guile, which can be built with clang, but due to this bug it runs at least five times slower than gcc-compiled version.

llvmbot commented 14 years ago

Sorry about the late response, I was on holiday. I have about 20 huge single-function testcases with computed gotos that all compile to good code so there is something peculiar in this VM code that causes bad regalloc. I also agree that VM runtime speed is more important than compile-time speed.

1ba3d143-a64b-4671-82b2-0b31cfb91709 commented 14 years ago

Yes, exactly :-)

The frontend funnels all indirect branches through a single basic block, %indirectgoto in this case. This block has 206 ingoing and 183 outgoing CFG edges. This is a managable CFG, but that indirect branch is completely impossible to handle for the CPU branch predictor.

Early tail duplication creates 206 copies of the indirect branch block, each with 183 outgoing CFG edges. This is the CFG explosion that the frontend was trying to avoid, but unfortunately it is necessary to produce good code. With an individual indirect branch, the CPU branch predictor gets to do its Markov chain thing based on the branch instruction address.

Part of the philosophy here is that any function using computed gotos is written with performance in mind, and it is OK to spend some extra compile time on getting good code. It looks like we are failing to do that here, though.

efriedma-quic commented 14 years ago

Thanks, Eli.

It is interesting that PHI elimination takes to long. I thought we had that under control for these indirectbr interpreters.

PHI elimination is uniqueing phi nodes to limit the number of join registers produced. Early tail duplication is generating lots of identical phi nodes.

I'd bet -disable-early-taildup would improve both compile time and code size a lot. Unfortunately, early tail dup is critical to getting good branch prediction.

It does help a lot: it makes llc about 15x faster and reduces the codesize to something comparable with gcc. Maybe some heuristic could be added to prevent tail dup on the blocks in question before regalloc?

Wait, nevermind... it's supposed to help cases like this, isn't it. :)

efriedma-quic commented 14 years ago

Thanks, Eli.

It is interesting that PHI elimination takes to long. I thought we had that under control for these indirectbr interpreters.

PHI elimination is uniqueing phi nodes to limit the number of join registers produced. Early tail duplication is generating lots of identical phi nodes.

I'd bet -disable-early-taildup would improve both compile time and code size a lot. Unfortunately, early tail dup is critical to getting good branch prediction.

It does help a lot: it makes llc about 15x faster and reduces the codesize to something comparable with gcc. Maybe some heuristic could be added to prevent tail dup on the blocks in question before regalloc?

1ba3d143-a64b-4671-82b2-0b31cfb91709 commented 14 years ago

Thanks, Eli.

It is interesting that PHI elimination takes to long. I thought we had that under control for these indirectbr interpreters.

PHI elimination is uniqueing phi nodes to limit the number of join registers produced. Early tail duplication is generating lots of identical phi nodes.

I'd bet -disable-early-taildup would improve both compile time and code size a lot. Unfortunately, early tail dup is critical to getting good branch prediction.

efriedma-quic commented 14 years ago

Testcase compiled to .bc Requested file; timings on my computer:

$ time gcc -m32 -O2 /tmp/vm.i -S -o /tmp/vmgcc.s real 0m4.305s user 0m4.170s sys 0m0.070s

$ time llc /tmp/vm.bc real 0m24.500s user 0m24.300s sys 0m0.160s

Top time users in llc: 15.2600 ( 62.9%) 0.0600 ( 46.2%) 15.3200 ( 62.8%) 15.3209 ( 62.8%) Simple Register Coalescing 4.0700 ( 16.8%) 0.0000 ( 0.0%) 4.0700 ( 16.7%) 4.0741 ( 16.7%) Eliminate PHI nodes for register allocation 1.5300 ( 6.3%) 0.0000 ( 0.0%) 1.5300 ( 6.3%) 1.4996 ( 6.2%) Linear Scan Register Allocator 1.2000 ( 4.9%) 0.0200 ( 15.4%) 1.2200 ( 5.0%) 1.2115 ( 5.0%) Tail Duplication 0.6900 ( 2.8%) 0.0100 ( 7.7%) 0.7000 ( 2.9%) 0.6990 ( 2.9%) Live Variable Analysis 0.4500 ( 1.9%) 0.0100 ( 7.7%) 0.4600 ( 1.9%) 0.4886 ( 2.0%) X86 DAG->DAG Instruction Selection

Size of stripped clang object file: 592096 Size of stripped gcc object file: 80264

1ba3d143-a64b-4671-82b2-0b31cfb91709 commented 14 years ago

Juhani, I can't compile your precompiled file on Darwin. Could you produce a bitcode file with "clang -O2 -emit-llvm" please? The slowdown is likely to be in the code generator, as "llc -O2 -time-passes vm.bc" should tell you.

efriedma-quic commented 14 years ago

Eli, your crasher doesn't crash for me. Is it fixed, or could you provide an llc command line as well? Thanks

I can't reproduce the crash anymore.

1ba3d143-a64b-4671-82b2-0b31cfb91709 commented 14 years ago

Eli, your crasher doesn't crash for me. Is it fixed, or could you provide an llc command line as well? Thanks

efriedma-quic commented 14 years ago

CCing someone who might be able to figure out what's going on here.

efriedma-quic commented 14 years ago

Bugpoint-reduced crash So despite how long it took, the testcase reduced to a sane size. I'm not sure if this is relevant to the issue in comment 0, but the liveness being miscomputed could cause all sorts of strange results.

llvmbot commented 14 years ago

It's possible there's something subtly wrong with the LiveVariables pass messing up the output here; while messing with the testcase, I found a .bc file which SEGFAULTs llc; I'm currently trying to reduce it with bugpoint, but it's going very slowly. I can attach it if anyone is interested.

I have filed also #​7348 and #​7356 which happen on big functions with computed gotos but result in sig11 (stack overflow)/sig6 (in jump threading). They are not directly related to this bug, but many of the testcases took a long time to compile as well before crashing..

efriedma-quic commented 14 years ago

The -O0 code here is okay (only slightly larger than gcc).

It appears the codesize explosion comes from the code generator handling the very large phi nodes from mem2reg badly. Looking over the -O2 generated code, something like half the generated code is spills and reloads. the vast majority of lines are labeled with either "4-byte Reload" or "4-byte Spill".

It's possible there's something subtly wrong with the LiveVariables pass messing up the output here; while messing with the testcase, I found a .bc file which SEGFAULTs llc; I'm currently trying to reduce it with bugpoint, but it's going very slowly. I can attach it if anyone is interested.