llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
27.92k stars 11.52k forks source link

slow compilation #13447

Open llvmbot opened 12 years ago

llvmbot commented 12 years ago
Bugzilla Link 13075
Version trunk
OS Linux
Reporter LLVM Bugzilla Contributor
CC @DougGregor,@belkadan,@jryans,@tkremenek

Extended Description

clang version 3.2 (trunk 155835)

The attached 2.6MB .E file is a pre-processed (-E) file from the chromium build of WebKit, and clang takes ~3s to compile it down to a 3.4M .o file (on a beefy linux machine).

$ /usr/bin/time clang -w -g   -x c++  -c webcore_rendering.RenderMediaControlsChromium.E -o /dev/null
2.99user 0.08system 0:03.08elapsed 99%CPU (0avgtext+0avgdata 705440maxresident)k
0inputs+0outputs (0major+56787minor)pagefaults 0swaps

Dropping the -g drops the compile time to 2.2s-2.3s, although this still seems excessively long.

When I do the same pre-processing for a g++-based ARM cross-compile of this file, the -E output is still 2.6M, but g++ 4.6.3 takes only ~70ms (!) to compile it down to a 1.4M .o file (once caches are warmed up).

So for some reason clang is 42x slower than gcc in producing output that is less than 3x the size.

llvmbot commented 11 years ago

... but top-of-tree is r171419. That's a significant number of revisions. For sanity, could you recheck the timings with tip-of-trunk?

I fail to build at r171419. Using chromium's pre-built clang at r170392 shows no speedup (still in 1.5-1.6s).

Does the llvm project have performance/timing tests that might benefit from adding this test case?

tkremenek commented 11 years ago

As of r168474 my test from comment#3 is down to 1.6s on an HP z620 running precise (original timings were from a z600 running lucid, FWIW).

So, there's about a 2x speed improvement over the originally-reported bug, but that's still over 20x slower than gcc.

... but top-of-tree is r171419. That's a significant number of revisions. For sanity, could you recheck the timings with tip-of-trunk?

tkremenek commented 11 years ago

As of r168474 my test from comment#3 is down to 1.6s on an HP z620 running precise (original timings were from a z600 running lucid, FWIW).

So, there's about a 2x speed improvement over the originally-reported bug, but that's still over 20x slower than gcc.

Thanks. The comparison against gcc is useful.

llvmbot commented 11 years ago

As of r168474 my test from comment#3 is down to 1.6s on an HP z620 running precise (original timings were from a z600 running lucid, FWIW).

So, there's about a 2x speed improvement over the originally-reported bug, but that's still over 20x slower than gcc.

tkremenek commented 11 years ago

Is this still an issue? On my machine, this takes 1.4s to compile:

$ time ~/llvm-make/Release/bin/clang -c /tmp/webcore_rendering.RenderMediaControlsChromium.ii -w -g -fno-blocks

real    0m1.395s
user    0m1.316s
sys 0m0.074s
tkremenek commented 11 years ago

Is this still an issue? On my machine, this takes 1.4s to compile:

$ time ~/llvm-make/Release/bin/clang -c /tmp/webcore_rendering.RenderMediaControlsChromium.ii -w -g -fno-blocks

real    0m1.395s
user    0m1.316s
sys 0m0.074s
llvmbot commented 11 years ago

Nico Weber tells me this bug needs -ftime-report output, so here it is, for clang version 3.2 (trunk 163674) on x86_64-unknown-linux-gnu:

$ /usr/bin/time ~/src/chromium/src/third_party/llvm-build/Release+Asserts/bin/clang -w -g   -x c++  -c webcore_rendering.RenderMediaControlsChromium.E -o /dev/null -ftime-report
===-------------------------------------------------------------------------===
                      Instruction Selection and Scheduling
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0100 seconds (0.0168 wall clock)

   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.0100 (100.0%)   0.0100 (100.0%)   0.0042 ( 24.7%)  Instruction Scheduling
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0028 ( 16.6%)  Instruction Selection
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0025 ( 15.1%)  Instruction Creation
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0019 ( 11.4%)  DAG Combining 1
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0017 (  9.9%)  Vector Legalization
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0014 (  8.2%)  DAG Legalization
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0010 (  6.1%)  Type Legalization
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0009 (  5.5%)  DAG Combining 2
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0004 (  2.3%)  Instruction Scheduling Cleanup
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.2%)  DAG Combining after legalize types
   0.0100 (100.0%)   0.0100 (100.0%)   0.0168 (100.0%)  Total

===-------------------------------------------------------------------------===
                                 DWARF Emission
===-------------------------------------------------------------------------===
  Total Execution Time: 0.1100 seconds (0.1267 wall clock)

   ---User Time---   --User+System--   ---Wall Time---  --- Name ---
   0.1100 (100.0%)   0.1100 (100.0%)   0.1254 ( 99.0%)  DWARF Debug Writer
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0013 (  1.0%)  DWARF Exception Writer
   0.1100 (100.0%)   0.1100 (100.0%)   0.1267 (100.0%)  Total

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 0.1000 seconds (0.1219 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.0200 ( 22.2%)   0.0100 (100.0%)   0.0300 ( 30.0%)   0.0362 ( 29.7%)  X86 DAG->DAG Instruction Selection
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0263 ( 21.6%)  X86 AT&T-Style Assembly Printer
   0.0200 ( 22.2%)   0.0000 (  0.0%)   0.0200 ( 20.0%)   0.0138 ( 11.3%)  Module Verifier
   0.0100 ( 11.1%)   0.0000 (  0.0%)   0.0100 ( 10.0%)   0.0128 ( 10.5%)  Module Verifier
   0.0100 ( 11.1%)   0.0000 (  0.0%)   0.0100 ( 10.0%)   0.0123 ( 10.1%)  Module Verifier
   0.0200 ( 22.2%)   0.0000 (  0.0%)   0.0200 ( 20.0%)   0.0037 (  3.0%)  Fast Register Allocator
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0030 (  2.4%)  Machine Function Analysis
   0.0100 ( 11.1%)   0.0000 (  0.0%)   0.0100 ( 10.0%)   0.0026 (  2.1%)  Prologue/Epilogue Insertion & Frame Finalization
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0019 (  1.6%)  Two-Address instruction pass
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0012 (  1.0%)  Dominator Tree Construction
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0009 (  0.7%)  Dominator Tree Construction
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0009 (  0.7%)  Basic Alias Analysis (stateless AA impl)
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0009 (  0.7%)  Dominator Tree Construction
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0007 (  0.6%)  Basic CallGraph Construction
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0005 (  0.4%)  Remove unreachable blocks from the CFG
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0004 (  0.4%)  No Alias Analysis (always returns 'may' alias)
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0004 (  0.3%)  Post-RA pseudo instruction expansion pass
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0004 (  0.3%)  Bundle Machine CFG Edges
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0004 (  0.3%)  Inliner for always_inline functions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0003 (  0.3%)  Eliminate PHI nodes for register allocation
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0003 (  0.2%)  Exception handling preparation
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0002 (  0.2%)  Expand ISel Pseudo-instructions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0002 (  0.2%)  Preliminary module verification
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0002 (  0.2%)  Preliminary module verification
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0002 (  0.2%)  Insert stack protectors
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0002 (  0.1%)  X86 FP Stackifier
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0002 (  0.1%)  Analyze Machine Code For Garbage Collection
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0002 (  0.1%)  Local Stack Slot Allocation
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0002 (  0.1%)  X86 Maximal Stack Alignment Check
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.1%)  Lower Garbage Collection Instructions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.1%)  Preliminary module verification
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.1%)  Delete Garbage Collector Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.1%)  Create Garbage Collector Module Metadata
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Library Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Pass Configuration
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Machine Module Information
   0.0900 (100.0%)   0.0100 (100.0%)   0.1000 (100.0%)   0.1219 (100.0%)  Total

===-------------------------------------------------------------------------===
                         Miscellaneous Ungrouped Timers
===-------------------------------------------------------------------------===

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.7600 ( 76.9%)   0.0800 ( 66.7%)   1.8400 ( 76.3%)   1.8533 ( 76.6%)  Clang front-end timer
   0.2400 ( 10.5%)   0.0400 ( 33.3%)   0.2800 ( 11.6%)   0.2847 ( 11.8%)  Code Generation Time
   0.2900 ( 12.7%)   0.0000 (  0.0%)   0.2900 ( 12.0%)   0.2810 ( 11.6%)  LLVM IR Generation Time
   2.2900 (100.0%)   0.1200 (100.0%)   2.4100 (100.0%)   2.4191 (100.0%)  Total

1.78user 0.09system 0:01.88elapsed 99%CPU (0avgtext+0avgdata 646960maxresident)k
0inputs+0outputs (0major+51782minor)pagefaults 0swaps
llvmbot commented 12 years ago

[pre-processed https://user-images.githubusercontent.com/60944935/143745482-bfb49fb9-290c-42b8-93ff-d84618c1c891.gz) Here's a reconstruction of what I described in my original report (not sure where the old attachment went).

This takes 2.63s walltime to compile 2.3M of pre-processed source into 3.1M of compiled .o:

$ /usr/bin/time clang -w -g   -x c++  -c /tmp/webcore_rendering.RenderMediaControlsChromium.E -o /dev/null
2.48user 0.13system 0:02.63elapsed 99%CPU (0avgtext+0avgdata 656096maxresident)k
0inputs+0outputs (0major+53620minor)pagefaults 0swaps
belkadan commented 12 years ago

Sorry to get back to you months later, but I don't see a preprocessed file...

DougGregor commented 12 years ago

cloned to rdar://problem/11637948

Endilll commented 10 months ago

Re-uploading preprocessed source: webcore_rendering.RenderMediaControlsChromium.E.zip