Open llvmbot opened 7 years ago
We can still reproduce this bug with LLVM 12.0 using the LLPC pipeline compiler for AMD GPUs. Our profiles show the same pathological behavior in tryEvict.
Instead of using the fast regalloc (which is super bad for code quality), you could give the basic allocator a try.
Comparing release builds, here are the compile-time differences with and without Eric's patch:
ir_fs138_variant0.bc
The compile-time increase comes from greedy register allocation. The patch changes the instruction scheduling (as it is meant to), which unfortunately means that in this particular case we produce a schedule that is particularly bad for the greedy RA. Here are a few technical details as I understand them from my investigation:
With this information in mind, I think we might have to consider this a limitation and close this PR. Considering Mesa is a JIT, it may be worthwhile investigating the possibility of switching to the fast register allocator (i.e., the -regalloc=fast option). I assume that will produce less optimal register allocation, but it is presumably faster than the near-optimal greedy register allocator. Let me know what you think about this.
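To make the comparison concrete, llc accepts a -regalloc option for selecting the allocator (greedy is the default; basic and fast are the alternatives discussed here). A minimal timing sketch, assuming llc is on PATH and the attached ir_fs138_variant0.bc is in the current directory:

```shell
# Time each register allocator on the problematic fragment shader bytecode.
# Output goes to /dev/null so only compile time is measured.
for ra in greedy basic fast; do
  echo "== -regalloc=$ra =="
  time llc ir_fs138_variant0.bc -mcpu=pwr8 -mattr=+altivec,+vsx \
       -regalloc="$ra" -o /dev/null
done
```

This would show directly how much of the regression is recovered by trading allocation quality for speed.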
My patch is merely a scheduler-description change; at worst it's highlighting a performance problem somewhere else, sadly.
I tested reverting Eric's patch (the one Ben reported) on the 5.0 branch. Reverting just that patch reduces the Release-build llc compile time for ir_fs138_variant0.bc by about 10% (the other two bytecode files show even smaller differences), which means Eric's patch alone causes about a 10% compile-time degradation on the 5.0 branch. However, we did see about a 3x-4x compile-time difference for ir_fs138_variant0.bc when I reverted everything from Eric's patch onward (from about 0.5 seconds before to about 2.0 seconds after). I will continue looking at this issue. Meanwhile, please use a Release-build llc for compilation in the future, since it is far faster than a Debug-build llc (about 25x-45x faster).
I see I did not specify my exact build procedure; apologies!
Here it is:
In my LLVM directory, /tmp/llvm-bisect (i.e., on RAMdisk):
% cmake -G "Unix Makefiles" -DLLVM_BUILD_LLVM_DYLIB=ON -DCMAKE_INSTALL_PREFIX=/tmp/local /tmp/llvm-bisect
% make -j 144
I.e., I built with "gcc (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2)", the system compiler; we are somewhat constrained to use GCC when building Mesa, LLVM, etc.
But I DID do Debug builds (i.e. let the build type default to Debug), so maybe that has something to do with the differences in our experiences.
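Since the build type turned out to matter so much here, a sketch of an explicitly-Release configuration using the same flags and paths from the build procedure above (CMake's CMAKE_BUILD_TYPE otherwise defaults to Debug in this setup, which is the 25x-45x slower configuration mentioned elsewhere in the thread):

```shell
# Same in-source build as above, but forcing a Release build type
# so timing comparisons are apples-to-apples.
cmake -G "Unix Makefiles" \
      -DCMAKE_BUILD_TYPE=Release \
      -DLLVM_BUILD_LLVM_DYLIB=ON \
      -DCMAKE_INSTALL_PREFIX=/tmp/local \
      /tmp/llvm-bisect
make -j 144
```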
Hi Ben, can you verify that you are not comparing a Release-build llc against a Debug-build llc? We know that the Debug-build llc is significantly slower than the Release build. According to my tests, the Release-build llc compile time for the fragment shader bytecode is always around 3 seconds with or without Eric's patch, while the Debug-build llc compile time is almost 2 minutes with or without it. Can you run the test again for both the Release and Debug builds and post your detailed results here if you still believe there is a compile-time degradation? Thanks a lot!
I can reproduce this degradation. I'm not sure how you ran your experiment, Tony, but I get consistent run times of around 0.5s before the first patch and 2.0s after it. We will continue investigating.
Hi Ben, I tried compiling all three bytecode files from your attachment on our PPC64LE dev machine, with and without the first bad commit you mentioned (0ef3663fb81c9cd73f646728463a6105b5d9b88a), using the options from your comment (-mcpu=pwr8 -mattr=+altivec,+vsx). There is no significant compile-time difference for any of the three bytecode files; I ran each configuration 10 times with and without that patch. Can you retry this against the latest trunk of clang/llvm and see whether you can still reproduce? Note that I was only reverting the problematic patch from Eric Christopher that you mentioned. If you can provide me with the git hashes for the other three projects (clang, compiler-rt, and test-suite) from when you found the bad llvm commit (they should have timestamps similar to the 0ef3663fb81c9cd73f646728463a6105b5d9b88a patch), I can revert all the projects to around the time of the bad llvm commit and test again to see whether I can reproduce it. Thank you very much!
The following is one of my test results (there is no visible difference between runs):
time `llc fragmentShader.bc -mcpu=pwr8 -mattr=+altivec,+vsx`
real 0m3.501s
user 0m3.491s
sys 0m0.008s
P.S. Note that I kept my LLVM build in /tmp, i.e. on RAM disk, so the only disk I/O involved was reading the bytecode file and writing the assembly language output.
Bytecode used for bisect operation

Hi Nemanja,
Sorry, I did not keep the compile time information for each of the individual bisect steps. HOWEVER, I CAN tell you that, before the problem commit, the compile time for the shader code was routinely in the 6-7 second range, while after the problem commit, the compile time was in the 37-45 second range.
BTW I've attached the specific bytecode file I used for the bisect operation, ir_fs138_variant0.bc.
Hi Ben, do you happen to have the compile times for the same shader code with each of the mentioned revisions? It would be good to see which one results in the largest jump. Then we can investigate why this results in such a large compile-time increase.
I did a bisect operation as requested by Nemanja, and here is the result (please pardon my use of git instead of SVN):
# first bad commit: [0ef3663fb81c9cd73f646728463a6105b5d9b88a] vec perm can go down either pipeline on P8. No observable changes, spotted while looking at the scheduling description.
This certainly looks suspicious, in light of the fact that the change is in lib/Target/PowerPC/PPCScheduleP8.td.
Here is the text of the commit in the context of the surrounding commits:
commit b89cc7e5e30432b6093664a44ee2e2af9a42f3b6
Author: Nirav Dave <niravd@google.com>
Date: Sun Feb 26 01:27:32 2017 +0000
Revert "In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled."
This reverts commit r296252 until 256-bit operations are more efficiently generated in X86.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@296279 91177308-0d34-0410-b5e6-96231b3b80d8
commit 0ef3663fb81c9cd73f646728463a6105b5d9b88a
Author: Eric Christopher <echristo@gmail.com>
Date: Sun Feb 26 00:11:58 2017 +0000
vec perm can go down either pipeline on P8.
No observable changes, spotted while looking at the scheduling description.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@296277 91177308-0d34-0410-b5e6-96231b3b80d8
commit 3a603f41297cad31be9ce54e1c8c076c76c60ddf
Author: Sanjoy Das <sanjoy@playingwithpointers.com>
Date: Sat Feb 25 22:25:48 2017 +0000
Fix signed-unsigned comparison warning
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@296274 91177308-0d34-0410-b5e6-96231b3b80d8
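A bisect like this can also be automated with git bisect run. The sketch below is hypothetical: the 10-second cutoff is an invented threshold sitting between the ~6-7s good and ~37-45s bad compile times reported elsewhere in the thread, the good/bad commit placeholders must be filled in, and the paths assume the in-source /tmp/llvm-bisect build described above:

```shell
# Automate the bisect: each step rebuilds llc, times one compile of the
# problem shader, and classifies the commit by a timing threshold.
git bisect start <bad-commit> <good-commit>
git bisect run sh -c '
  make -j 144 llc || exit 125           # exit 125 skips commits that fail to build
  start=$(date +%s)
  ./bin/llc ir_fs138_variant0.bc -mcpu=pwr8 -mattr=+altivec,+vsx -o /dev/null
  end=$(date +%s)
  [ $((end - start)) -lt 10 ]           # under threshold => good commit
'
```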
Build time seems to be in RAGreedy (fragment shader):
llvm::MachineFunctionPass::runOnFunction 99.24 %
- `anonymous namespace'::RAGreedy::runOnMachineFunction 93.59 % 0.00 %
- llvm::RegAllocBase::allocatePhysRegs 93.52 % 0.00 %
- `anonymous namespace'::RAGreedy::selectOrSplit 92.44 % 0.00 %
- `anonymous namespace'::RAGreedy::selectOrSplitImpl 92.20 % 0.00 %
- `anonymous namespace'::RAGreedy::tryEvict 86.68 % 0.02 %
- `anonymous namespace'::RAGreedy::canEvictInterference 86.27 % 0.06 %
- `anonymous namespace'::RAGreedy::canReassign 80.64 % 0.35 %
- llvm::LiveIntervalUnion::Query::checkInterference 61.62 % 0.31 %
- llvm::LiveIntervalUnion::Query::collectInterferingVRegs 61.30 % 1.27 %
- llvm::IntervalMap<llvm::SlotIndex,llvm::LiveInterval * __ptr64,8,llvm::IntervalMapInfo<llvm::SlotIndex> >::const_iterator::find 19.26 % 0.36 %
+ llvm::IntervalMapImpl::LeafNode<llvm::SlotIndex,llvm::LiveInterval * __ptr64,8,llvm::IntervalMapInfo<llvm::SlotIndex> >::findFrom 7.71 % 0.20 %
+ llvm::IntervalMap<llvm::SlotIndex,llvm::LiveInterval * __ptr64,8,llvm::IntervalMapInfo<llvm::SlotIndex> >::const_iterator::treeFind 5.70 % 0.05 %
+ llvm::IntervalMap<llvm::SlotIndex,llvm::LiveInterval * __ptr64,8,llvm::IntervalMapInfo<llvm::SlotIndex> >::const_iterator::setRoot 3.64 % 0.15 %
+ llvm::IntervalMap<llvm::SlotIndex,llvm::LiveInterval * __ptr64,8,llvm::IntervalMapInfo<llvm::SlotIndex> >::rootLeaf 0.99 % 0.28 %
+ llvm::IntervalMap<llvm::SlotIndex,llvm::LiveInterval * __ptr64,8,llvm::IntervalMapInfo<llvm::SlotIndex> >::const_iterator::branched 0.83 % 0.46 %
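A lighter-weight way to get this kind of breakdown, without attaching a full profiler, is llc's built-in pass timers. A sketch using the same options as elsewhere in the thread; if the profile above holds, the greedy register allocator should dominate the report:

```shell
# Print per-pass wall/user time after compilation; the "Greedy Register
# Allocator" row is the one implicated by the profile in this thread.
llc ir_fs138_variant0.bc -mcpu=pwr8 -mattr=+altivec,+vsx \
    -time-passes -o /dev/null
```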
What were your triple/cpu settings?
Tom Stellard suggested I also supply the -mcpu and -mattr options.
Here they are:
% llc -mcpu=pwr8 -mattr=+altivec,+vsx
What were your triple/cpu settings?
% llc --version
LLVM (http://llvm.org/):
  LLVM version 6.0.0svn
  DEBUG build with assertions.
  Default target: powerpc64le-unknown-linux-gnu
  Host CPU: pwr8
What were your triple/cpu settings?
Extended Description
The Piglit (OpenGL test suite) ext_transform_feedback-max-varyings test utilizes somewhat unusual shader programs (both vertex and fragment shaders). The llc compiler prior to 4.0 compiled these programs in not-unacceptable times of 0.078 seconds for a representative vertex shader and 2.6-4.5 seconds for a representative fragment shader.
The V4.0 and later llc takes a MUCH longer time to compile the same code: 1.66 seconds for the vertex shader (a factor of 20 times slower!) and 1 minute 55 seconds for the fragment shader (a factor of 25-45 times slower!).
I will attach sample vertex shader code (ir_draw_llvm_vs_variant0.bc) and fragment shader code (ir_fs914_variant0.bc). The target architecture is PPC64LE.