llc takes forever on a test with lots of spills

abique commented 7 years ago


Bugzilla Link	32767
Resolution	FIXED
Resolved on	Apr 25, 2017 08:30
Version	4.0
OS	other
Attachments	Problematic LLVM IR
CC	@RKSimon,@rotateright

Extended Description

Hi,

I'm a Bitwig Studio developer (www.bitwig.com), and we use LLVM to JIT some digital signal processing algorithms. Our software is used by thousands of customers on Windows, Mac and Linux.

We just updated from LLVM 3.9.1 to LLVM 4.0 and found that it takes a lot of time to JIT the attached LLVM IR (48701ms) on AMD Ryzen R7 1700X, while it is not noticeable on other architectures.

The command "opt -mcpu=znver1 /home/abique/downloads/claes-cache-entry.ll -o tutu.bc -O3" is instant, so it is slow in the target lowering or in the JIT phase.

By the way, do you have a workaround for this issue until the fix is released?

Many thanks.

Regards, Alexandre

abique commented 7 years ago

Hi,

May I ask why running opt first prevents llc from triggering the bug? Which passes are executed by opt but not by llc?

Thanks.

abique commented 7 years ago

Thank you very much.

llvmbot commented 7 years ago

The commit that fixed the compile time regression is

commit 30a921f62a8444a478e456d99022ea847f48336c
Author: Nirav Dave <niravd@google.com>
Date:   Tue Mar 14 00:34:14 2017 +0000

    In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled.

        Recommiting with compiler time improvements

        Recommitting after fixup of 32-bit aliasing sign offset bug in DAGCombiner.

        * Simplify Consecutive Merge Store Candidate Search

        Now that address aliasing is much less conservative, push through
        simplified store merging search and chain alias analysis which only
        checks for parallel stores through the chain subgraph. This is cleaner
        as the separation of non-interfering loads/stores from the
        store-merging logic.

        When merging stores search up the chain through a single load, and
        finds all possible stores by looking down from through a load and a
        TokenFactor to all stores visited.

        This improves the quality of the output SelectionDAG and the output
        Codegen (save perhaps for some ARM cases where we correctly constructs
        wider loads, but then promotes them to float operations which appear
        but requires more expensive constant generation).

        Some minor peephole optimizations to deal with improved SubDAG shapes (listed below)
[...]

abique commented 7 years ago

Hi,

Thank you very much for finding the root cause of the issue.

We'll try the latest svn revision tomorrow.

llvmbot commented 7 years ago

So, to clarify, I'm now able to reproduce on every machine I try. The trick is to not run opt before but just run llc on the bitcode as is.

So, this doesn't happen only on a zen host. This also doesn't happen only when optimizing for Ryzen. It's a general problem.

Examples:

[davide@localhost bin]$ time ./llc blah.ll -mtriple=x86_64-unknown -mcpu=core2

real 0m4.327s
user 0m4.267s
sys 0m0.060s

[davide@localhost bin]$ time ./llc blah.ll -mtriple=x86_64-unknown -mcpu=znver1

real 0m7.321s
user 0m7.239s
sys 0m0.081s

[davide@localhost bin]$ time ./llc blah.ll -mtriple=x86_64-unknown -mcpu=btver1

real 0m8.947s
user 0m8.918s
sys 0m0.029s

We'll take a look, but please take the time to elaborate and be more precise when reporting bugs in the future.

It seems the time went down from 40 seconds to < 10 seconds between 4.0 and today. I recommend trying ToT as a workaround.
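For anyone wanting to repeat the comparison, here is a minimal sketch of the same measurement as a script. It runs llc directly on the unoptimized IR (no opt beforehand) for each CPU used in the examples. The `LLC` variable, the `time_llc` helper name, and the input path argument are assumptions made for illustration; they are not part of the original report.

```shell
#!/bin/sh
# Sketch: time llc on the IR as-is (no opt run first) for several -mcpu values.
# LLC is a hypothetical override for the llc binary; it defaults to "llc" on PATH.
LLC="${LLC:-llc}"

# time_llc is a hypothetical helper: $1 = input .ll file, $2 = value for -mcpu.
time_llc() {
    start=$(date +%s)
    "$LLC" "$1" -mtriple=x86_64-unknown -mcpu="$2" -o /dev/null
    status=$?
    end=$(date +%s)
    # Whole-second resolution only; use `time` for finer measurements.
    echo "$2: $((end - start))s (exit $status)"
    return "$status"
}

# Only run the loop when an input file is given on the command line.
if [ -n "${1:-}" ]; then
    for cpu in core2 znver1 btver1; do
        time_llc "$1" "$cpu"
    done
fi
```

Saved under a name of your choice and invoked with the IR file as its only argument (for example `blah.ll` from the runs above), it prints one line per CPU. Setting `LLC` to the path of a 4.0 build and then a ToT build should make the difference between the two visible.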

llvmbot commented 7 years ago

So, to clarify

llvmbot commented 7 years ago

Apparently Simon is able to reproduce this one

The problem seems to be in llc and not the JIT. 46% of self time is spent in SUnit::addPred and 40% in SUnit::ComputeHeight.

We'll investigate.

llvmbot commented 7 years ago

This still has no info on how to reproduce. Please reopen when you have a standalone testcase. Thanks!

llvmbot commented 7 years ago

Please provide a standalone repro, otherwise it's impossible to reproduce.

abique commented 7 years ago

The issue does not happen in llc or opt, but when we JIT the code using llvm::ExecutionEngine::getPointerToFunction().

llvmbot commented 7 years ago

I tried on several machines and I'm not able to reproduce. Also, your bug report doesn't seem to contain enough information to reproduce the problem (opt is fast, llc is fast, hard to guess where the cycles are spent). Feel free to reopen when you have more information. Cheers.

llvmbot commented 7 years ago

This doesn't reproduce on trunk for me (I tried on a Ryzen). I suspect a problem in your setup. Also, please try trunk before reporting issues.

abique commented 7 years ago

On the problematic computer, even if we force the cpu target to "core2" it is still taking a lot of time.

Is it possible that the optimizer ignores the "core2" setting at some point and falls back to Ryzen optimizations?

abique commented 7 years ago

I should add that on Linux, with Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz it takes 151ms.

llvm / llvm-project

llc takes forever on a test with lots of spills #32114
