
GeekBench4/Lua on Skylake/SSE2 regressed by 12% between r308321 and r308891 #33379

Open db4cdc3a-4335-4c8c-9bf1-3141f9c3cba2 opened 7 years ago

db4cdc3a-4335-4c8c-9bf1-3141f9c3cba2 commented 7 years ago
Bugzilla Link 34032
Version trunk
OS Windows NT
CC @d0k,@RKSimon,@rotateright

Extended Description

We observed a 12% degradation in the benchmark score between the mentioned commits. Due to llvm/llvm-project#33247 the test would time out for the commits in that range, so we can only get measurements at the boundaries.

Just so we are all on the same page, here is the timeline:

  • r308321 - GeekBench4/Lua was measured with score X.
  • r308322 - GeekBench4/Lua started failing due to a compile-time timeout (same as llvm/llvm-project#33247).
  • r308891 - GeekBench4/Lua came back to life, but with score X-12%.

r308891 intended to fix the timeout problem and according to the log: "The cut off at 20 uses was chosen by looking at other cut-offs in LLVM for user scanning. It's probably too high, but does the job and is very unlikely to regress anything."

The combination of r308322+r308891 brings the benchmark to a lower score than at r308321. I ran an experiment on top of trunk (r309733) where I applied the following patch, which is intended to identify cases where FindAllMemoryUses bails out early due to r308891:

--- a/lib/CodeGen/CodeGenPrepare.cpp
+++ b/lib/CodeGen/CodeGenPrepare.cpp
@@ -4059,8 +4059,10 @@ static bool FindAllMemoryUses(
   for (Use &U : I->uses()) {
     // Conservatively return true if we're seeing a large number or a deep chain
     // of users. This avoids excessive compilation times in pathological cases.
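The body of the hunk was truncated in the migration. Given that the test-suite runs below report hitting an UNREACHABLE, the instrumentation presumably had roughly this shape (a hedged reconstruction, not the actual patch; the SeenInsts and MaxMemoryUsesToScan names are taken from the surrounding CodeGenPrepare code and may differ):

     if (SeenInsts++ >= MaxMemoryUsesToScan)
-      return true;
+      llvm_unreachable("FindAllMemoryUses hit the MaxMemoryUsesToScan early exit");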

With the patch applied, I compiled the benchmark and indeed the early exit did fire. So unfortunately, the assumption that r308891 does not affect real-world code (to the extent that benchmarks can be considered as such) does not hold.

Next, I ran the LLVM Test-Suite with the same patch applied and saw 79 compilations that hit the UNREACHABLE. For reference, here is a list of the affected tests (some of the MultiSource tests contained more than one UNREACHABLE hit):

MultiSource/Applications/ClamAV/clamscan
MultiSource/Applications/JM/ldecod/ldecod
MultiSource/Applications/JM/lencod/lencod
MultiSource/Applications/SIBsim4/SIBsim4
MultiSource/Applications/SPASS/SPASS
MultiSource/Applications/d/make_dparser
MultiSource/Applications/hbd/hbd
MultiSource/Applications/lua/lua
MultiSource/Applications/minisat/minisat
MultiSource/Applications/oggenc/oggenc
MultiSource/Applications/sqlite3/sqlite3
MultiSource/Applications/treecc/treecc
MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000
MultiSource/Benchmarks/Bullet/bullet
MultiSource/Benchmarks/MallocBench/espresso/espresso
MultiSource/Benchmarks/McCat/05-eks/eks
MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan
MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg
MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame
MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset
MultiSource/Benchmarks/Prolangs-C/cdecl/cdecl
MultiSource/Benchmarks/mafft/pairlocalalign
MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg
MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode
MultiSource/Benchmarks/nbench/nbench
MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4
SingleSource/Benchmarks/Adobe-C++/loop_unroll
SingleSource/Benchmarks/Misc-C++-EH/spirit

Just to be clear, hitting the early exit does not necessarily mean there was an effect on the final generated code, but it does show that the added early exit does not handle only pathological cases.

In light of the above, I think we should take a few steps back to r308322 and find a solution that avoids the compile-time explosion reported in llvm/llvm-project#33247 and in GeekBench, and that does not have a negative performance impact.

If we don't have data that shows a positive effect of r308322, I would like to request that r308322 and r308321 be disabled until a more robust solution is found. We should also ensure the 5.0 release branch is fixed accordingly.

Thanks, Zvi.

rotateright commented 2 years ago

mentioned in issue llvm/llvm-bugzilla-archive#36421

rotateright commented 7 years ago

I think this is independent of this bug (we need a fix in CGP for AddressingModeMatcher), but after looking at the llvm/llvm-project#33247 example a bit, I think it's a bit crazy to not expand small memcmp sooner in the optimization pipeline.

  1. We have 5000+ memcmp calls.
  2. Most of these are comparing against small constant strings.
  3. There are large numbers of repeated compares against the same constant. E.g., this 2-byte call:

$ grep memcmp llvm/llvm-project#33247_compile_time_explosion.ll | grep 1313
%call1314 = tail call i32 @memcmp(i8* %add.ptr1313, i8* getelementptr inbounds ([3 x i8], [3 x i8]* @.str.386, i64 0, i64 0), i64 2) #16
...

$ grep memcmp llvm/llvm-project#33247_compile_time_explosion.ll | grep 1313 | wc -l
130

  4. As a simpler example, consider: https://godbolt.org/g/QjVXvS

Because we only expand this in CGP and DAGCombiner can't look across blocks, we repeat loads and compares against the same values.

If we expanded sooner and ran the normal opt pipeline:

$ ./opt -codegenprepare memcmp_same_prefix_consts.ll -S -O2 | ./llc -o -

We'd fold this down to:

BB#0:
  cmpl    $858927408, (%rdi)   <--- if first chunk doesn't match, done
  jne     LBB0_3
BB#1:
  movzbl  4(%rdi), %ecx
  cmpq    $63, %rcx
  ja      LBB0_3
BB#2:
  xorl    %eax, %eax
  movabsq $-9218868428637470720, %rdx
  btq     %rcx, %rdx           <--- fancy bit hackery for the final byte
  jae     LBB0_3
BB#4:
  retq
LBB0_3:
  movl    $42, %eax
  retq
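For reference, the constants visible above hint at the shape of the godbolt example: 858927408 is "0123" read as a little-endian 32-bit word, and the movabsq mask has bits 33, 52, and 63 set, i.e. the final byte may be '!', '4', or '?'. A hedged reconstruction of the source (the function name is illustrative, not taken from the actual test):

    // Several small memcmp calls against constants that share the 4-byte prefix
    // "0123" and differ only in the final byte. Expanded early and run through
    // the normal pipeline, the prefix is loaded and compared once, and the last
    // byte is checked with a single bit test, as in the assembly above.
    #include <cstring>

    int classify(const char *p) {
      if (std::memcmp(p, "0123!", 5) == 0) return 0;
      if (std::memcmp(p, "01234", 5) == 0) return 0;
      if (std::memcmp(p, "0123?", 5) == 0) return 0;
      return 42;
    }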

llvmbot commented 7 years ago

Sanjay, you are right that AddressingModeMatcher keeps track of visited instructions. However, it remembers visited instructions only within a single optimizeMemoryInst call. If an address has N memory uses, then FindAllMemoryUses will traverse all of them; and for each memory instruction a new call to optimizeMemoryInst is made, a new AddressingModeMatcher is created, and all N memory uses are traversed again and again.

Here is a typical stack trace:

#1 FindAllMemoryUses(...)
#2 AddressingModeMatcher::isProfitableToFoldIntoAddressingMode(...)
#3 AddressingModeMatcher::matchAddr(...)
#4 AddressingModeMatcher::matchOperationAddr(...)
#5 AddressingModeMatcher::matchAddr(...)
#6 AddressingModeMatcher::Match(...)
#7 CodeGenPrepare::optimizeMemoryInst(...)

As an example, you may want to look at the luaV_execute function from the lua benchmark in the LLVM test-suite.
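To make the shape of the problem concrete, here is a minimal standalone model (plain C++, not LLVM code) of the behavior described above: the visited set lives only as long as one matcher, so the total number of use edges scanned grows quadratically with the number of memory uses of a single address.

    #include <cstdio>
    #include <vector>

    // Stand-in for AddressingModeMatcher: its "visited" bookkeeping exists only
    // for the duration of a single optimizeMemoryInst call.
    struct Matcher {
      long Scanned = 0;
      void findAllMemoryUses(const std::vector<int> &Uses) {
        Scanned += static_cast<long>(Uses.size());   // walks every memory use
      }
    };

    int main() {
      const int N = 5000;              // e.g. thousands of loads off one base pointer
      std::vector<int> Uses(N);
      long Total = 0;
      for (int I = 0; I < N; ++I) {    // one optimizeMemoryInst call per memory use
        Matcher M;                     // fresh matcher, fresh visited set
        M.findAllMemoryUses(Uses);     // rescans all N uses
        Total += M.Scanned;
      }
      std::printf("scanned %ld use edges for %d memory uses\n", Total, N);  // ~N*N
    }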

rotateright commented 7 years ago

adjust MaxMemoryUsesToScan implementation patch

rotateright commented 7 years ago

> The reproducer in Bug 33900 should be sufficient.

I can run experiments with that code to check compile-time perf, but I can't solve this bug without geekbench. I need some help. :)

I have attached the IR reduction from bug 33900 here, so we can focus on the CGP effect on that.

Let me list what I have so far. All timing is from a Haswell 4GHz system running macOS, using the attached example IR file, and I'm purposely quoting only one significant digit for timing because the rest is irrelevant:

  1. Before r308321 (before memcmp expansion to 2 blocks), 33900.ll took 4 sec to get through CGP.

  2. With r308321, it went practically infinite (nobody has run long enough to know).

  3. With Benjamin's bail-out fix in r308891, it comes back to 19 sec. This is apparently still acceptable perf (#33900 is closed).

  4. memcmp expansion is not the real factor in either bug. By this I mean that if either source had been written in a memcmp-expanded way to begin with, we'd be seeing the same problems.

  5. Given that we're using heuristics both for the profitability of the address mode transform and for when to bail out, the easiest potential fix is to adjust the MaxMemoryUsesToScan threshold, assuming we can restore geekbench perf while not going over the compile-time cliff. Some partial data to consider:

MaxMemoryUsesToScan = 2, llvm/llvm-project#33247 compile-time = 8 s
MaxMemoryUsesToScan = 10, llvm/llvm-project#33247 compile-time = 12 s
MaxMemoryUsesToScan = 20, llvm/llvm-project#33247 compile-time = 19 s
MaxMemoryUsesToScan = 40, llvm/llvm-project#33247 compile-time = 36 s
MaxMemoryUsesToScan = 80, llvm/llvm-project#33247 compile-time = 69 s

Can someone try these settings with geekbench to see if its score is restored?

  6. In comment 3, Nikolai wrote that there's a problem with the AddressingModeMatcher algorithm because it's visiting the same insts repeatedly. It's not clear to me how that happens. We are tracking "ConsideredInsts", so we shouldn't be revisiting already-seen insts?

  7. Independent of that, we could adjust what counts toward MaxMemoryUsesToScan. Let me attach a patch idea for that - we could count actual memops rather than all users (a rough sketch of the idea follows below). This causes compile-time for llvm/llvm-bugzilla-archive#33990 to rise from 19 s to 21 s. Does it "solve" geekbench?
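Something along these lines (illustrative only - this is not the attached patch, the helper name is hypothetical, and only plain loads and stores are shown):

    // Hypothetical helper: only instructions that actually access memory consume
    // the MaxMemoryUsesToScan budget, so long chains of address arithmetic
    // (GEPs, casts, phis) no longer exhaust it by themselves.
    #include "llvm/IR/Instructions.h"

    static bool countsTowardScanLimit(const llvm::Instruction *UserI) {
      return llvm::isa<llvm::LoadInst>(UserI) || llvm::isa<llvm::StoreInst>(UserI);
    }

FindAllMemoryUses would then bump its counter only when this returns true for the user being visited, instead of on every use edge.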

rotateright commented 7 years ago

PR33900 IR compile time explosion

Tested with:
$ time opt -S -codegenprepare -o /dev/null llvm/llvm-project#33247_compile_time_explosion.ll

db4cdc3a-4335-4c8c-9bf1-3141f9c3cba2 commented 7 years ago

The benchmark in Bug 33914 was not affected by the compile-time explosion issue. The reproducer in Bug 33900 should be sufficient.

rotateright commented 7 years ago

A reduction from geekbench (this bug) or EEMBC (bug 33914) would actually be the best way to solve this. I don't have access to either of these benchmarks.

rotateright commented 7 years ago

> Sanjay, with the 5.0rc2 release being a couple of days away and our desire to keep this release above the 4.0 baseline, would it be possible to ensure that the regression be fixed in the 5.0 branch by either:
>
>   • reverting r308322 and r308891 from the 5.0 branch
>   • landing a fix in the 5.0 branch?

I'd really like to keep the memcmp expansion in the 5.0 release because that has measurable benefits for small memcmp perf as mentioned in comment 1.

I still don't have an understanding of bug 33900 (MaxMemoryUsesToScan) and how memcmp expansion triggers that.

Do you or Nikolai have a reduction that shows that problem? My strong preference would be to fix that if that's the real problem.

db4cdc3a-4335-4c8c-9bf1-3141f9c3cba2 commented 7 years ago

Sanjay, with the 5.0rc2 release being a couple of days away and our desire to keep this release above the 4.0 baseline, would it be possible to ensure that the regression be fixed in the 5.0 branch by either:

  • reverting r308322 and r308891 from the 5.0 branch
  • landing a fix in the 5.0 branch?

Thanks, Zvi

rotateright commented 7 years ago

> I believe this particular regression is caused not by the memcmp expansion itself, but by the new MaxMemoryUsesToScan limit (r308891). If this limit gets hit in, for example, the MultiSource/Applications/lua/lua test, then we have the mentioned regression. To me it seems that the underlying problem is that AddressingModeMatcher is used inefficiently - it keeps visiting the same instruction again and again. As a result, the limit is reached before matching a desired address computation even in non-pathological cases.

Thanks - that's a relief; at least my understanding of the memcmp expansion's instruction timing is probably ok in this case. I guess I need to look closer at the code in bug 33900 to see how memcmp expansion is triggering the other problem.

llvmbot commented 7 years ago

Hi Sanjay,

I believe this particular regression is caused not by the memcmp expansion itself, but by the new MaxMemoryUsesToScan limit (r308891). If this limit gets hit in, for example, the MultiSource/Applications/lua/lua test, then we have the mentioned regression. To me it seems that the underlying problem is that AddressingModeMatcher is used inefficiently - it keeps visiting the same instruction again and again. As a result, the limit is reached before matching a desired address computation even in non-pathological cases.

db4cdc3a-4335-4c8c-9bf1-3141f9c3cba2 commented 7 years ago

First, a correction to a typo in the timeline above (thanks, Craig):

  • r308321 - GeekBench4/Lua was measured with score X.
  • r308322 - GeekBench4/Lua started failing due to a compile-time timeout (same as llvm/llvm-project#33247).
  • r308891 - GeekBench4/Lua back to life, but with score X-12%. <--- Correction

Sanjay, I passed on a request for the experiments you suggested. I will not be able to run them myself, as I will be away for a few days. Having said that, IMO we should not isolate the compile-time problem from the performance regression, because the fix for the compile-time problem (r308891) may be the root cause of the perf regression.

rotateright commented 7 years ago

Although this might be upside-down, I want to get more details about the run-time perf problem before the compile-time issue (because I don't know anything about the compile-time problem yet!).

First, a question about the geekbench results - do you see any difference with these recent commits:

  1. https://reviews.llvm.org/rL308986 (less expansion)?
  2. https://reviews.llvm.org/rL309597 (less cmov)?
  3. https://reviews.llvm.org/rL309711 (narrower ops)?

I've updated my test program attached to bug 33329 to time the system memcmp and time calls with misaligned buffers.

For the motivating cases in that PR (16 and 32 byte memcmps), the inlined code benchmarks faster than Linux __memcmp_sse4_1 even for what I think is the Linux implementation's best case - buffers that are equal in all bytes.

I don't have a Skylake system to test on, but I ran on Haswell and Jaguar.

For Haswell, it appears to take at least 9 cycles to execute the best-case memcmp of 16 equal bytes. The scalar inlined code that we produce after r308322 (expand memcmp for x86) improves that to about 6 cycles. On Jaguar, the difference is 15 cycles for the libc call vs. 9 cycles for the inlined code. I'm running Ubuntu 16.04 on the Haswell box and Ubuntu 14.04 on the Jaguar box, but I think the libc memcmp code is largely the same.

Can you try that program on your systems, so we know if that behaves the same for you locally? There must be something I'm not accounting for that allows a call to libc to be faster than inline code.
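For anyone without access to the attached test program, a rough stand-in for the kind of measurement described (an assumption about its shape, not the program from bug 33329; it reports nanoseconds per call rather than cycles):

    #include <chrono>
    #include <cstdio>
    #include <cstring>

    int main() {
      // Best case from the discussion above: 16-byte buffers that compare equal.
      alignas(16) char A[16] = {}, B[16] = {};
      // Call through a volatile function pointer so the compiler cannot expand
      // or fold the comparison away; this always measures the libc memcmp.
      int (*volatile Cmp)(const void *, const void *, std::size_t) = std::memcmp;
      const long Iters = 100000000;
      int Acc = 0;
      auto T0 = std::chrono::steady_clock::now();
      for (long I = 0; I < Iters; ++I)
        Acc += Cmp(A, B, sizeof(A));
      auto T1 = std::chrono::steady_clock::now();
      double Ns = std::chrono::duration<double, std::nano>(T1 - T0).count() / Iters;
      std::printf("libc memcmp, 16 equal bytes: %.2f ns/call (acc=%d)\n", Ns, Acc);
    }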