llvm / llvm-project


MemorySSA could cache alias query results more aggressively #32473

Open llvmbot opened 7 years ago

llvmbot commented 7 years ago
Bugzilla Link 33126
Version trunk
OS Windows NT
Reporter LLVM Bugzilla Contributor
CC @davidbolvansky, @fhahn, @gburgessiv

Extended Description

Dumping this one here for whoever gets to it first (as I have a terrible memory, and will probably forget in a day or two).

Dan ran `./opt -memoryssa` on clang.bc and the statistics show:

```
1175012 memoryssa - Number of MemorySSA alias queries
 360784 memoryssa - Number of MemorySSA duplicate pointer queries
```

That is a pretty significant number of duplicate queries. In experiments, caching everything saves 20% of MemorySSA time, but the amount of memory used could be a concern for large testcases (e.g. LTO).

A smarter cache could probably deliver substantial benefits without memory usage going crazy. It's unclear what the best policy is; maybe LRU is a good starting point.
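As a hedged sketch of what an LRU policy over alias queries could look like (the key type, the `AliasResult` stand-in, and the eviction details are assumptions, not MemorySSA's actual code):

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>

// Stand-in for llvm::AliasResult, used loosely here.
enum class AliasResult { NoAlias, MayAlias, MustAlias };

// A bounded LRU cache mapping a pair of pointer keys (the two locations
// being queried) to a cached alias result. Eviction keeps memory bounded
// even on huge (e.g. LTO) inputs.
class AliasLRUCache {
  using Key = std::pair<uintptr_t, uintptr_t>;
  struct KeyHash {
    std::size_t operator()(const Key &K) const {
      return std::hash<uintptr_t>()(K.first) * 31u ^
             std::hash<uintptr_t>()(K.second);
    }
  };
  using Entry = std::pair<Key, AliasResult>;

  std::size_t Capacity;
  std::list<Entry> Order; // front = most recently used
  std::unordered_map<Key, std::list<Entry>::iterator, KeyHash> Index;

public:
  explicit AliasLRUCache(std::size_t Capacity) : Capacity(Capacity) {}

  // If cached, refresh the entry's LRU position and return its result.
  bool lookup(const Key &K, AliasResult &Result) {
    auto It = Index.find(K);
    if (It == Index.end())
      return false;
    Order.splice(Order.begin(), Order, It->second); // move to front
    Result = Order.front().second;
    return true;
  }

  // Record a result, evicting the least recently used entry when full.
  void insert(const Key &K, AliasResult R) {
    AliasResult Ignored;
    if (lookup(K, Ignored)) { // already present; lookup moved it to front
      Order.front().second = R;
      return;
    }
    if (Order.size() >= Capacity) {
      Index.erase(Order.back().first); // evict LRU entry
      Order.pop_back();
    }
    Order.emplace_front(K, R);
    Index[K] = Order.begin();
  }
};
```

Since alias queries are symmetric, callers would presumably canonicalize the key (e.g. smaller pointer first) so each unordered pair is cached once.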

llvmbot commented 7 years ago

FWIW: We can also build MemorySSA in parallel, but I don't know how far down that route LLVM wants to go. For optimizing uses, the current stack is a small issue, but you could trivially split up the accesses by MemoryLocation and optimize them separately without locking.

We should also probably start looking at the effectiveness of the use cache. Right now a MemoryLocation has a pointer, a size, and TBAA info.

Eventually we really want MemoryLocation to be pointer, size, and instruction, but let's ignore that for a second.

I wonder how many calls we have where the size differs for the same pointer (but the BasicAA answer does not), and how many where the TBAA node differs but the answer would not.

If it weren't so expensive, the way to refine this would be: first test (ptr, ptr); if that is NoAlias, record it and use it for everything.

But I think we need data on how often we fail to cache things due to differing size/TBAA info.

(I'm trying to gather stats now)
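As a hedged sketch of that pointer-pair refinement (the types and the `aliasPointersOnly` oracle are stand-ins, not LLVM's actual APIs): cache NoAlias keyed only on the pointer pair, since two pointers that never alias keep that answer for every size/TBAA variant of the query.

```cpp
#include <functional>
#include <unordered_set>
#include <utility>

// Stand-ins; a real version would use llvm::Value and AliasAnalysis.
struct Value {};
enum class AliasResult { NoAlias, MayAlias, MustAlias };

// Hypothetical expensive oracle: an alias query over just the two
// pointers, ignoring access size and TBAA (e.g. with unknown size).
AliasResult aliasPointersOnly(const Value *, const Value *) {
  return AliasResult::MayAlias; // stub for illustration
}

using PtrPair = std::pair<const Value *, const Value *>;
struct PtrPairHash {
  std::size_t operator()(const PtrPair &P) const {
    return std::hash<const Value *>()(P.first) * 31u ^
           std::hash<const Value *>()(P.second);
  }
};

class NoAliasPairCache {
  // Pairs already proven NoAlias independent of size/TBAA.
  std::unordered_set<PtrPair, PtrPairHash> KnownNoAlias;

  // Alias queries are symmetric, so store each unordered pair once.
  static PtrPair canonical(const Value *A, const Value *B) {
    return std::less<const Value *>()(A, B) ? PtrPair(A, B) : PtrPair(B, A);
  }

public:
  // True if the pointers are known not to alias for *any* access size
  // or TBAA tag; otherwise the caller falls back to a full query.
  bool isNoAlias(const Value *A, const Value *B) {
    PtrPair Key = canonical(A, B);
    if (KnownNoAlias.count(Key))
      return true;
    if (aliasPointersOnly(A, B) == AliasResult::NoAlias) {
      KnownNoAlias.insert(Key);
      return true;
    }
    return false;
  }
};
```

Note this only short-circuits NoAlias answers; May/Must results still need the full (size, TBAA)-sensitive query, which is exactly why it works as a cheap first-level filter.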

gburgessiv commented 7 years ago

What a test case. :)

I looked into the timing a bit, and > 90% of the time spent in MemorySSA when optimizing the whole program is seemingly spent building MSSA for one function: @grn_expr_exec.

That function is nearly 450,000 lines of IR, it has over 7,700 allocas, and instruction numbering goes up to 282315.

Looks like a great way to find the hot parts of MSSA. :)

All of that said, I remember Danny and I talked about putting some trivial function size/shape tracking into MSSA a while back. With that, we could decide ahead of time how aggressively to optimize the MemoryAccesses in a function (e.g. whether to look beyond branches, ...) based on how unwieldy the function looks. O1/O2/O3 would have increasing limits, etc.

I don't know how much that would help in the attached case, but...

llvmbot commented 7 years ago

Hi George, thanks for your feedback! I have seen several papers describing the technique of using PGO information to reduce compile time by not optimizing, or optimizing less aggressively, functions that are found to be cold. Yours seems to be a profile-less version of it. Personally, I think it's worth a try.
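Purely as an illustration of the size/shape gating idea above, here is a sketch with invented thresholds and names (nothing below is an actual MemorySSA knob):

```cpp
#include <cstdint>

// Cheap, precomputed statistics about a function.
struct FunctionShape {
  uint64_t NumInstructions; // rough size of the function
  uint64_t NumBlocks;       // CFG size, a proxy for branchiness
};

// Per-function limits for MemorySSA use optimization.
struct MSSAOptLimits {
  unsigned MaxWalkSteps;   // how far a use-optimizing walk may go
  bool LookBeyondBranches; // whether to walk through control flow
};

// Choose limits from the opt level (1/2/3) and the function's shape:
// higher opt levels tolerate bigger functions before backing off.
MSSAOptLimits chooseLimits(unsigned OptLevel, const FunctionShape &Shape) {
  // Budget grows with opt level; shrink it for very large functions.
  uint64_t Budget = 100u * OptLevel * OptLevel;
  if (Shape.NumInstructions > 100000)
    Budget /= 10; // e.g. a 450k-line function gets a much smaller budget
  MSSAOptLimits Limits;
  Limits.MaxWalkSteps = static_cast<unsigned>(Budget);
  Limits.LookBeyondBranches = Shape.NumBlocks < 10000 || OptLevel >= 3;
  return Limits;
}
```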

llvmbot commented 7 years ago

And Bugzilla expresses dissatisfaction with the size of my testcase, so: https://drive.google.com/open?id=0BypHPYysMK9iVS1XalZTaHJjUlU

llvmbot commented 7 years ago

Not immediately related, but something that falls under MemorySSA improvements.

I had this test (attached) where GVN was taking 30% of the whole build time (at `opt -O2`):

```
  Total Execution Time: 9.4961 seconds (9.4933 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   3.0191 ( 32.0%)   0.0038 (  6.9%)   3.0229 ( 31.8%)   3.0228 ( 31.8%)  Global Value Numbering
   0.6682 (  7.1%)   0.0007 (  1.3%)   0.6689 (  7.0%)   0.6688 (  7.0%)  Combine redundant instructions
   0.4570 (  4.8%)   0.0018 (  3.3%)   0.4588 (  4.8%)   0.4588 (  4.8%)  Combine redundant instructions
   0.3517 (  3.7%)   0.0004 (  0.7%)   0.3521 (  3.7%)   0.3521 (  3.7%)  Combine redundant instructions
   0.3283 (  3.5%)   0.0008 (  1.5%)   0.3291 (  3.5%)   0.3290 (  3.5%)  Combine redundant instructions
   0.3227 (  3.4%)   0.0006 (  1.0%)   0.3232 (  3.4%)   0.3232 (  3.4%)  Combine redundant instructions
   0.3130 (  3.3%)   0.0004 (  0.8%)   0.3134 (  3.3%)   0.3133 (  3.3%)  Combine redundant instructions
   0.3112 (  3.3%)   0.0005 (  1.0%)   0.3117 (  3.3%)   0.3117 (  3.3%)  Combine redundant instructions
```

If I turn on NewGVN:

```
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 8.5045 seconds (8.5015 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.3051 ( 15.5%)   0.0018 (  2.7%)   1.3069 ( 15.4%)   1.3069 ( 15.4%)  Memory SSA
   0.6676 (  7.9%)   0.0014 (  2.2%)   0.6691 (  7.9%)   0.6690 (  7.9%)  Combine redundant instructions
   0.6210 (  7.4%)   0.0135 ( 20.5%)   0.6345 (  7.5%)   0.6345 (  7.5%)  Global Value Numbering
   0.4766 (  5.6%)   0.0034 (  5.2%)   0.4800 (  5.6%)   0.4800 (  5.6%)  Combine redundant instructions
   0.3729 (  4.4%)   0.0005 (  0.8%)   0.3735 (  4.4%)   0.3734 (  4.4%)  Combine redundant instructions
   0.3370 (  4.0%)   0.0013 (  1.9%)   0.3382 (  4.0%)   0.3381 (  4.0%)  Combine redundant instructions
   0.3247 (  3.8%)   0.0007 (  1.1%)   0.3254 (  3.8%)   0.3253 (  3.8%)  Combine redundant instructions
   0.3226 (  3.8%)   0.0002 (  0.2%)   0.3227 (  3.8%)   0.3227 (  3.8%)  Combine redundant instructions
   0.3033 (  3.6%)   0.0007 (  1.1%)   0.3040 (  3.6%)   0.3040 (  3.6%)  Combine redundant instructions
```

That is good, because whole opt time goes down by about 10%, but now MemorySSA is the top pass. Not an immediate concern, but I'm putting it here while I'm at it.

llvmbot commented 7 years ago

The current speed of caching also depends on hashing and equality-comparing MemoryLocations, which currently store a bunch of stuff (TBAA nodes). Long term, we don't want that in MemoryLocation anyway.

In an ideal world, we would assign IDs to MemoryLocation pointers and build a lower-triangular bitmap over (ID, ID) pairs that we could test in constant time, or something like that.

Right now, caching the results takes as long as computing them :(
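That bitmap could look roughly like this; a hedged sketch that assumes dense IDs assigned on first sight and a single "known NoAlias" bit per unordered pointer pair:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

class TriangularNoAliasMap {
  std::unordered_map<const void *, uint32_t> Ids; // pointer -> dense ID
  std::vector<bool> Bits; // lower triangle, one bit per unordered ID pair

  // Assign the next dense ID the first time a pointer is seen, growing
  // the triangle to N*(N+1)/2 bits for N IDs.
  uint32_t idFor(const void *P) {
    auto [It, Inserted] = Ids.try_emplace(P, (uint32_t)Ids.size());
    if (Inserted) {
      std::size_t N = Ids.size();
      Bits.resize(N * (N + 1) / 2, false);
    }
    return It->second;
  }

  // Flattened offset of (Hi, Lo), with Hi >= Lo, in the lower triangle.
  static std::size_t index(uint32_t Hi, uint32_t Lo) {
    return (std::size_t)Hi * (Hi + 1) / 2 + Lo;
  }

public:
  void setNoAlias(const void *A, const void *B) {
    uint32_t I = idFor(A), J = idFor(B);
    if (I < J)
      std::swap(I, J);
    Bits[index(I, J)] = true;
  }

  // Constant-time test once both pointers have been assigned IDs.
  bool isNoAlias(const void *A, const void *B) const {
    auto AIt = Ids.find(A), BIt = Ids.find(B);
    if (AIt == Ids.end() || BIt == Ids.end())
      return false;
    uint32_t I = AIt->second, J = BIt->second;
    if (I < J)
      std::swap(I, J);
    return Bits[index(I, J)];
  }
};
```

The triangle for N pointers costs N*(N+1)/2 bits, so memory grows quadratically; in practice one would presumably cap N or only assign IDs to frequently queried pointers.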