Closing this out; once I start performance testing Clang I will have ways to quantify these problems.
I have a way to stabilize the "changing binary" issue by editing the binary so that it maps in additional space, padding it out to a 64MB boundary or so. I haven't been doing my regular performance testing (which motivated this bug) or tried this scheme in such an environment, so I don't yet know whether it works.
OK, so I've run the same tests, but this time I measured (user) time instead of the number of executed instructions, on two machines (Intel Centrino + AMD Opteron). I've also added a few more hash patterns to the test. Given this, the hashes that came out aren't much better than the current one; I would even say the difference is negligible.
That is understandable. However, for something as general purpose as DenseMap I am a little more concerned that the hash function avoid rare bad behavior than be as fast as possible. But that is somewhat a matter of taste...
Daniel,
I did some measurements with more advanced hash functions in the past, and had a hard time getting the improvement seen in reduced collisions to make up for the increased computational cost of the functions.
Nuno,
Would you care to undertake trying out a more sophisticated hash function and trying to evaluate its performance on a few of the tools which use DenseMaps?
Here is an example of a more sophisticated function which does a better job of scattering the bits: http://burtleburtle.net/bob/hash/doobs.html
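For reference, a minimal sketch of what that page's "one-at-a-time" hash looks like when run over the bytes of a pointer value. This is only an illustration of the idea, not the change that ended up being committed, and the DenseMapInfo wiring in the trailing comment is hypothetical.

```cpp
// Illustrative sketch: Bob Jenkins' "one-at-a-time" hash (from the page
// linked above) applied to the bytes of a pointer value.  It scatters
// bits much more thoroughly than a simple shift/xor of the pointer, at
// the cost of more work per lookup.
#include <cstdint>
#include <cstring>

static unsigned oneAtATimePointerHash(const void *PtrVal) {
  uintptr_t Raw = reinterpret_cast<uintptr_t>(PtrVal);
  unsigned char Bytes[sizeof(Raw)];
  std::memcpy(Bytes, &Raw, sizeof(Raw));

  uint32_t Hash = 0;
  for (unsigned i = 0; i != sizeof(Raw); ++i) {
    Hash += Bytes[i];
    Hash += Hash << 10;
    Hash ^= Hash >> 6;
  }
  Hash += Hash << 3;
  Hash ^= Hash >> 11;
  Hash += Hash << 15;
  return Hash;
}

// Hypothetical use inside a DenseMapInfo<T*> specialization:
//   static unsigned getHashValue(const T *PtrVal) {
//     return oneAtATimePointerHash(PtrVal);
//   }
```

Note Nuno's later timing measurements earlier in this thread: the fancier hashes were not measurably faster end to end, so the extra per-lookup mixing cost is a real consideration.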
That hash function makes a lot of logical sense to me! Please commit it, thanks Nuno.
Attachment: script to test hash functions. A PHP script that tests a set of hash functions against a set of inputs and orders them by the total number of instructions reported by callgrind. The .tgz also includes the raw stats and a script to view them.
ok, so after a night of computations I got these results (top 5):
alloc-12000.log => 5,194,316 alloc-24000.log => 5,228,262
alloc-12000.log => 5,211,704 alloc-24000.log => 5,216,543
alloc-12000.log => 5,214,030 alloc-24000.log => 5,217,020
alloc-12000.log => 5,233,907 alloc-24000.log => 5,217,513
alloc-12000.log => 5,250,147 alloc-24000.log => 5,208,929
the current one:
alloc-12000.log => 5,656,294 alloc-24000.log => 6,428,116
so the first one gives a ~14% improvement over the current one.
Cool Nuno!
It would also be interesting to set up the test so that it spends most of its time in the DenseMap, and then see how much the hashing function affects actual performance.
I'm also curious whether you can quantify where the gain is coming from. Maybe run in a debug build and attach the callgrind output from before and after your change? Are we just winning because we do less probing, or is it something more subtle, like ending up rehashing in one situation and not the other?
Great work Daniel! The problem here is the poor hashing function of DenseMap (lines 38-41 of DenseMap.h). With your test data I was able to reduce the number of instructions by 1,000,000(!) by changing the magic numbers there to 3 and 8 (instead of 4 and 9). Making good hash functions is such a difficult art...
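For context, the pointer hash being discussed is, as far as I can tell, an xor of two right-shifted copies of the pointer value; the "magic numbers" are the two shift amounts. A rough before/after sketch of the tweak described above (the surrounding DenseMapInfo declarations are omitted):

```cpp
// Sketch of the pointer hash under discussion; not a verbatim copy of
// DenseMap.h, just the shape of the change.
#include <cstdint>

// Current form: xor of the pointer shifted right by 4 and by 9.
static unsigned getHashValueCurrent(const void *PtrVal) {
  return (unsigned)((uintptr_t)PtrVal >> 4) ^
         (unsigned)((uintptr_t)PtrVal >> 9);
}

// The tweak: shifts of 3 and 8 instead, which hashed the logged
// addresses from this test noticeably better.
static unsigned getHashValueTweaked(const void *PtrVal) {
  return (unsigned)((uintptr_t)PtrVal >> 3) ^
         (unsigned)((uintptr_t)PtrVal >> 8);
}
```

Whether 3/8 stays better than 4/9 depends entirely on the address distribution, which is part of why the thread keeps coming back to measuring against real allocation logs.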
I augmented getMacroInfo and setMacroInfo to log the calls to the DenseMap, and added an argument to just allocate some N bytes at the start of execution. I attached a simple tool to replay the log and the logs with the min & max insn counts.
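The replay side of such a setup can be very small. A minimal sketch of the idea, not the attached tool, assuming (hypothetically) a log format of one hexadecimal address per line, each of which is simply re-inserted into a pointer-keyed DenseMap so the pointer hash sees the same keys the Preprocessor's Macros map saw:

```cpp
// Minimal log-replay sketch (not the attached tool).  Assumes one hex
// address per line in the log file; each address is inserted into a
// pointer-keyed DenseMap so that lookup/probe behavior depends only on
// the logged addresses.
#include "llvm/ADT/DenseMap.h"
#include <cstdint>
#include <cstdio>

int main(int argc, char **argv) {
  if (argc < 2) {
    fprintf(stderr, "usage: %s <alloc-log>\n", argv[0]);
    return 1;
  }
  FILE *F = fopen(argv[1], "r");
  if (!F)
    return 1;

  llvm::DenseMap<const void *, unsigned> Map;
  unsigned long long Addr;
  unsigned Index = 0;
  while (fscanf(F, "%llx", &Addr) == 1)
    Map[reinterpret_cast<const void *>(static_cast<uintptr_t>(Addr))] = Index++;
  fclose(F);

  printf("replayed %u insertions, %u unique keys\n", Index, Map.size());
  return 0;
}
```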
I have verified that changes to the data segment alter the malloc values for this case.
From the callgrind output it looked like the Macros DenseMap in Preprocessor was one significant source of discrepancies. I just added prints of the arguments in setMacroInfo and added a global array to Preprocessor.cpp and tried a few sizes.
It turns out adding "char buffer[3000];" is enough to bump the addresses, and they shift by exactly 0x1000, so it does appear that the malloc region is returning addresses starting right after wherever the sections are loaded.
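The effect is easy to poke at in isolation. A standalone sketch (not the actual Preprocessor.cpp edit), which just grows the data segment with a global array and prints an early malloc result so two builds can be compared; whether the addresses actually shift depends entirely on the platform's malloc and loader:

```cpp
// Standalone probe: enlarge the data segment with a global array and
// print the address of an early allocation.  Build once with and once
// without (or with a different size for) PaddingBuffer and compare the
// printed malloc addresses.  The 3000-byte size mirrors the value
// mentioned above.
#include <cstdio>
#include <cstdlib>

char PaddingBuffer[3000];

int main() {
  void *P = malloc(64);
  printf("padding at %p, first malloc result at %p\n",
         (void *)PaddingBuffer, P);
  free(P);
  return 0;
}
```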
I think we have two interesting questions: (1) Can we find a set of addresses which trigger poor DenseMap behavior?
(2) Can we alter linking/etc. to make malloc return more consistent addresses?
Here is a collection of darwin x86 binaries and callgrind runs: http://t1.minormatter.com/~ddunbar/r60704-funnyness.tgz
ddunbar@lordcrumb:r60704-funnyness$ ls -l
total 227456
-rwxr-xr-x  1 ddunbar  staff  27722544 Dec 22 16:30 lordcrumb-clang.Debug.r60703
-rwxr-xr-x  1 ddunbar  staff  27722544 Dec 22 16:30 lordcrumb-clang.Debug.r60704
-rwxr-xr-x  1 ddunbar  staff   9863960 Dec 22 16:29 lordcrumb-clang.Release-Asserts.install.r60703
-rwxr-xr-x  1 ddunbar  staff   9868056 Dec 22 16:29 lordcrumb-clang.Release-Asserts.install.r60704
-rwxr-xr-x  1 ddunbar  staff  10763544 Dec 22 16:29 lordcrumb-clang.Release-Asserts.r60703
-rwxr-xr-x  1 ddunbar  staff  10767640 Dec 22 16:29 lordcrumb-clang.Release-Asserts.r60704
-rwxr-xr-x  1 ddunbar  staff   9863960 Dec 22 16:27 smoosh-clang.Release-Asserts.install.r60703
-rwxr-xr-x  1 ddunbar  staff   9863960 Dec 22 16:27 smoosh-clang.Release-Asserts.install.r60704
ddunbar@lordcrumb:r60704-funnyness$ for i in *.r*; do echo "-- $i --"; valgrind --tool=callgrind ./$i -Eonly ~/repos/clang/INPUTS/Cocoa_h.m 2>&1 | tail -1; mv callgrind.out.* $i-callgrind.out; echo ""; done
-- lordcrumb-clang.Debug.r60703 -- ==2580== I refs: 900,436,172
-- lordcrumb-clang.Debug.r60704 -- ==2591== I refs: 900,450,222
-- lordcrumb-clang.Release-Asserts.install.r60703 -- ==2602== I refs: 251,011,172
-- lordcrumb-clang.Release-Asserts.install.r60704 -- ==2613== I refs: 251,004,110
-- lordcrumb-clang.Release-Asserts.r60703 -- ==2623== I refs: 253,149,312
-- lordcrumb-clang.Release-Asserts.r60704 -- ==2633== I refs: 253,101,334
-- smoosh-clang.Release-Asserts.install.r60703 -- ==2649== I refs: 251,011,106
-- smoosh-clang.Release-Asserts.install.r60704 -- ==2659== I refs: 251,015,718
The smoosh binaries are the ones from the machine that shows the performance difference. The lordcrumb ones are easier to look at, though; the biggest discrepancy is in the Release-Asserts build (not installed), which shows about a 50k instruction difference.
Extended Description
clang's performance can vary greatly due to changes which should have had no executable impact.
For example, in my experiments, revision 60704 slowed down clang by 3.5%. This revision only modified the rewriter, and only changed the strings present in the binary.
The root cause of such variations is that changing any code changes the layout of code in the binary. This has two primary effects: (1) Changing code alignment can have significant consequences on modern processors, and may also affect the generated code.
(2) Changing addresses can change the behavior of malloc. This can change cache behavior, and can also change the program's execution behavior in cases where that behavior depends on the actual addresses malloc returns (e.g. hash tables keyed on pointers; a small demonstration is sketched at the end of this report).
Fixing these variations is not inherently good, but it allows clang's performance to be monitored more closely, as variations due to unintended behavior make the results noisy and hard to interpret.
Fixing (1) is hard, but we should investigate (2) to see if there are any simple wins.
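To make point (2) concrete, here is a small demonstration (not taken from the original report) of how a pointer-keyed table's collision pattern can change when only the heap's base address changes. The shift/xor hash mimics the style of the DenseMap pointer hash, and the two base values simulate the heap starting one page apart in two different builds.

```cpp
// Demonstration of point (2): with a pointer-keyed hash table, the bucket
// pattern (and therefore probing cost) depends on the numeric addresses
// the allocator happens to return.  All values here are made up for
// illustration.
#include <cstdint>
#include <cstdio>

static unsigned hashPtr(uintptr_t P) {
  // Same shape as the simple DenseMap pointer hash: xor of two shifts.
  return (unsigned)(P >> 4) ^ (unsigned)(P >> 9);
}

static void report(uintptr_t Base) {
  const unsigned NumBuckets = 64;   // power of two, as in DenseMap
  bool Used[NumBuckets] = {false};
  unsigned Collisions = 0;
  // Simulate 32 allocations of 48 bytes each, starting at Base.
  for (unsigned i = 0; i != 32; ++i) {
    unsigned Bucket = hashPtr(Base + i * 48) & (NumBuckets - 1);
    if (Used[Bucket])
      ++Collisions;
    Used[Bucket] = true;
  }
  printf("base 0x%lx: %u first-probe collisions in %u buckets\n",
         (unsigned long)Base, Collisions, NumBuckets);
}

int main() {
  // Two hypothetical heap bases one page apart, as in the buffer[3000]
  // experiment described earlier in the thread.
  report(0x00a00000);
  report(0x00a01000);
  return 0;
}
```

Whether a given base produces more or fewer collisions is essentially arbitrary, which is exactly why the effect shows up as unexplained noise in performance measurements.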