Closing this out; once I start performance testing Clang I will have ways to quantify these problems.
I have a way to stabilize the "changing binary" issue by editing the binary so that it maps in additional space, padding it out to a 64MB boundary or so. I haven't been doing my regular performance testing (which motivated this bug) or tried this scheme in such an environment, so I don't yet know whether it works.
OK, so I've run the same tests, but this time I measured (user) time instead of the number of executed instructions, on two machines (Intel Centrino + AMD Opteron). I've also added a few more hash patterns to the test. Given this, the hashes that came out aren't much better than the current one; I would even say the difference is negligible.
That is understandable. However, for something as general purpose as DenseMap I am a little more concerned that the hash function avoid rare bad behavior than be as fast as possible. But that is somewhat a matter of taste...
Daniel,
I did some measurements with more advanced hash functions in the past, and had a hard time getting the improvement seen in reduced collisions to make up for the increased computational cost of the functions.
Nuno,
Would you care to undertake trying out a more sophisticated hash function and trying to evaluate its performance on a few of the tools which use DenseMaps?
Here is an example of a more sophisticated function which does a better job of scattering the bits: http://burtleburtle.net/bob/hash/doobs.html
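For reference, a minimal sketch of what that page's "one-at-a-time" hash looks like when run over the bytes of a pointer value. This is only an illustration of the idea, not the change that ended up being committed, and the DenseMapInfo wiring in the trailing comment is hypothetical.

```cpp
// Illustrative sketch: Bob Jenkins' "one-at-a-time" hash (from the page
// linked above) applied to the bytes of a pointer value.  It scatters
// bits much more thoroughly than a simple shift/xor of the pointer, at
// the cost of more work per lookup.
#include <cstdint>
#include <cstring>

static unsigned oneAtATimePointerHash(const void *PtrVal) {
  uintptr_t Raw = reinterpret_cast<uintptr_t>(PtrVal);
  unsigned char Bytes[sizeof(Raw)];
  std::memcpy(Bytes, &Raw, sizeof(Raw));

  uint32_t Hash = 0;
  for (unsigned i = 0; i != sizeof(Raw); ++i) {
    Hash += Bytes[i];
    Hash += Hash << 10;
    Hash ^= Hash >> 6;
  }
  Hash += Hash << 3;
  Hash ^= Hash >> 11;
  Hash += Hash << 15;
  return Hash;
}

// Hypothetical use inside a DenseMapInfo<T*> specialization:
//   static unsigned getHashValue(const T *PtrVal) {
//     return oneAtATimePointerHash(PtrVal);
//   }
```

Note Nuno's later timing measurements earlier in this thread: the fancier hashes were not measurably faster end to end, so the extra per-lookup mixing cost is a real consideration.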
That hash function makes a lot of logical sense to me! Please commit it, thanks Nuno.
Attachment: script to test hash functions. A PHP script that tests a set of hash functions against a set of inputs and orders them by the total number of instructions reported by callgrind. The .tgz also includes the raw stats and a script to view them.
ok, so after a night of computations I got these results (top 5):
alloc-12000.log => 5,194,316 alloc-24000.log => 5,228,262
alloc-12000.log => 5,211,704 alloc-24000.log => 5,216,543
alloc-12000.log => 5,214,030 alloc-24000.log => 5,217,020
alloc-12000.log => 5,233,907 alloc-24000.log => 5,217,513
alloc-12000.log => 5,250,147 alloc-24000.log => 5,208,929
the current one:
alloc-12000.log => 5,656,294 alloc-24000.log => 6,428,116
so the first one gives a ~14% improvement over the current one.
Cool Nuno!
It would also be interesting to set up the test so that it spends most of its time in the DenseMap, and then see how much the hashing function affects actual performance.
I'm also curious whether you can quantify where the gain is coming from. Maybe run in a debug build and attach the callgrind output from before and after your change? Are we just winning because we do less probing, or is it something more subtle, like ending up rehashing in one situation and not the other?
Great work Daniel! The problem here is the poor hashing function of DenseMap (lines 38-41 of DenseMap.h). With your test data I was able to reduce the number of instructions by 1,000,000(!) by changing the magic numbers there to 3 and 8 (instead of 4 and 9). Making good hash functions is such a difficult art...
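For context, the pointer hash being discussed is, as far as I can tell, an xor of two right-shifted copies of the pointer value; the "magic numbers" are the two shift amounts. A rough before/after sketch of the tweak described above (the surrounding DenseMapInfo declarations are omitted):

```cpp
// Sketch of the pointer hash under discussion; not a verbatim copy of
// DenseMap.h, just the shape of the change.
#include <cstdint>

// Current form: xor of the pointer shifted right by 4 and by 9.
static unsigned getHashValueCurrent(const void *PtrVal) {
  return (unsigned)((uintptr_t)PtrVal >> 4) ^
         (unsigned)((uintptr_t)PtrVal >> 9);
}

// The tweak: shifts of 3 and 8 instead, which hashed the logged
// addresses from this test noticeably better.
static unsigned getHashValueTweaked(const void *PtrVal) {
  return (unsigned)((uintptr_t)PtrVal >> 3) ^
         (unsigned)((uintptr_t)PtrVal >> 8);
}
```

Whether 3/8 stays better than 4/9 depends entirely on the address distribution, which is part of why the thread keeps coming back to measuring against real allocation logs.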
I augmented getMacroInfo and setMacroInfo to log the calls to the DenseMap, and added an argument to just allocate some N bytes at the start of execution. I attached a simple tool to replay the log and the logs with the min & max insn counts.
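The replay side of such a setup can be very small. A minimal sketch of the idea, not the attached tool, assuming (hypothetically) a log format of one hexadecimal address per line, each of which is simply re-inserted into a pointer-keyed DenseMap so the pointer hash sees the same keys the Preprocessor's Macros map saw:

```cpp
// Minimal log-replay sketch (not the attached tool).  Assumes one hex
// address per line in the log file; each address is inserted into a
// pointer-keyed DenseMap so that lookup/probe behavior depends only on
// the logged addresses.
#include "llvm/ADT/DenseMap.h"
#include <cstdint>
#include <cstdio>

int main(int argc, char **argv) {
  if (argc < 2) {
    fprintf(stderr, "usage: %s <alloc-log>\n", argv[0]);
    return 1;
  }
  FILE *F = fopen(argv[1], "r");
  if (!F)
    return 1;

  llvm::DenseMap<const void *, unsigned> Map;
  unsigned long long Addr;
  unsigned Index = 0;
  while (fscanf(F, "%llx", &Addr) == 1)
    Map[reinterpret_cast<const void *>(static_cast<uintptr_t>(Addr))] = Index++;
  fclose(F);

  printf("replayed %u insertions, %u unique keys\n", Index, Map.size());
  return 0;
}
```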
I have verified that changes to the data segment alter the malloc values for this case.
From the callgrind output it looked like the Macros DenseMap in Preprocessor was one significant source of discrepancies. I just added prints of the arguments in setMacroInfo and added a global array to Preprocessor.cpp and tried a few sizes.
It turns out adding "char buffer[3000];" is enough to bump the addresses, and they shift by exactly 0x1000, so it does appear that the malloc region is returning addresses starting right after wherever the sections are loaded.
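The effect is easy to poke at in isolation. A standalone sketch (not the actual Preprocessor.cpp edit), which just grows the data segment with a global array and prints an early malloc result so two builds can be compared; whether the addresses actually shift depends entirely on the platform's malloc and loader:

```cpp
// Standalone probe: enlarge the data segment with a global array and
// print the address of an early allocation.  Build once with and once
// without (or with a different size for) PaddingBuffer and compare the
// printed malloc addresses.  The 3000-byte size mirrors the value
// mentioned above.
#include <cstdio>
#include <cstdlib>

char PaddingBuffer[3000];

int main() {
  void *P = malloc(64);
  printf("padding at %p, first malloc result at %p\n",
         (void *)PaddingBuffer, P);
  free(P);
  return 0;
}
```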
I think we have two interesting questions: (1) Can we find a set of addresses which trigger poor DenseMap behavior?
(2) Can we alter linking/etc. to make malloc return more consistent addresses?
Here is a collection of darwin x86 binaries and callgrind runs: http://t1.minormatter.com/~ddunbar/r60704-funnyness.tgz
ddunbar@lordcrumb:r60704-funnyness$ ls -l
total 227456
-rwxr-xr-x  1 ddunbar  staff  27722544 Dec 22 16:30 lordcrumb-clang.Debug.r60703
-rwxr-xr-x  1 ddunbar  staff  27722544 Dec 22 16:30 lordcrumb-clang.Debug.r60704
-rwxr-xr-x  1 ddunbar  staff   9863960 Dec 22 16:29 lordcrumb-clang.Release-Asserts.install.r60703
-rwxr-xr-x  1 ddunbar  staff   9868056 Dec 22 16:29 lordcrumb-clang.Release-Asserts.install.r60704
-rwxr-xr-x  1 ddunbar  staff  10763544 Dec 22 16:29 lordcrumb-clang.Release-Asserts.r60703
-rwxr-xr-x  1 ddunbar  staff  10767640 Dec 22 16:29 lordcrumb-clang.Release-Asserts.r60704
-rwxr-xr-x  1 ddunbar  staff   9863960 Dec 22 16:27 smoosh-clang.Release-Asserts.install.r60703
-rwxr-xr-x  1 ddunbar  staff   9863960 Dec 22 16:27 smoosh-clang.Release-Asserts.install.r60704
ddunbar@lordcrumb:r60704-funnyness$ for i in *.r*; do echo "-- $i --"; valgrind --tool=callgrind ./$i -Eonly ~/repos/clang/INPUTS/Cocoa_h.m 2>&1 | tail -1; mv callgrind.out.* $i-callgrind.out; echo ""; done
-- lordcrumb-clang.Debug.r60703 -- ==2580== I refs: 900,436,172
-- lordcrumb-clang.Debug.r60704 -- ==2591== I refs: 900,450,222
-- lordcrumb-clang.Release-Asserts.install.r60703 -- ==2602== I refs: 251,011,172
-- lordcrumb-clang.Release-Asserts.install.r60704 -- ==2613== I refs: 251,004,110
-- lordcrumb-clang.Release-Asserts.r60703 -- ==2623== I refs: 253,149,312
-- lordcrumb-clang.Release-Asserts.r60704 -- ==2633== I refs: 253,101,334
-- smoosh-clang.Release-Asserts.install.r60703 -- ==2649== I refs: 251,011,106
-- smoosh-clang.Release-Asserts.install.r60704 -- ==2659== I refs: 251,015,718
The smoosh binaries are the ones from the machine that shows the performance difference. The lordcrumb ones are easier to look at, though; the biggest discrepancy is in the Release-Asserts build (not installed), which shows about a 50k instruction difference.
Extended Description
clang's performance can vary greatly due to changes which should have had no executable impact.
For example, in my experiments, revision 60704 slowed down clang by 3.5%. This revision only modified the rewriter, and only changed the strings present in the binary.
The root cause of such variations is that changing any code changes the layout of code in the binary. This has two primary effects: (1) Changing code alignment can have significant consequences on modern processors, and may also affect the generated code.
(2) Changing addresses can change the behavior of malloc. This can change cache behavior, and can also change the program's execution behavior in cases where that behavior depends on the actual addresses malloc returns (e.g. hash tables keyed on pointers; a small demonstration is sketched at the end of this report).
Fixing these variations is not inherently good, but it allows clang's performance to be monitored more closely, as variations due to unintended behavior make the results noisy and hard to interpret.
Fixing (1) is hard, but we should investigate (2) to see if there are any simple wins.
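To make point (2) concrete, here is a small demonstration (not taken from the original report) of how a pointer-keyed table's collision pattern can change when only the heap's base address changes. The shift/xor hash mimics the style of the DenseMap pointer hash, and the two base values simulate the heap starting one page apart in two different builds.

```cpp
// Demonstration of point (2): with a pointer-keyed hash table, the bucket
// pattern (and therefore probing cost) depends on the numeric addresses
// the allocator happens to return.  All values here are made up for
// illustration.
#include <cstdint>
#include <cstdio>

static unsigned hashPtr(uintptr_t P) {
  // Same shape as the simple DenseMap pointer hash: xor of two shifts.
  return (unsigned)(P >> 4) ^ (unsigned)(P >> 9);
}

static void report(uintptr_t Base) {
  const unsigned NumBuckets = 64;   // power of two, as in DenseMap
  bool Used[NumBuckets] = {false};
  unsigned Collisions = 0;
  // Simulate 32 allocations of 48 bytes each, starting at Base.
  for (unsigned i = 0; i != 32; ++i) {
    unsigned Bucket = hashPtr(Base + i * 48) & (NumBuckets - 1);
    if (Used[Bucket])
      ++Collisions;
    Used[Bucket] = true;
  }
  printf("base 0x%lx: %u first-probe collisions in %u buckets\n",
         (unsigned long)Base, Collisions, NumBuckets);
}

int main() {
  // Two hypothetical heap bases one page apart, as in the buffer[3000]
  // experiment described earlier in the thread.
  report(0x00a00000);
  report(0x00a01000);
  return 0;
}
```

Whether a given base produces more or fewer collisions is essentially arbitrary, which is exactly why the effect shows up as unexplained noise in performance measurements.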