Good catch recognizing the extra loads and stores, Craig. :)
I figured the difference revolved around stack vs. heap storage, as opposed to raw arrays vs. encapsulated arrays: The equal performance of std::array and raw stack arrays indicated as much, but it's nice to have confirmation. I can't actually change cmwc_state_array to use heap allocation though, since that's what cmwc_state_vector is for: The point of cmwc_state_array is to provide an option for stack allocation in the event someone wants to contiguously store a large number of small, independent generators, or if the heap isn't an option for some other reason. Those are probably niche use cases, but the general problem of subpar code generation around stack arrays is bound to pop up in a lot of other contexts as well.
Anyway, I guess the implication here is that "larger" stack objects may be making LLVM "give up" on keeping frequently-used members in registers, or something along those lines. Unfortunately, even a 32-byte object seems to qualify as large enough to run up against this behavior.
Strangely enough though, for the specific test case I've given, the vector and array implementations both presumably take up the same 32 bytes of state on the stack anyway (given a 64-bit architecture). Both store the carry and the circular buffer index on the stack, and I'm guessing the four 32-bit values in the array version occupy the same space as what I expect are a 64-bit heap pointer and a 64-bit size member in the vector version.
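In other words, something like this (a minimal sketch of that guess; the member names and the pointer-plus-size vector representation are my assumptions, and a real std::vector typically holds three pointers):

#include <cstddef>
#include <cstdint>

struct array_state
{
    std::uint32_t state_[4];       // 16-byte circular buffer, stored in place
    std::uint_fast64_t carry_;     // 8 bytes
    std::size_t circular_index_;   // 8 bytes -> 32 bytes total
};

struct vector_state
{
    std::uint32_t* data_;          // 8-byte heap pointer (assumed)
    std::size_t size_;             // 8-byte size member (assumed)
    std::uint_fast64_t carry_;     // 8 bytes
    std::size_t circular_index_;   // 8 bytes -> 32 bytes total
};

static_assert(sizeof(array_state) == sizeof(vector_state),
              "same stack footprint under these assumptions");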
The compiler's job seems easier for the vector version, but I can see a few things that could be causing the difference. I'm just thinking aloud here, of course...
One interesting thing I noticed: the array version ends up with stores and loads to circular_index_ and carry_ inside the loop. In the vector version, these are able to stay in registers through the loop.
If you make cmwc_state_array a wrapper around a heap-allocated array instead of a statically sized array that ends up on the stack, it should perform similarly to the std::vector version.
When the stack footprint of the cmwc_engine_minimal object is large, LLVM seems to be preserving all loads/stores to it, including those to circular_index_ and carry_. When it's small, as it would be for a vector or heap-allocated array, LLVM is able to remove the accesses.
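For example, something along these lines (a rough sketch only; I'm assuming the interface, so adjust it to whatever cmwc_state_array actually exposes):

#include <cstddef>
#include <memory>

template <typename T, std::size_t N>
class cmwc_state_heap_array
{
public:
    cmwc_state_heap_array() : data_(new T[N]()) {}  // zero-initialize, as in the test case
    T& operator[](std::size_t i) { return data_[i]; }
    const T& operator[](std::size_t i) const { return data_[i]; }
private:
    std::unique_ptr<T[]> data_;  // only this pointer lives on the stack
};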
Oops, I sent that last comment prematurely.
The newer test case produces bloated assembly due to all of the includes, and benchmarking arrays and vectors in the same executable runs the risk of differences from caching, but it gives an interesting perspective nonetheless. Here are some benchmarks:
1.) Non-inlined BenchmarkRNG(), slow circular buffer updates:
GCC 4.7 Benchmark Time:
Array : 2.440034236s
Vector : 2.365274656s
GCC 4.8 Benchmark Time:
Array : 2.526191167s
Vector : 2.463438398s
Clang 3.2 Benchmark Time:
Array : 2.331447587s
Vector : 2.547286271s
Clang 3.3+ Benchmark Time:
Array : 3.271941898s
Vector : 2.553101321s
2.) Non-inlined BenchmarkRNG(), fast circular buffer updates:
GCC 4.7 Benchmark Time:
Array : 2.104370443s
Vector : 1.901762568s
GCC 4.8 Benchmark Time:
Array : 1.953596915s
Vector : 1.942452867s
Clang 3.2 Benchmark Time:
Array : 2.354230493s
Vector : 1.824467609s
Clang 3.3+ Benchmark Time:
Array : 2.749077046s
Vector : 1.85680768s
3.) Inlined BenchmarkRNG(), slow circular buffer updates:
GCC 4.7 Benchmark Time:
Array : 2.132182857s
Vector : 2.30057284s
GCC 4.8 Benchmark Time:
Array : 2.360156631s
Vector : 2.610849987s
Clang 3.2 Benchmark Time:
Array : 2.116218351s
Vector : 2.003197595s
Clang 3.3+ Benchmark Time:
Array : 2.08847722s
Vector : 2.034692403s
4.) Inlined BenchmarkRNG(), fast circular buffer updates:
GCC 4.7 Benchmark Time:
Array : 2.186518931s
Vector : 1.853176545s
GCC 4.8 Benchmark Time:
Array : 2.211062684s
Vector : 1.847149512s
Clang 3.2 Benchmark Time:
Array : 1.779760716s
Vector : 1.768924408s
Clang 3.3+ Benchmark Time:
Array : 1.848627263s
Vector : 1.830769086s
Using Clang 3.2, arrays are only significantly slower than vectors for case 2. Using Clang 3.3+, arrays are significantly slower than vectors for cases 1 and 2, and both are regressions from Clang 3.2.
[Test case: Self-benchmark arrays vs. vectors with Linux sys/time.h](https://user-images.githubusercontent.com/60944935/143747389-ddb39c59-6e7a-463b-87ba-bf9b92997bf2.gz)
I've added another test case that might help show other conditions under which arrays can be slower than vectors.
This test case uses sys/time.h to benchmark arrays vs. vectors in the same program. It eliminates the demonstrative alignas() code from the previous test and moves the benchmarking loop into its own function template. Since it uses sys/time.h, it needs to be compiled with the -lrt switch.
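The benchmarking function template is along these lines (a simplified sketch of the idea, not the exact code from the attachment):

#include <sys/time.h>
#include <cstdio>

template <typename RNG>
double BenchmarkRNG(RNG& rng, unsigned long long iterations)
{
    timeval begin, end;
    gettimeofday(&begin, 0);
    typename RNG::result_type sink = 0;
    for (unsigned long long i = 0; i < iterations; ++i)
        sink ^= rng.rand();  // keep a data dependency so the loop isn't elided
    gettimeofday(&end, 0);
    std::printf("(ignore: %lu)\n", (unsigned long)sink);  // keep sink observable
    return double(end.tv_sec - begin.tv_sec) + double(end.tv_usec - begin.tv_usec) * 1e-6;
}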
I've run this test case 8 different ways, switching between the following three variables:
With Clang 3.3+:
After some more testing, I have additional details that might help track this down.
This is what I get for doing all of my extensive testing and timing with different code permutations before minimizing the test case: As it turns out, rearranging the order of critical lines in the specific test case I attached has a much greater effect than I previously indicated. Specifically, using the following rand() implementation gives ideal (and comparable) performance for both arrays and vectors:
result_type rand()
{
    const result_type max = (result_type)(b - 1);
    const size_t next_circular_index = (circular_index_ == (r - 1)) ? 0 : circular_index_ + 1;
    const uint_fast64_t mwc = (uint_fast64_t(state_[circular_index_]) * a) + carry_;
    const result_type result = max - (result_type)(mwc % b);
    state_[circular_index_] = result;
    circular_index_ = next_circular_index;
    carry_ = mwc / b;
    return result;
}
The main difference of course is that the final write to the array is done as early as possible, which masks the latency when the optimizer maintains that order. Strangely, vectors still perform much better with this code in the context of my full (messier) test program, so even this can still be a pathological case depending on the surrounding code.
Long story short, the underlying code generation bug still remains: For some reason, there are pathological cases where Clang produces much slower code for array accesses than for vector accesses (perhaps specifically array/vector writes), which is quite unintuitive.
Extended Description
I'm writing a complementary multiply-with-carry random number generator that can be configured with templates to use an std::vector or std::array (or C-style array) for its internal state.
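In outline, the configuration looks something like this (a skeletal sketch; the typedef names are made up, and the real engine also carries the a, b, and r parameters used by rand()):

#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

template <typename State>
class cmwc_engine_minimal
{
public:
    typedef std::uint32_t result_type;
    // ... seeding, plus the rand() member shown elsewhere in this report ...
private:
    State state_;                // std::array: buffer stored in place on the stack;
                                 // std::vector: only a handle here, buffer on the heap
    std::uint_fast64_t carry_;
    std::size_t circular_index_;
};

typedef cmwc_engine_minimal<std::array<std::uint32_t, 4> > array_engine;
typedef cmwc_engine_minimal<std::vector<std::uint32_t> > vector_engine;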
Strangely, Clang/LLVM generates significantly slower code when stack-based arrays are used as a drop-in replacement for vectors. GCC 4.8 generates performance-comparable code for each, and Clang's vector-based code gives comparable speed to GCC. I'm not familiar enough with assembly to know whether Clang is accessing stack arrays in a fundamentally suboptimal way when surrounded by certain computations, or if the stack arrays are merely triggering different code generation for the surrounding computations, but the resulting assembly has a number of differences, and it's quite a bit worse with stack arrays.
Compilation line: clang++ -Wall -Wextra -std=c++11 -msse4.1 -O3 SlowerArrayExample.cpp
Test line: time ./a.out
Array Timing:
real 0m2.875s
user 0m2.833s
sys 0m0.002s
Vector Timing:
real 0m1.994s
user 0m1.969s
sys 0m0.002s
I've controlled for alignment/cache issues by using alignas(128) with the stack arrays, and I've controlled for initialization caching issues by zero-initializing in both cases. My original test program used clock_gettime() around the test loop only, so the issue is related to the array accesses (either reads or writes) in the test loop, not array/vector construction or initialization.
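For reference, the alignment control amounts to something like this (an illustrative sketch; the struct name is made up):

#include <cstdint>

struct aligned_state
{
    // alignas(128) keeps the buffer on its own cache-line boundary even when
    // the enclosing object sits on the stack, ruling out straddling effects.
    alignas(128) std::uint32_t state_[4];
};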
This is pretty unintuitive: I'd expect array/vector access to be equivalent. If there's any performance difference at all, I'd expect it to work in slight favor of stack arrays, since the start address of a stack array can be directly calculated based on the stack pointer, whereas the start address of a vector needs to be loaded from a member pointer.
I've attached the most minimal problem case I can find, but I've tried several things to further isolate the "triggers" for this behavior, and the following information may be helpful:
WHAT SOURCE CHANGES EQUALIZE VECTOR/ARRAY IMPLEMENTATIONS?
The slowdown with stack arrays disappears if I remove all of the computations and turn rand() into something silly like this:
result_type rand()
{
    // Not semantically valid:
    const size_t next_circular_index = (circular_index_ == (r - 1)) ? 0 : circular_index_ + 1;
    result_type result = state_[circular_index_] + a;
    state_[circular_index_] = result;
    circular_index_ = next_circular_index;
    return result;
}
In other words, the slow stack array accesses only emerge in the presence of some minimal amount of surrounding computation. Also, the array and vector versions perform equally [slowly] if I replace the ternary operator (in the original code) with the following:
const size_t next_circular_index = (circular_index_ + 1) % r;
Using a full modulus operation makes the generator run a full three times slower than the original vector version, but the array and vector versions do technically perform equally in that case. However, the array slowdown does not seem specifically related to the branch in the ternary operator, because arrays are still slower than vectors if I replace the line with the following (valid only for power-of-2 r's):
const size_t next_circular_index = (circular_index_ + 1) & (r - 1);
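For reference, here are the three index-update variants side by side (written as free functions purely for illustration):

#include <cstddef>

// Original ternary/branch form: arrays slower than vectors under Clang 3.3+.
std::size_t next_ternary(std::size_t i, std::size_t r) { return (i == r - 1) ? 0 : i + 1; }

// Full modulus: roughly 3x slower overall, but array and vector finally match.
std::size_t next_modulus(std::size_t i, std::size_t r) { return (i + 1) % r; }

// Power-of-two mask (valid only when r is a power of 2): arrays still slower.
std::size_t next_mask(std::size_t i, std::size_t r) { return (i + 1) & (r - 1); }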
WHAT SOURCE CHANGES SEEM IRRELEVANT?
The slowdown doesn't seem to be a result of suboptimal source code, because it's pretty much unaffected by source-level micro-optimization.
That is to say, I've tried all of the following in an attempt to debug this (in addition to what I mentioned above):
Some of these source-level tweaks affect the overall speed of the generator, but the stack version always incurs a slowdown compared to the vector version. The slowdown doesn't seem dependent on the specific presence of multiply or division operators in the source. However, it could still be dependent on shift instructions, "and" instructions, branches, or combinations thereof, since the arithmetically gutted code above doesn't reproduce the problem, and neither does the [slower] code using a modulus operation to update the circular buffer index.