Open llvmbot opened 12 years ago
Tried this again today, and the gcc performance advantage is still present, but much lower - although I had to reduce the test bound from 2 << 14 to 2 << 10 to sit within the default recursion limits in the current compiler. I think this ticket can be closed as resolved, as we are no longer looking at multiple orders of magnitude difference.
The problem is that count_primes(16384) is being evaluated a large number of times. We are apparently not storing the evaluated value for a non-type template argument, and are instead re-evaluating it each time we need it.
(240x slower doesn't sound so bad when we're doing 16384x as much work...)
With this patch (and the include of
@@ -227,11 +227,12 @@
template<unsigned N>
struct PrimeArrayReversed {
- PrimeArrayReverseImpl< typename GeneratePrimePack<count_primes(N)>::type> const VALUE;
+ static constexpr unsigned Primes = count_primes(N);
+ PrimeArrayReverseImpl< typename GeneratePrimePack<Primes>::type> const VALUE;
constexpr PrimeArrayReversed() : VALUE{} {}
- constexpr unsigned operator[](unsigned index) { return VALUE.d_data[count_primes(N)-index]; }
+ constexpr unsigned operator[](unsigned index) { return VALUE.d_data[Primes-index]; }
};
#define UNREVERSE_ARRAY
@@ -248,14 +249,14 @@
template<unsigned N>
struct PrimeArray {
- PrimeArrayImpl< N, typename GenerateLogPack<count_primes(N)>::type > const VALUE;
+ PrimeArrayImpl< N, typename GenerateLogPack<PrimeArrayReversed<N>::Primes>::type > const VALUE;
constexpr PrimeArray() : VALUE(PrimeArrayReversed<N>{}) {}
};
#else
template<unsigned N>
struct PrimeArray {
- PrimeArrayReverseImpl<typename GeneratePrimePack<count_primes(N)>::type> const VALUE;
+ PrimeArrayReverseImpl<typename GeneratePrimePack<PrimeArrayReversed<N>::Primes>::type> const VALUE;
constexpr PrimeArray() : VALUE{} {}
};
@llvm/issue-subscribers-clang-frontend
Extended Description
The attached program is one of my recreational C++11 programs, it tries to compute a large array of prime numbers entirely at compile time, and has been optimized to produce the largest number possible in the shortest compile-time, minimizing compiler resources.
As currently configured, it compiles on gcc 4.7 in approximately one second, but with my current version of Clang in just over 4 minutes. Configuration is essentially one constant towards the end of the file that determines the maximum range of numbers to search. The limiting factor with GCC is to avoid swap-file thrashing, which allows for growing by a factor of another 6 powers of 2.
The Clang version is clearly CPU bound, using only around 50Mb RAM in the attached configuration and not coming close to swapping to page file.
Clang configuration: clang version 3.1 (trunk 153754) Target: x86_64-apple-darwin11.3.0 Thread model: posix
Running on OSX Lion.