Yikai-Liao opened this issue 4 months ago
I believe there is a lot of room to optimise the runtime performance of gcem. I was able to cut the time consumption by about 40% simply by changing the recursion in the tan implementation to a loop:
template<int max_depth, typename T>
constexpr
T
tan_cf_loop(const T xx)
noexcept
{
    // evaluate the continued fraction 1 - xx/(3 - xx/(5 - ...)) iteratively,
    // starting from the innermost term (2*max_depth - 1) and working outward
    T ans = T(2*max_depth - 1);
    for (int depth = max_depth - 1; depth > 0; --depth) {
        ans = T(2*depth - 1) - xx/ans;
    }
    return ans;
}
template<typename T>
constexpr
T
tan_cf_main(const T x)
noexcept
{
    return( (x > T(1.55) && x < T(1.60)) ? \
                tan_series_exp(x) : // deals with a singularity at tan(pi/2)
            //
            x > T(1.4) ? \
                x/tan_cf_loop<45>(x*x) :
            x > T(1)   ? \
                x/tan_cf_loop<35>(x*x) :
            // else
                x/tan_cf_loop<25>(x*x) );
}
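For reference, a rough micro-benchmark along these lines can be used to compare the loop-based continued fraction against std::tan at runtime. It is only a sketch: the inputs are restricted to [0, 1.3) so that neither the singularity branch nor any argument reduction is exercised, and the sink accumulator just keeps the compiler from optimising the calls away.

#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

// assumes tan_cf_loop from above is in scope

int main()
{
    std::vector<double> xs(1000000);
    for (std::size_t i = 0; i < xs.size(); ++i)
        xs[i] = 1.3 * double(i) / double(xs.size());   // inputs in [0, 1.3)

    const auto bench = [&](auto f, const char* name) {
        double sink = 0.0;                             // prevents dead-code elimination
        const auto t0 = std::chrono::steady_clock::now();
        for (double x : xs) sink += f(x);
        const auto t1 = std::chrono::steady_clock::now();
        std::printf("%-10s %8.3f ms  (sum = %f)\n", name,
                    std::chrono::duration<double, std::milli>(t1 - t0).count(), sink);
    };

    bench([](double x) { return std::tan(x); },                "std::tan");
    bench([](double x) { return x / tan_cf_loop<35>(x * x); }, "cf loop");
}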
Also, I don't really understand why gcem goes through tan(x/2) (45 iterations in the worst case) to compute sine and cosine. Approximating sine and cosine with Chebyshev polynomials should be a better choice.
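To illustrate what I mean, here is a rough runtime sketch of the Chebyshev idea (this is not gcem code; in a constexpr library the coefficients would be precomputed once and baked in as constants, so the evaluation reduces to a short Clenshaw recurrence with no 25/35/45-iteration trade-off):

#include <array>
#include <cmath>
#include <cstdio>

constexpr double kPi = 3.141592653589793238462643383279502884;

// fit f on [a, b] with an N-term Chebyshev series via interpolation at the
// Chebyshev nodes; in practice this step would be done offline
template<std::size_t N, typename F>
std::array<double, N> cheb_fit(F f, double a, double b)
{
    std::array<double, N> fx{}, c{};
    for (std::size_t j = 0; j < N; ++j) {
        const double theta = kPi * (double(j) + 0.5) / N;            // node angle
        fx[j] = f(0.5 * (b - a) * std::cos(theta) + 0.5 * (b + a));  // f at node
    }
    for (std::size_t k = 0; k < N; ++k) {
        double s = 0.0;
        for (std::size_t j = 0; j < N; ++j)
            s += fx[j] * std::cos(kPi * k * (double(j) + 0.5) / N);
        c[k] = 2.0 * s / N;
    }
    return c;
}

// evaluate the series with the Clenshaw recurrence
template<std::size_t N>
double cheb_eval(const std::array<double, N>& c, double a, double b, double x)
{
    const double y  = (2.0 * x - a - b) / (b - a);   // map x into [-1, 1]
    const double y2 = 2.0 * y;
    double d = 0.0, dd = 0.0;
    for (std::size_t j = N - 1; j >= 1; --j) {
        const double sv = d;
        d  = y2 * d - dd + c[j];
        dd = sv;
    }
    return y * d - dd + 0.5 * c[0];
}

int main()
{
    const double a = -kPi / 2, b = kPi / 2;
    const auto c = cheb_fit<16>([](double x) { return std::sin(x); }, a, b);

    double max_err = 0.0;
    for (int i = 0; i <= 1000; ++i) {
        const double x = a + (b - a) * i / 1000.0;
        const double e = std::fabs(cheb_eval(c, a, b, x) - std::sin(x));
        if (e > max_err) max_err = e;
    }
    std::printf("max |error| of the 16-term fit on [-pi/2, pi/2]: %.3e\n", max_err);
}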
I have created a pull request (#46) that optimises the trigonometric functions. I'll try to optimise the other functions as well.
I have some functions in my library that need to be callable at both compile time and runtime, and cmath has varying degrees of constexpr support across platforms, so I chose gcem. But when using it, I found that many of gcem's functions are an order of magnitude slower than cmath under -O3 optimisation. I know I could write two versions, one called at compile time and one at runtime, but I'm wondering why gcem is so much slower at runtime.
I've tested this on x86 Linux, Windows, and macOS, compiling with g++, MSVC, and Apple Clang respectively, and all give roughly the same results.
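As a side note on the "two versions" point above: since C++20 both paths can sit behind one wrapper by dispatching on std::is_constant_evaluated(). A minimal sketch, assuming the single-header include gcem.hpp and a C++20 compiler:

#include <cmath>
#include <cstdio>
#include <type_traits>
#include "gcem.hpp"

// constant-evaluated calls go through gcem; ordinary runtime calls fall back
// to the (usually much faster) libm implementation
constexpr double my_tan(double x) noexcept
{
    if (std::is_constant_evaluated())
        return gcem::tan(x);   // compile-time path
    return std::tan(x);        // runtime path
}

int main()
{
    constexpr double ct = my_tan(0.5);   // constant-evaluated: gcem branch
    volatile double  x  = 0.5;           // forces a genuine runtime call below
    const double     rt = my_tan(x);     // runtime: std::tan branch
    std::printf("compile-time: %.17g  runtime: %.17g\n", ct, rt);
}

In C++23 the same dispatch can be written with if consteval instead.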