shibatch / sleef

SIMD Library for Evaluating Elementary Functions, vectorized libm and DFT
https://sleef.org
Boost Software License 1.0
661 stars, 131 forks

Slow Sleef erfc or my bad code? #336

Open mikecroucher opened 4 years ago

mikecroucher commented 4 years ago

I'm using C++ with gcc 9.2 on the Windows Subsystem for Linux. My laptop supports AVX. I fill a vector of x values like this:

int N = 2000000; // number of elements
double start = -50; // minimum x value
double end  = 50;   // maximum x value
double delta = (end - start) / (N - 1);

std::vector<double> x(N);
std::vector<double> y(N);
std::vector<double> y_sleef(N);

for(int count=0;count<N;count++)
{
   x[count] = start + (delta * count);
}

I time the system erfc like this

// System erfc
// Record start time
auto start_time = std::chrono::high_resolution_clock::now();
for(int count=0;count<N;count++)
{
  y[count] = erfc(x[count]);
}
// Record end time
auto finish_time = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = finish_time - start_time;
std::cout << "Elapsed time for gcc erfc " << elapsed.count() << " s\n";

and Sleef like this

//Sleef SIMD
// Record start time
start_time = std::chrono::high_resolution_clock::now();
__m256d vx1,vy1;
for(int count=0;count<N;count=count+4)
{
  vx1 = _mm256_loadu_pd(&x[count]);
  vy1 = Sleef_erfcd4_u15avx(vx1);
  _mm256_storeu_pd(&y_sleef[count],vy1);
}
// Record end time
finish_time = std::chrono::high_resolution_clock::now();
elapsed = finish_time - start_time;
std::cout << "Elapsed time for SIMD sleef " << elapsed.count() << " s\n";

I compile with g++ erfc_test.cpp -o erfc -lsleef -mavx and get the following results:

Elapsed time for gcc erfc 0.0303879 s
Elapsed time for SIMD sleef 0.0703444 s
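A note on the measurement: each figure above comes from a single pass, and the compile line shown has no optimization flag (for example -O2). The thread does not establish whether either of these explains the gap. The following is a minimal sketch of a repeat-and-take-the-best timing helper, assuming only the C++ standard library; `best_time_seconds`, `erfc_scalar` and `checksum` are hypothetical names introduced here for illustration and are not part of Sleef.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not part of Sleef): run `kernel` `repeats` times
// and return the fastest wall-clock time in seconds.
template <typename Kernel>
double best_time_seconds(Kernel&& kernel, int repeats = 5)
{
    assert(repeats > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        kernel();
        auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}

// Scalar reference kernel: y[i] = erfc(x[i]) using the system math library.
inline void erfc_scalar(const std::vector<double>& x, std::vector<double>& y)
{
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = std::erfc(x[i]);
}

// Sum the outputs. Printing this value after timing keeps an optimizing
// compiler from discarding the computation as unused.
inline double checksum(const std::vector<double>& y)
{
    double s = 0.0;
    for (double v : y)
        s += v;
    return s;
}
```

A call such as `best_time_seconds([&] { erfc_scalar(x, y); })` times the scalar loop; the Sleef loop from the post can be wrapped in a lambda the same way, so both kernels are measured under identical conditions.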

It seems that the SIMD Sleef version of erfc is slower than the system one.

I haven't done much SIMD programming, and I am guessing that I am not loading and storing efficiently, but I am not sure what to do about this. Can anyone help me out, please?

shibatch commented 4 years ago

erfc in sleef is slow. I didn't think that people would care about erf or gamma.

mikecroucher commented 4 years ago

OK, thanks. erf and similar functions are not as popular as sin, cos and tan, but they have their uses. Is there anything I could have done better while iterating over the std::vector?

Thanks for a great library btw, I've enjoyed checking it out.
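On the iteration question: the loop in the original post steps by 4 and is safe there because N = 2000000 is a multiple of 4. For other lengths, the last iteration would load and store past the end of the vectors. The following is a minimal sketch of a blocked loop with a scalar tail. `apply_in_blocks_of_4` and `erfc4_standin` are hypothetical names introduced here, and the stand-in kernel uses std::erfc so that the sketch builds without AVX or Sleef.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Sleef): applies `vec4` to every full
// block of 4 doubles and `scalar` to the 0 to 3 leftover elements.
template <typename Vec4Kernel, typename ScalarKernel>
void apply_in_blocks_of_4(const std::vector<double>& x, std::vector<double>& y,
                          Vec4Kernel vec4, ScalarKernel scalar)
{
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    std::size_t i = 0;
    // Full blocks: never touches indices at or beyond n.
    for (; i + 4 <= n; i += 4)
        vec4(&x[i], &y[i]);
    // Tail: handle the remaining elements one at a time.
    for (; i < n; ++i)
        y[i] = scalar(x[i]);
}

// Stand-in for the SIMD kernel (assumption: used only so this builds
// without AVX or Sleef). With both available, the body would be the
// load / Sleef_erfcd4_u15avx / store sequence from the original post:
//   _mm256_storeu_pd(out, Sleef_erfcd4_u15avx(_mm256_loadu_pd(in)));
inline void erfc4_standin(const double* in, double* out)
{
    for (int k = 0; k < 4; ++k)
        out[k] = std::erfc(in[k]);
}
```

Calling `apply_in_blocks_of_4(x, y_sleef, erfc4_standin, [](double v) { return std::erfc(v); })` fills every element for any vector length. This addresses correctness of the loop only; it is not expected to change the speed of the erfc kernel itself.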

shibatch commented 4 years ago

How about using avx2 instead of avx? Sleef is particularly slow if FMA is not available. Use the dispatcher if you are not sure. It’s not as slow as people may think.

shibatch commented 4 years ago

Oh, you are the author of walkingrandomly.com. Thank you for introducing sleef at your site. 😃

mikecroucher commented 4 years ago

You are very welcome. That article is pretty old; I should write an updated version. Using the dispatcher amounts to calling Sleef_erfcd4_u15, right? If so, that doesn't help on my machine.

I've just tested the Intel SVML implementation of erfc, and it is faster than the system one and therefore also faster than Sleef.