vetter / shoc

The SHOC Benchmark Suite

MIC MaxFlops Doesn't Match Peak Performance #37

Open jyoung3131 opened 10 years ago

jyoung3131 commented 10 years ago

The MIC version of MaxFlops reports a value that seems to be too high by a factor of 10.

    Running benchmark MaxFlops
    result for maxspflops: 18065.5000 GFLOPS
    result for maxdpflops: 13108.8000 GFLOPS

According to Reza's blog, the maximum should be close to the theoretical peaks of 2112 GFLOPS (SP) and 1056 GFLOPS (DP): https://software.intel.com/en-us/articles/intel-xeon-phi-core-micro-architecture

This could possibly just be a reporting error.

kspaff commented 10 years ago

Hi Jeff, besides checking the reporting code, the next thing to do is to make sure the compiler is not optimizing away any of the operations. We have run into overestimated MaxFlops results because of this before.

jyoung3131 commented 9 years ago

This does indeed seem to be due to optimization of operations. MAdd8 is close but not over the theoretical max (2K GFLOPS), but all the Add operations seem to be affected by removal of ops. The driver is picking up these inaccurate results as the max for the device.

I tested with the -O1 and -O2 flags; the overestimation occurs when -O2 or -O3 is used. Unfortunately, without -O2 the code is far too slow (possibly because other vector-related optimizations for the Phi are lost as well). I am currently testing some tricks to keep the compiler from optimizing out operations.

5110P results for Phi (GFLOPS):
Add1-DP: 53.6607
Add1-SP: 60.8723
Add2-DP: 12542.6
Add2-SP: 16056.9
Add4-DP: 12516.6
Add4-SP: 17495.2
Add8-DP: 12820.2
Add8-SP: 18006.6

mitchel-d-horton commented 9 years ago

I've got code from a book that is supposed to get over 2 TFLOPS on the Xeon Phi in single precision.

Are we locked into getting the max flops using a particular maxflops algorithm?

I've got the entire code, but the meat of it looks like this:

    #pragma omp parallel for private(j,k)
    for (i=0; i<numthreads; i++) {
        int offset = i*LOOP_COUNT;
        for (j=0; j<MAXFLOPS_ITERS; j++) {
            for (k=0; k<LOOP_COUNT; k++)
                fa[k+offset] = a*fa[k+offset] + fb[k+offset];
        }
    }

This is what the author reports on 61 cores:

    % export OMP_NUM_THREADS=122
    % export KMP_AFFINITY=scatter
    % ./helloflops3
    GFlops = 3123.200, Secs = 1.530, GFlops per sec = 2041.090

Mitch
