gleisonsdm / DawnCC-Compiler

A source-to-source compiler for automatic parallelization of C programs through code annotation.
http://cuda.dcc.ufmg.br/dawn/

Recreating Polybench #20

Open daniel7558 opened 5 years ago

daniel7558 commented 5 years ago

Hello! I am trying to recreate the Polybench results from the "DawnCC: Automatic Annotation for Data Parallelism and Offloading" paper. I got DawnCC up and running, but the annotations it creates are not the same as the ones from the benchmarks.zip. It doesn't add the independent clause to the OpenACC pragmas, which results in a massive slowdown compared to CPU-only execution (for example, 2DConv CPU: 0.14s; GPU: 11.79s with pgcc 18.10). When I add the independent clause manually, it works perfectly. As far as I can tell, pgcc falls back to #pragma acc loop seq, which causes this slowdown.

Has anything changed in DawnCC since the benchmark code was annotated which could result in this behaviour? Or maybe I am using it incorrectly? Thanks for any advice.

Details about what I did:

Let's use 2DConv (but the same problem exists with the other benchmarks as well):

I have removed the annotations from the code, which results in this:

void GPU__conv2D(DATA_TYPE *A, DATA_TYPE *B) {
  int i, j;
  DATA_TYPE c11, c12, c13, c21, c22, c23, c31, c32, c33;

  c11 = +0.2;
  c21 = +0.5;
  c31 = -0.8;
  c12 = -0.3;
  c22 = +0.6;
  c32 = -0.9;
  c13 = +0.4;
  c23 = +0.7;
  c33 = +0.10;

  for (i = 1; i < NI - 1; ++i) {
    for (j = 1; j < NJ - 1; ++j) {
      B[i * NJ + j] =
          c11 * A[(i - 1) * NJ + (j - 1)] + c12 * A[(i + 0) * NJ + (j - 1)] +
          c13 * A[(i + 1) * NJ + (j - 1)] + c21 * A[(i - 1) * NJ + (j + 0)] +
          c22 * A[(i + 0) * NJ + (j + 0)] + c23 * A[(i + 1) * NJ + (j + 0)] +
          c31 * A[(i - 1) * NJ + (j + 1)] + c32 * A[(i + 0) * NJ + (j + 1)] +
          c33 * A[(i + 1) * NJ + (j + 1)];
    }
  }
}

I now invoke dawncc on it with:

bash run.sh -d /home/daniel/dawncc -src /home/daniel/source/ -ps 0 -mc true -k false -pl false -G true -pd true -ma true -cc true

I would assume that I have to use -pl true, but setting it to true results in no annotations at all. With the above command, DawnCC creates these annotations:

void GPU__conv2D(DATA_TYPE *A, DATA_TYPE *B) {
  int i, j;
  DATA_TYPE c11, c12, c13, c21, c22, c23, c31, c32, c33;

  c11 = +0.2;
  c21 = +0.5;
  c31 = -0.8;
  c12 = -0.3;
  c22 = +0.6;
  c32 = -0.9;
  c13 = +0.4;
  c23 = +0.7;
  c33 = +0.10;

  char RST_AI1 = 0;
  RST_AI1 |= !(((void*) (A + 0) > (void*) (B + 67092478))
  || ((void*) (B + 8193) > (void*) (A + 67108864)));
  #pragma acc data pcopyin(A[0:67108864]) pcopy(B[8193:67092478]) if(!RST_AI1)
  {
  #pragma acc kernels if(!RST_AI1)
  for (i = 1; i < NI - 1; ++i) {
    for (j = 1; j < NJ - 1; ++j) {
      B[i * NJ + j] =
          c11 * A[(i - 1) * NJ + (j - 1)] + c12 * A[(i + 0) * NJ + (j - 1)] +
          c13 * A[(i + 1) * NJ + (j - 1)] + c21 * A[(i - 1) * NJ + (j + 0)] +
          c22 * A[(i + 0) * NJ + (j + 0)] + c23 * A[(i + 1) * NJ + (j + 0)] +
          c31 * A[(i - 1) * NJ + (j + 1)] + c32 * A[(i + 0) * NJ + (j + 1)] +
          c33 * A[(i + 1) * NJ + (j + 1)];
    }
  }
}
}

When I compile this with pgcc and execute it, I get the following:

$ pgcc -ta=tesla -Minfo 2DConvolution_AI.c -o 2DConvolution
rtclock:
     22, include "polybenchUtilFuncts.h"
          18, FMA (fused multiply-add) instruction(s) generated
conv2D:
     54, FMA (fused multiply-add) instruction(s) generated
GPU__conv2D:
    183, Generating copyin(A[AI1:AI1])
         Generating copy(B[AI1:AI1])
    185, Generating implicit copyin(A[AI1:AI1])
         Generating implicit copy(B[AI1:AI1])
    186, Complex loop carried dependence of A-> prevents parallelization
         Loop carried dependence of B-> prevents parallelization
         Loop carried backward dependence of B-> prevents vectorization
         Complex loop carried dependence of B-> prevents parallelization
         Accelerator serial kernel generated
         Accelerator kernel generated
         Generating Tesla code
        186, #pragma acc loop seq
        187, #pragma acc loop seq
    186, Complex loop carried dependence of A-> prevents parallelization
         Loop carried dependence of B-> prevents parallelization
         Loop carried backward dependence of B-> prevents vectorization
    187, Complex loop carried dependence of A->,B-> prevents parallelization
         FMA (fused multiply-add) instruction(s) generated

$ ./2DConvolution
>> Two dimensional (2D) convolution <<
GPU Runtime: 11.791395s
CPU Runtime: 0.141337s
Non-Matching CPU-GPU Outputs Beyond Error Threshold of 0.05 Percent: 0

Apparently, pgcc cannot prove the loop iterations independent (it reports loop-carried dependences on A and B), so it generates a sequential kernel, and running the resulting executable shows a significant slowdown.

In comparison, the code in polybench/auto_acc/2DCONV/2DConvolution.c contains these annotations:

void GPU__conv2D(DATA_TYPE* A, DATA_TYPE* B)
{
  int i, j;
  DATA_TYPE c11, c12, c13, c21, c22, c23, c31, c32, c33;

  c11 = +0.2;  c21 = +0.5;  c31 = -0.8;
  c12 = -0.3;  c22 = +0.6;  c32 = -0.9;
  c13 = +0.4;  c23 = +0.7;  c33 = +0.10;

  char RST_AI1 = 0;
  RST_AI1 |= !((A + 0 > B + 67092478)
  || (B + 8193 > A + 67108864));
  #pragma acc data pcopyin(A[0:67108864]) pcopyout(B[8193:67092478]) if(!RST_AI1)
  #pragma acc kernels if(!RST_AI1)
  #pragma acc loop independent
  for (i = 1; i < NI - 1; ++i)
    {
      for (j = 1; j < NJ - 1; ++j)
    {
      B[i*NJ + j] = c11 * A[(i - 1)*NJ + (j - 1)]  +  c12 * A[(i + 0)*NJ + (j - 1)]  +  c13 * A[(i + 1)*NJ + (j - 1)]
        + c21 * A[(i - 1)*NJ + (j + 0)]  +  c22 * A[(i + 0)*NJ + (j + 0)]  +  c23 * A[(i + 1)*NJ + (j + 0)] 
        + c31 * A[(i - 1)*NJ + (j + 1)]  +  c32 * A[(i + 0)*NJ + (j + 1)]  +  c33 * A[(i + 1)*NJ + (j + 1)];
    }
    }
}

which result in a much better execution time:

$ pgcc -ta=tesla -Minfo 2DConvolution.c -o 2DConvolution
rtclock:
     21, include "polybenchUtilFuncts.h"
          18, FMA (fused multiply-add) instruction(s) generated
conv2D:
     46, FMA (fused multiply-add) instruction(s) generated
GPU__conv2D:
     67, Generating copyin(A[:67108864])
         Generating copyout(B[8193:67092478])
     68, Generating implicit copyin(A[:67108864])
         Generating implicit copyout(B[8193:67092478])
     70, Loop is parallelizable
     72, Complex loop carried dependence of A->,B-> prevents parallelization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
         Generating Tesla code
         70, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
         72, #pragma acc loop seq
     72, Complex loop carried dependence of A->,B-> prevents parallelization
         FMA (fused multiply-add) instruction(s) generated

$ ./2DConvolution
>> Two dimensional (2D) convolution <<
GPU Runtime: 2.857586s
CPU Runtime: 0.306916s
Non-Matching CPU-GPU Outputs Beyond Error Threshold of 0.05 Percent: 0

The paper used the -O3 optimization level for pgcc; I reran the tests with -O3, but the results are about the same. Again, as far as I can tell, the problem is the missing independent clause, which the version in benchmarks.zip has and my generated code doesn't.

My first guess was that I had used DawnCC incorrectly, so I used the online tool to double-check. When I copy the function code into DawnCC's online tool (replacing NI and NJ with actual numbers), I get the same result that my local DawnCC produced. (My local DawnCC created an extra {} for the data region, which the online tool didn't; conceptually they are the same.)

I can solve the runtime issue when I replace the #pragma acc kernels with #pragma acc loop independent, which then matches the distributed polybench code.
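
For reference, the manually fixed loop looks roughly like this (a sketch following the distributed polybench pattern, reusing the copy bounds DawnCC generated above):

  char RST_AI1 = 0;
  RST_AI1 |= !(((void*) (A + 0) > (void*) (B + 67092478))
  || ((void*) (B + 8193) > (void*) (A + 67108864)));
  #pragma acc data pcopyin(A[0:67108864]) pcopy(B[8193:67092478]) if(!RST_AI1)
  {
  #pragma acc kernels if(!RST_AI1)
  #pragma acc loop independent
  for (i = 1; i < NI - 1; ++i) {
    for (j = 1; j < NJ - 1; ++j) {
      /* same convolution body as in the listings above */
    }
  }
  }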

I also tried the saxpy code from the paper in the online tool, but it doesn't generate the independent clause either. The sample code from the tutorial page works flawlessly.

Tested with PGI v18.10, AWS EC2 p2.xlarge (Nvidia K80)

Thanks, Daniel

pronesto commented 5 years ago

Hi Daniel,

I don't think DawnCC inserts the loop independent pragma into the loops anymore. We would have to ask Péricles to confirm, but, as far as I recall, we removed those annotations from DawnCC: the programmer must add that line to the loop herself (that should be the only line needed, though). So, at this point, the programmer must certify that a loop is parallel by inserting the "independent" pragma.

DawnCC's goal is to insert the copy annotations and pointer disambiguation routines. It should be easy to change DawnCC to put the loop independent annotation back, if you need the tool to be fully automatic.
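
Concretely, in the 2DConv listing above, the line I mean would go right before the outer loop, for example:

  #pragma acc kernels if(!RST_AI1)
  #pragma acc loop independent  /* the one line the programmer adds, after checking the loop is parallel */
  for (i = 1; i < NI - 1; ++i) {
    /* inner loop and body unchanged */
  }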

Regards,

Fernando

ruixueqingyang commented 5 years ago

Hello Gleison,

I have a similar problem to Daniel's. I am trying to transform the GEMM benchmark from Polybench, with the following modification in "run.sh". The function "init(..)" is parallelized; however, the functions "gemm" and "GPU_gemm" are not changed. By the way, I compiled DawnCC using "build.sh". Can you give me some suggestions for my situation?

CURRENT_DIR=pwd
DEFAULT_ROOT_DIR=pwd
KEEP_INTERMEDIARY_FILES_BOOL="false"
GPUONLY_BOOL="false"
PARALELLIZE_LOOPS_BOOL="true"
PRAGMA_STANDARD_INT=1
POINTER_DESAMBIGUATION_BOOL="true"
MEMORY_COALESCING_BOOL="true"
MINIMIZE_ALIASING_BOOL="true"
CODE_CHANGE_BOOL="true"
FILES_FOLDER=""
FILE=""

Thanks, ruixueqingyang

gleisonsdm commented 5 years ago

Hi,

I don't think DawnCC inserts the loop independent pragma into the loops anymore. Our parallelism analysis is very naive, as the goal is to insert copy annotations and pointer disambiguation routines. My suggestion is to avoid using our parallelism analysis. Can you please try the following options:

CURRENT_DIR=pwd
DEFAULT_ROOT_DIR=pwd
KEEP_INTERMEDIARY_FILES_BOOL="false"
GPUONLY_BOOL="false"
PARALELLIZE_LOOPS_BOOL="false"
PRAGMA_STANDARD_INT=1
POINTER_DESAMBIGUATION_BOOL="true"
MEMORY_COALESCING_BOOL="false"
MINIMIZE_ALIASING_BOOL="true"
CODE_CHANGE_BOOL="true"
FILES_FOLDER=""
FILE=""

Let me know the results (it should just insert copy annotations and pointer disambiguation).

Cheers,

Gleison