Open daniel7558 opened 5 years ago
Hi Daniel,
I don't think DawnCC will insert the loop independent pragma into the loops anymore. We would have to ask Péricles to confirm, but, as far as I can recall, we removed those annotations from DawnCC: the programmer must add this line to the loop herself (that should be the only line needed, though). So, at this point, the programmer must certify that a loop is parallel by inserting the "independent" pragma manually.
DawnCC's goal is to insert the copy annotation and pointer disambiguation routines. It should be easy to change DawnCC to put the loop independent annotation back there, if you need the tool to be fully automatic.
Regards,
Fernando
Hello Gleison,
I have a similar problem to Daniel's. I am trying to run the GEMM benchmark from Polybench through DawnCC, with the following modification in "run.sh". The function "init(..)" is parallelized; however, the functions "gemm" and "GPU_gemm" are not changed. By the way, I compiled DawnCC using "build.sh". Can you give some suggestions about my situation?
```sh
CURRENT_DIR=`pwd`
DEFAULT_ROOT_DIR=`pwd`
KEEP_INTERMEDIARY_FILES_BOOL="false"
GPUONLY_BOOL="false"
PARALELLIZE_LOOPS_BOOL="true"
PRAGMA_STANDARD_INT=1
POINTER_DESAMBIGUATION_BOOL="true"
MEMORY_COALESCING_BOOL="true"
MINIMIZE_ALIASING_BOOL="true"
CODE_CHANGE_BOOL="true"
FILES_FOLDER=""
FILE=""
```
Thanks, ruixueqingyang
Hi,
I don't think DawnCC will insert the loop independent pragma into the loops anymore. Our parallelism analysis is very naive, as the goal is to insert copy annotations and pointer disambiguation routines. My suggestion is to avoid using our parallelism analysis. Can you please try the following options:
```sh
CURRENT_DIR=`pwd`
DEFAULT_ROOT_DIR=`pwd`
KEEP_INTERMEDIARY_FILES_BOOL="false"
GPUONLY_BOOL="false"
PARALELLIZE_LOOPS_BOOL="false"
PRAGMA_STANDARD_INT=1
POINTER_DESAMBIGUATION_BOOL="true"
MEMORY_COALESCING_BOOL="false"
MINIMIZE_ALIASING_BOOL="true"
CODE_CHANGE_BOOL="true"
FILES_FOLDER=""
FILE=""
```
Let me know the results (it should insert just copy annotations and pointer disambiguation).
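Assuming run.sh maps these variables onto the same command-line flags Daniel uses later in the thread (`-pl` for PARALELLIZE_LOOPS_BOOL, `-mc` for MEMORY_COALESCING_BOOL, and so on; that mapping is my assumption), the equivalent invocation would be roughly:

```shell
# Hypothetical flag mapping, inferred from Daniel's run.sh invocation
# below; paths are placeholders for your own checkout and sources.
bash run.sh -d /path/to/dawncc -src /path/to/source/ \
    -ps 1 -mc false -k false -pl false -G false -pd true -ma true -cc true
```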
Cheers,
Gleison
Hello! I am trying to recreate the Polybench results from the "DawnCC: Automatic Annotation for Data Parallelism and Offloading" paper. I got DawnCC up and running, but the annotations it creates are not the same as the ones from benchmarks.zip. It doesn't add the independent clause to the OpenACC pragmas, which results in a massive slowdown compared to CPU-only execution (for example, 2DConv CPU: 0.14 s; GPU: 11.79 s with pgcc 18.10). When I add the independent clause manually, it works perfectly. As far as I can tell, pgcc falls back to #pragma acc loop seq, which causes this slowdown.
Has anything changed in DawnCC since the benchmark code was annotated which could result in this behaviour? Or maybe I am using it incorrectly? Thanks for any advice.
Details about what I did:
Let's use 2DConv (but the same problem exists with the other benchmarks as well):
I have removed the annotations from the code, which results in this:
I now invoke dawncc on it with:
```sh
bash run.sh -d /home/daniel/dawncc -src /home/daniel/source/ -ps 0 -mc true -k false -pl false -G true -pd true -ma true -cc true
```
I would assume that I have to use `-pl true`, but setting it to `true` results in no annotations at all. For the above command, it creates these annotations:
When I compile this with pgcc and execute it, I get the following:
Apparently, pgcc creates a sequential loop, and when I run the resulting executable, it shows a significant slowdown.
In comparison, the code in `polybench/auto_acc/2DCONV/2DConvolution.c` contains these annotations, which result in a much better execution time:
The paper used the -O3 optimization level for pgcc; I reran the tests with -O3, but the results are about the same. Again, as far as I can tell, the problem is the missing independent clause, which the version in benchmarks.zip has and my code doesn't.
My first guess was that I had used DawnCC incorrectly, so I used the online tool to double-check. When I copy the function code into DawnCC's online tool (replacing NI and NJ with actual numbers), I get the same result as my local DawnCC. (My local DawnCC created an extra {} for the data region, which the online tool didn't; conceptually they are the same.)
I can solve the runtime issue when I replace the `#pragma acc kernels` with `#pragma acc loop independent`, which then matches the distributed Polybench code.
I also tried the saxpy code from the paper with the online tool, but it doesn't generate the independent clause either. The sample code from the tutorial page works flawlessly.
Tested with PGI v18.10 on an AWS EC2 p2.xlarge (NVIDIA K80).
Thanks, Daniel