arporter opened 2 years ago
For future reference, this behaviour is laid out in the OpenACC specification:
"Two asynchronous operations on the same device with the same async-value will be enqueued onto the same activity queue, and therefore will be executed on the device in the order they are encountered by the local thread."
One area where this may be used to improve performance is a single kernels region directly enclosed in a do loop, as in the example below.
do ...
!$acc kernels async(1)
...
!$acc end kernels
end do
!$acc wait
In this scenario, no dependency issues (such as overlap with non-asynchronous kernels) can arise, so adding asynchronous launches to do loops like this could safely be automated. Further, an initial look at NEMO (in the ORCA1 configuration, without sea ice) suggests that many of the most frequently launched kernels fall into this category.
So I spent the day assisting Jony with the practical sessions at the GPU school and it was all on OpenACC. At some point I suggested to Mat that he could use the async clause just as we are thinking of using it here. And then I realised: isn't the work required to do this (i.e. deciding where to place the wait directive) the same algorithm we developed to place the update directives, namely as late as possible and otherwise just at the end of the procedure?
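For illustration, here is a minimal sketch of the "as late as possible" placement (the routine and variable names are invented for the example):

subroutine do_step(a, b, n)
  integer, intent(in) :: n
  real, intent(inout) :: a(n), b(n)
  integer :: i

  !$acc kernels async(1)
  do i = 1, n
    a(i) = a(i) + 1.0
  end do
  !$acc end kernels

  ! ... host-only work that does not touch 'a' can overlap with the kernel here ...

  !$acc wait(1)   ! deferred to just before 'a' is read on the host
  b(:) = a(:)
end subroutine do_step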
Yes, and the same as where we place halo exchanges, and as checking whether an array is used again after a loop (including possibly above its own location, if that loop is inside another loop; see the sketch below). We need some kind of variable-lifetime/dataflow analysis, which we are generalising into new methods on the Reference node; @LonelyCat124 started it.
@LonelyCat124 Can you check whether you can use the new methods for the update directive, and whether they contain all the necessary logic?
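As a concrete illustration of the "used again, possibly above its own location" case (a made-up example, not taken from NEMO):

subroutine time_step(a, b, n, nsteps)
  integer, intent(in) :: n, nsteps
  real, intent(inout) :: a(n), b(n)
  integer :: i, istep

  do istep = 1, nsteps
    b(:) = a(:)        ! host read of 'a': textually above the kernel, but it
                       ! follows it on the next trip round the outer loop
    !$acc kernels async(1)
    do i = 1, n
      a(i) = a(i) + b(i)
    end do
    !$acc end kernels
    !$acc wait(1)      ! must complete before the next iteration reads 'a'
  end do
end subroutine time_step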
Alex and Victoria have shown that there are performance benefits to be had by using asynchronous kernel launches, even if the kernels themselves have to be run in order. This is also borne out by this Stack Overflow post: https://stackoverflow.com/questions/54355791/reducing-time-to-launch-kernels-in-time-stepping-loop-openacc.
An example of code using async is:
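(What follows is a minimal sketch, assuming the same pattern as the loop above, with every kernels region sent to queue 1.)

do istep = 1, nsteps
!$acc kernels async(1)
! ... computational kernels ...
!$acc end kernels
end do
!$acc wait(1)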
where the 1 here means dispatch all kernels to that command queue. Presumably, OpenMP offload offers the same functionality?
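A rough OpenMP-offload analogue (a sketch, not tested) would use nowait on each target region, with depend clauses to keep successive regions in order, and a taskwait at the end; the routine and variable names are invented for illustration:

subroutine step_all(a, n, nsteps)
  integer, intent(in) :: n, nsteps
  real, intent(inout) :: a(n)
  integer :: i, istep

  !$omp target enter data map(to: a)
  do istep = 1, nsteps
    ! nowait turns each offload into a deferred target task; the depend
    ! clause serialises successive tasks, much like one OpenACC async queue
    !$omp target teams distribute parallel do nowait depend(inout: a)
    do i = 1, n
      a(i) = a(i) + 1.0
    end do
  end do
  !$omp taskwait
  !$omp target exit data map(from: a)
end subroutine step_all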