arporter opened 2 years ago
For future reference, this behaviour is laid out in the OpenACC specification:
"Two asynchronous operations on the same device with the same async-value will be enqueued onto the same activity queue, and therefore will be executed on the device in the order they are encountered by the local thread."
One area where this may be used to improve performance is a single kernels region directly enclosed in a do loop, as in the example below.
do ...
!$acc kernels async(1)
...
!$acc end kernels
end do
!$acc wait
In this scenario, no dependency issues (such as overlap with non-asynchronous kernels) can arise, so adding asynchronous launches to do loops like this could safely be automated. Further, an initial look at NEMO (in the ORCA1 configuration, without sea ice) suggests that many of the most frequently launched kernels fall into this category.
So I spent the day assisting Jony with the practical sessions at the GPU school and it was all on OpenACC. At some point I suggested to Mat that he could use the async clause just as we are thinking of using it here. And then I realised: isn't the work required to do this (i.e. deciding where to place the wait directive) the same algorithm we developed to place the update directives, namely as late as possible and otherwise just at the end of the procedure?
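For illustration, here is a minimal sketch of the "as late as possible" placement (the routine and variable names are invented for the example):

subroutine do_step(a, b, n)
  integer, intent(in) :: n
  real, intent(inout) :: a(n), b(n)
  integer :: i

  !$acc kernels async(1)
  do i = 1, n
    a(i) = a(i) + 1.0
  end do
  !$acc end kernels

  ! ... host-only work that does not touch 'a' can overlap with the kernel here ...

  !$acc wait(1)   ! deferred to just before 'a' is read on the host
  b(:) = a(:)
end subroutine do_step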
Yes, and the same as where we place halo exchanges, and as checking whether an array is used again after a loop (including possibly above its own location, if that loop is inside another loop; see the sketch below). We need some kind of variable-lifetime/dataflow analysis, which we are generalising into new methods on the Reference node; @LonelyCat124 started it.
@LonelyCat124 Can you check whether you can use the new methods for the update directive, and whether they contain all the necessary logic?
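As a concrete illustration of the "used again, possibly above its own location" case (a made-up example, not taken from NEMO):

subroutine time_step(a, b, n, nsteps)
  integer, intent(in) :: n, nsteps
  real, intent(inout) :: a(n), b(n)
  integer :: i, istep

  do istep = 1, nsteps
    b(:) = a(:)        ! host read of 'a': textually above the kernel, but it
                       ! follows it on the next trip round the outer loop
    !$acc kernels async(1)
    do i = 1, n
      a(i) = a(i) + b(i)
    end do
    !$acc end kernels
    !$acc wait(1)      ! must complete before the next iteration reads 'a'
  end do
end subroutine time_step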
Alex and Victoria have shown that there are performance benefits to be had by using asynchronous kernel launches, even if the kernels themselves have to be run in order. This is also borne out by this Stack Overflow post: https://stackoverflow.com/questions/54355791/reducing-time-to-launch-kernels-in-time-stepping-loop-openacc.
An example of code using async is:
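(What follows is a minimal sketch, assuming the same pattern as the loop above, with every kernels region sent to queue 1.)

do istep = 1, nsteps
!$acc kernels async(1)
! ... computational kernels ...
!$acc end kernels
end do
!$acc wait(1)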
where the 1 here means dispatch all kernels to that command queue. Presumably, OpenMP offload offers the same functionality?
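A rough OpenMP-offload analogue (a sketch, not tested) would use nowait on each target region, with depend clauses to keep successive regions in order, and a taskwait at the end; the routine and variable names are invented for illustration:

subroutine step_all(a, n, nsteps)
  integer, intent(in) :: n, nsteps
  real, intent(inout) :: a(n)
  integer :: i, istep

  !$omp target enter data map(to: a)
  do istep = 1, nsteps
    ! nowait turns each offload into a deferred target task; the depend
    ! clause serialises successive tasks, much like one OpenACC async queue
    !$omp target teams distribute parallel do nowait depend(inout: a)
    do i = 1, n
      a(i) = a(i) + 1.0
    end do
  end do
  !$omp taskwait
  !$omp target exit data map(from: a)
end subroutine step_all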