stfc / PSycloneBench

Various benchmarks used to inform PSyclone optimisations
BSD 3-Clause "New" or "Revised" License
6 stars 5 forks source link

Psykal OpenMP taskloop implementation of NemoLite2D #72

Closed LonelyCat124 closed 2 years ago

LonelyCat124 commented 3 years ago

I have implemented a script using the OMPTaskloopTrans and OMPTaskwaitDirective.

I think this correctly captures all of the dependencies, while minimising the number of statements required to do this. This is done by lines 31-62 of the code, and lines 68-70.

What this does, is use forward_dependence to find the next dependency for each loop, and find the index of this dependency in the schedule. It then loops through these dependencies. If there are other dependencies between X and its dependent which are covered by the creation of such a dependence, it is deleted (line 61-62). If another dependence will be created sooner, this dependence is removed.

I'm sure there are some cases which this doesn't entirely minimise, so if anyone can think of those examples (or ways to implement tests of these?). I think this still handles a case such as:

[ 5, 4, 3, None, None, None, None, ...]

which would turn into:

[ None, 4, 3, None, None, None, None, ...]
[ None, None, 3, None, None, None, ...]

This requires https://github.com/stfc/PSyclone/pull/1368 to work right now.

The dependence analysis here is a prototype that will be used to build the OMPTaskwaitTransformation at a later stage. I am going to add comments to explain the process now too.

LonelyCat124 commented 3 years ago

I have an implementation now supporting distributed memory as well which I'll push once the VPN is working for me. I've not tested it with an up-to-date compiler, I will attempt on glados also once the VPN is working for me again.

LonelyCat124 commented 3 years ago

Ok, this builds on Scafellpike with OpenMPI4 & gcc 9 so I'm happy the code produced is valid. I'll check correctness next.

LonelyCat124 commented 3 years ago

Non-mpi version:

ua checksum =   0.52670513E+02
va checksum =   0.57100321E+03

psykal_omp manual version:

ua checksum =   0.52670513E+02
va checksum =   0.57100321E+03

MPI task version:

ua checksum =   0.52670513E+02
va checksum =   0.57100321E+03

Looks like its correct! Think this is ready for review.

LonelyCat124 commented 2 years ago

Ok, I think this now works using PSyclone master branch. My only issue is that (on my laptop) the MPI build fails with an ICE in mpifort. I will test it on glados when my VPN is behaving properly.

@arporter @sergisiso do either of you have time to have a quick look over this and check the script seems appropriate? The non-MPI version is pretty simple, the MPI one has to do a lot of checking and I don't know if there's a better way to do it.

arporter commented 2 years ago

Hi @LonelyCat124, I've had a look and I think it's OK. I think it's the same problem we tackle in the OpenACC script where we want to find the biggest region that we can put inside an ACC Kernels block, see https://github.com/stfc/PSyclone/blob/b1d9c0107057af563a782aeb8f8becbe9906627d/examples/nemo/scripts/kernels_trans.py#L456-L499 I'm not sure what we have there is much better than what you have though.

We should probably provide some sort of generalised support for implementing this pattern. A user could provide a function that determines whether it's OK to include a given node in the region and get back a bunch of regions.

LonelyCat124 commented 2 years ago

Yeah it looks pretty similar.

Shall I make an issue in PSyclone for that functionality?

arporter commented 2 years ago

Yes please!

arporter commented 2 years ago

Hi @LonelyCat124, presumably this now needs a full review? (Well, as 'full' as we do for PSycloneBench.)

LonelyCat124 commented 2 years ago

I think its ready now yes! (I'll try to double check a few things if I can tomorrow)

LonelyCat124 commented 2 years ago

@arporter I think I fixed the review changes, but I've either broken this Makefile or I've broken my PSyclone installation. Can you double check this still builds for you?

LonelyCat124 commented 2 years ago

I think I fixed the pylint issue, and had to add an extra declaration of idx so it was visible outside the loop.