Open sergisiso opened 1 month ago
Note that previous options are not mutually exclusive, (1) could be an option that calls (2) as a meta-transformation, and (3) can be some logic inside the ParallelLoop apply that also calls (2) when needed. So maybe implementing (2) -> (1) -> (3) is an iterative path to the solution.
I think that making arrays thread-private on a GPU which has a lot of threads often causes problems. The solution that has worked in WW3 and which Simon is using in NEMO5 is to increase the dimensionality of the array. Simon's solution is to do this for all arrays beginning with z
as, in NEMO, that means they are temporaries. He has PSyclone code we could take a look at.
Thanks @arporter I forgot about this and they are almost equivalent (maybe the difference is that one uses the heap and the privatisation probably the stack)
One more note about notation:
ECMWF has hit the same issues in NEMO5. From a conversation we had, in zdftke.f90
there is:
! PSyclone: Loop cannot be parallelised because the dependency analysis reported:
! Error: The write access to 'zmxlm' in 'zmxlm(idx_2,idx_1)' causes a write-write race condition. Variable: 'zmxlm'.
! Error: The write access to 'zmxld' in 'zmxld(idx_4,idx_3)' causes a write-write race condition. Variable: 'zmxld'.
! Consider using the "ignore_dependencies_for" transformation option if this is a false dependency.
do jj = ntsj - 0, ntej + 0, 1
The reason for it is that in NEMO5 these temporary variables have been reduced to 2D:
REAL(wp), DIMENSION(ntsi-(0):ntei+(0),jpk) :: zmxlm, zmxld
and therefore they are being re-used for each iteration of the jj loop and can not be used in parallel. There is 2 solutions for this:
REAL(wp), DIMENSION(ntsj:ntej, ntsi-(0):ntei+(0), jpk) :: zmxlm, zmxld
and then use them indexed by jj e.g.:
zmxlm(jj,:,:) = rmxl_min
so that the dependency disappears.To identify which arrays to privatise/localise/increase_dimensionality, we need to also check that they are not "read" after the loop (in addition to being write-write inside the loop). For this, the .next_access
property would be useful. @LonelyCat124 is there a separate issue to track this functionality (or was this just discussed inside the scalarisation and async issues)?
I think it was just discussed in those issues, we should probably have a standalone issue
Note that there is also threadprivate: https://www.openmp.org/spec-html/5.0/openmpsu104.html#x137-5470002.19.2
To summarise:
Which one is the best is context/compiler/hardware specific. Ideally, we want the psyclone-script writer to choose what they want to do.
Again, the increase_dims and private are not mutually exclusive because if the dimension is increased and indexed by the outer loop index there will be no write-write dependency (but use more memory). If the dimension is increased by a heap allocted "get_num_threads" we need to tell parallelisation transformation to ignore dependencies in that symbol. But importantly they don't trigger the privatisation anymore.
Currently we always consider arrays shared on parallel loops, and never privatise them. For example in:
We could privatise ztmp in the outer loop. We don't but this is not a problem because the dependency analysis also correctly reports:
The write access to 'ztmp' in 'ztmp(jk)' causes a write-write race condition.
So we skip the outer loop and only parallelise the inner one.However, it would be more efficient to parallelise the outer loop and mark ztmp as private.
There are three solutions to explore:
Mark them as private manually in the transformation (this this needs a new attribute to carry the information until the backend):
OMPLoopTrans.apply(loop, options={"private": ["ztmp"]})
Mark them as private manually in a different preceding transformation. Since PSyIR has nested scopes we could move it to the loop_body scope (after validating that it is not used outside). Then infer_sharing_attributes could use this case to assume private (kind of how it works in C). This could have uses outside loop parallelisation.
Or let the validate accept this kind of dependency and make the infer_sharing_attributes identify and mark them. This would be better because it won't require manually adding them one by one, but I am afraid of false positives. We should also look at compiler literature for this, unfortunately the dependency analysis book that we have been following only have scalar privatisation.
@arporter @hiker @LonelyCat124 Any ideas for this? This seems a pattern that will be important for NEMO5