Support Parallel Array Privatisation

sergisiso commented 1 month ago

Currently we always consider arrays shared on parallel loops, and never privatise them. For example in:

real(kind=wp), dimension(nlay_i) :: ztmp
do ji = 1, npti, 1
  ...
  do jk = 1, nlay_i, 1
     ztmp(jk) = ...
  enddo
  ... uses ztmp ...
end do

We could privatise ztmp in the outer loop. We don't but this is not a problem because the dependency analysis also correctly reports: The write access to 'ztmp' in 'ztmp(jk)' causes a write-write race condition. So we skip the outer loop and only parallelise the inner one.

However, it would be more efficient to parallelise the outer loop and mark ztmp as private.

There are three solutions to explore:

Mark them as private manually in the transformation (this this needs a new attribute to carry the information until the backend): OMPLoopTrans.apply(loop, options={"private": ["ztmp"]})
Mark them as private manually in a different preceding transformation. Since PSyIR has nested scopes we could move it to the loop_body scope (after validating that it is not used outside). Then infer_sharing_attributes could use this case to assume private (kind of how it works in C). This could have uses outside loop parallelisation.
Or let the validate accept this kind of dependency and make the infer_sharing_attributes identify and mark them. This would be better because it won't require manually adding them one by one, but I am afraid of false positives. We should also look at compiler literature for this, unfortunately the dependency analysis book that we have been following only have scalar privatisation.

@arporter @hiker @LonelyCat124 Any ideas for this? This seems a pattern that will be important for NEMO5

sergisiso commented 1 month ago

Note that previous options are not mutually exclusive, (1) could be an option that calls (2) as a meta-transformation, and (3) can be some logic inside the ParallelLoop apply that also calls (2) when needed. So maybe implementing (2) -> (1) -> (3) is an iterative path to the solution.

arporter commented 1 month ago

I think that making arrays thread-private on a GPU which has a lot of threads often causes problems. The solution that has worked in WW3 and which Simon is using in NEMO5 is to increase the dimensionality of the array. Simon's solution is to do this for all arrays beginning with z as, in NEMO, that means they are temporaries. He has PSyclone code we could take a look at.

sergisiso commented 1 month ago

Thanks @arporter I forgot about this and they are almost equivalent (maybe the difference is that one uses the heap and the privatisation probably the stack)

One more note about notation:

In the literature "Privatisation" is also used to mean when the loop writes and reads from a single element of the array, this can be converted to a private scalar. We decided to call this Scalarisation and is being implemented at #2499
I used "privatisation" in the issue title because in OpenMP and OpenACC you do it with the "private" clause, but the "do concurrent" talks about locality and has the "LOCAL (variable-name-list)" syntax, so we could also use this nomenclature

sergisiso commented 1 month ago

ECMWF has hit the same issues in NEMO5. From a conversation we had, in zdftke.f90 there is:

! PSyclone: Loop cannot be parallelised because the dependency analysis reported:
! Error: The write access to 'zmxlm' in 'zmxlm(idx_2,idx_1)' causes a write-write race condition. Variable: 'zmxlm'.
! Error: The write access to 'zmxld' in 'zmxld(idx_4,idx_3)' causes a write-write race condition. Variable: 'zmxld'.
! Consider using the "ignore_dependencies_for" transformation option if this is a false dependency.
do jj = ntsj - 0, ntej + 0, 1

The reason for it is that in NEMO5 these temporary variables have been reduced to 2D: REAL(wp), DIMENSION(ntsi-(0):ntei+(0),jpk) :: zmxlm, zmxld

and therefore they are being re-used for each iteration of the jj loop and can not be used in parallel. There is 2 solutions for this:

To mark them as thread-private in OpenMP, which we do not support yet but we are considering
To increase the dimensionality of the arrays to: REAL(wp), DIMENSION(ntsj:ntej, ntsi-(0):ntei+(0), jpk) :: zmxlm, zmxld and then use them indexed by jj e.g.: zmxlm(jj,:,:) = rmxl_min so that the dependency disappears.

sergisiso commented 1 month ago

To identify which arrays to privatise/localise/increase_dimensionality, we need to also check that they are not "read" after the loop (in addition to being write-write inside the loop). For this, the .next_access property would be useful. @LonelyCat124 is there a separate issue to track this functionality (or was this just discussed inside the scalarisation and async issues)?

LonelyCat124 commented 1 month ago

I think it was just discussed in those issues, we should probably have a standalone issue

sergisiso commented 1 month ago

Note that there is also threadprivate: https://www.openmp.org/spec-html/5.0/openmpsu104.html#x137-5470002.19.2

To summarise:

private clause: sizeof(array) x num_threads (+1 master thread and host not the same). Stored probably on the stack. Does not persist across omp regions. Does not imply a copy unless using first/lastprivate. Certain restrictions (e.g. part of derived type, meaning of allocatable .. https://www.openmp.org/spec-html/5.0/openmpsu105.html#x138-5520002.19.3)
omp threadprivate: sizeof(array) x num_threads (master thread and host are the same). Stored on the heap or special thread local storage (e.g. CUDA local memory). Persists across omp regions.
Increase dims: sizeof(array) x num_threads/length_dim (impl dependent) (host have access to all), stored on the heap, persists across omp regions.r

Which one is the best is context/compiler/hardware specific. Ideally, we want the psyclone-script writer to choose what they want to do.

Increase dims feels like a transformation: Needs to update the symbol type but also all the accesses to the array, plus not validate if the array is used in a CodeBlock, or used externally (e.g. is public, ...)
omp threadprivate could be a symbol property. It generates a declarative statement. Could be set by the parallel_transformation to when array write-write deps are the only thing preventing the parallelisation. Can be turn on or off my a transformation parameter.
omp private could be done by the directive lowering because other sharing attributes are also inferred there. But we currently don't use dep_analysis there, and it would be good to be able to turn on and off. Maybe a better solution is to have the same implementation as threadprivate with the symbol attribute. We just need to respect the no-persistance and master thread and host are not the same rules when we mark something a psyir symbol threadprivate.

Again, the increase_dims and private are not mutually exclusive because if the dimension is increased and indexed by the outer loop index there will be no write-write dependency (but use more memory). If the dimension is increased by a heap allocted "get_num_threads" we need to tell parallelisation transformation to ignore dependencies in that symbol. But importantly they don't trigger the privatisation anymore.

stfc / PSyclone

Support Parallel Array Privatisation #2671